WO2022033048A1 - Video frame interpolation method, model training method, and corresponding device - Google Patents


Info

Publication number: WO2022033048A1
Application number: PCT/CN2021/085220
Authority: WO (WIPO PCT)
Prior art keywords: video frame, optical flow, mapped, neural network, image
Other languages: French (fr), Chinese (zh)
Inventors: HUANG Zhewei (黄哲威), HENG Wen (衡稳), ZHOU Shuchang (周舒畅)
Original assignee / applicant: Beijing Megvii Technology Co., Ltd. (北京迈格威科技有限公司)
Publication of WO2022033048A1 (en)


Classifications

    • H04N 7/0135: Conversion of standards (analogue or digital television standards processed at pixel level) involving interpolation processes
    • H04N 7/014: Conversion of standards involving interpolation processes and the use of motion vectors
    • G06T 7/248: Analysis of motion using feature-based methods (e.g., tracking of corners or segments) involving reference images or patches
    • G06T 7/269: Analysis of motion using gradient-based methods
    • G06T 2207/10016: Image acquisition modality: video; image sequence
    • G06T 2207/20081: Special algorithmic details: training; learning
    • G06T 2207/20084: Special algorithmic details: artificial neural networks [ANN]

Definitions

  • the present application relates to the technical field of video processing, and in particular, to a video frame insertion method, a model training method, and a corresponding device.
  • Video frame interpolation is a classic task in video processing, which aims to synthesize smoothly transitioning intermediate frames from two given frames of a video.
  • The application scenarios of video frame interpolation include: first, increasing the frame rate displayed by a device, so that users perceive the video as clearer and smoother; second, in video production and editing, assisting in achieving slow-motion effects, or adding intermediate frames between animation key frames to reduce the labor cost of animation production; third, intermediate-frame compression of video, or providing auxiliary data for other computer vision tasks.
  • Optical-flow-based video frame interpolation algorithms have been studied extensively in recent years.
  • The typical way of interpolating frames with this kind of algorithm is to first train an optical flow calculation network and use it to calculate the optical flow between the two given frames; the optical flow between the two frames is then linearly interpolated to obtain the intermediate-frame optical flow, and finally the intermediate frame, i.e., the frame to be inserted between the two frames, is obtained based on the intermediate-frame optical flow.
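
For illustration, here is a minimal sketch of this conventional two-stage pipeline in PyTorch. `flow_net` is a hypothetical pre-trained network returning the optical flow between its two input frames, and `backward_warp` implements the backward mapping used throughout this document via bilinear sampling:

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Backward mapping: sample img at positions displaced by flow.

    img:  (N, C, H, W) source frame
    flow: (N, 2, H, W) per-pixel displacement in pixels (x first, then y)
    """
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    base = torch.stack((xs, ys)).float()      # (2, H, W) pixel coordinates
    coords = base.unsqueeze(0) + flow         # where each output pixel samples from
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0   # normalize to [-1, 1] for grid_sample
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)      # (N, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def naive_interpolation(flow_net, i1, i2, t=0.5):
    """Prior-art pipeline: flow between the two frames + linear interpolation."""
    flow_1to2 = flow_net(i1, i2)              # optical flow from i1 to i2
    flow_mid_to_1 = -t * flow_1to2            # linear-motion approximation
    flow_mid_to_2 = (1.0 - t) * flow_1to2
    mid1 = backward_warp(i1, flow_mid_to_1)
    mid2 = backward_warp(i2, flow_mid_to_2)
    return 0.5 * (mid1 + mid2)                # simple blend as the inserted frame
```
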
  • The purpose of the embodiments of the present application includes providing a video frame interpolation method, a model training method, and corresponding devices, so as to address the above technical problems.
  • An embodiment of the present application provides a video frame interpolation method, including: acquiring a first video frame and a second video frame; based on the first video frame and the second video frame, using a first neural network to calculate the optical flow from a first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, where the first intermediate video frame is a video frame to be inserted between the first video frame and the second video frame; using the optical flow from the first intermediate video frame to the first video frame to perform backward mapping on the first video frame to obtain a first mapped video frame, and/or using the optical flow from the first intermediate video frame to the second video frame to perform backward mapping on the second video frame to obtain a second mapped video frame; and determining the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
  • The first video frame and the second video frame are two frames, one earlier and one later in the video (they may or may not be consecutive). When interpolating, the above method uses the first neural network to calculate the intermediate-frame optical flow (the optical flow from the first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame) directly from the first video frame and the second video frame, rather than deriving it from the optical flow between the first video frame and the second video frame. The intermediate-frame optical flow obtained in this way is more accurate, the image quality of the first intermediate video frame obtained on this basis is better, and ghosting is less likely to appear at the edges of moving objects.
  • the steps of the above method are simple, and the frame insertion efficiency is significantly improved, so that good results can also be achieved when applied to scenarios such as real-time frame insertion and high-definition video frame insertion.
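
Under the same assumptions (and reusing `backward_warp` from the sketch above), the claimed pipeline can be sketched as follows; `first_nn` is a stand-in for the first neural network, assumed here to output the optical flow from the first intermediate video frame to the first video frame:

```python
def interpolate_frame(first_nn, i1, i2):
    # the first neural network computes the intermediate-frame optical flow
    # directly from the two input frames (no i1 -> i2 flow is computed)
    flow_mid_to_1 = first_nn(i1, i2)
    flow_mid_to_2 = -flow_mid_to_1                 # opposite-flow assumption (see below)
    warped1 = backward_warp(i1, flow_mid_to_1)     # first mapped video frame
    warped2 = backward_warp(i2, flow_mid_to_2)     # second mapped video frame
    # determine the first intermediate video frame from the mapped frames;
    # fusion and correction variants are described later in this document
    return 0.5 * (warped1 + warped2)
```
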
  • In one implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and calculating the optical flow from the first intermediate video frame to the first video frame based on the first and second video frames includes: determining the first image input to each optical flow calculation module according to the first video frame, and the second image input to each optical flow calculation module according to the second video frame; and using each optical flow calculation module to perform backward mapping on the first image and the second image input to it, based on the optical flow input to it, to correct that optical flow based on the first mapped image and the second mapped image obtained by the mapping, and to output the corrected optical flow. The optical flow input to the first optical flow calculation module is a preset optical flow between the first video frame and the first intermediate video frame; the optical flow input to each other optical flow calculation module is the optical flow output by the preceding module; and the optical flow output by the last optical flow calculation module is the optical flow from the first intermediate video frame to the first video frame finally calculated by the first neural network.
  • In one implementation, determining the first image input to each optical flow calculation module according to the first video frame, and the second image input to each optical flow calculation module according to the second video frame, includes one of the following: using the first video frame as the first image and the second video frame as the second image input to each optical flow calculation module; or using images obtained by downsampling the first video frame as the first images and images obtained by downsampling the second video frame as the second images, where the two downsampled images input to the same optical flow calculation module have the same shape; or using feature maps output after the first video frame is processed by convolution layers as the first images and feature maps output after the second video frame is processed by convolution layers as the second images, where the two feature maps input to the same optical flow calculation module have the same shape.
  • In one implementation, using the downsampled images as the first and second images input to each optical flow calculation module includes: downsampling the first video frame and the second video frame respectively to form an image pyramid of the first video frame and an image pyramid of the second video frame, where each layer of the image pyramids, starting from the top layer, corresponds to an optical flow calculation module of the first neural network, starting from the first module; and traversing the two image pyramids layer by layer from the top down, taking the two downsampled images in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
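
As an illustrative sketch, such an image pyramid can be built by extracting pixels at intervals (non-convolutional downsampling); the number of levels and the factor-2 step are assumptions:

```python
def build_image_pyramid(img, num_levels=3):
    """Return [top, ..., bottom]; the top layer is the smallest image."""
    levels = [img]
    for _ in range(num_levels - 1):
        img = img[:, :, ::2, ::2]   # extract pixels at intervals (factor-2 downsampling)
        levels.append(img)
    return levels[::-1]             # traverse from the top (coarsest) layer downwards
```
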
  • In one implementation, using the feature maps as the first and second images input to each optical flow calculation module includes: using a first feature extraction network to perform feature extraction on the first video frame and the second video frame respectively, forming a feature pyramid of the first video frame and a feature pyramid of the second video frame, where each layer of the feature pyramids, starting from the top layer, corresponds to an optical flow calculation module of the first neural network, starting from the first module, and the first feature extraction network is a convolutional neural network; and traversing the two feature pyramids layer by layer from the top down, taking the two feature maps in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
  • The input of the optical flow calculation module can thus be the original image (the first video frame or the second video frame), a downsampled version of the original image, or a feature map, which is very flexible.
  • When a feature map is used as the input of the optical flow calculation module, convolution calculations are required, which increases the amount of computation; however, because deeper image features are taken into account in the optical flow calculation, the optical flow calculation results are also more accurate.
  • When the original image or a downsampled original image is used as the input of the optical flow calculation module, no convolution calculation is needed, the amount of computation is small, and the optical flow calculation efficiency is high.
  • When using downsampled images as the input of the optical flow calculation modules, an image pyramid can be constructed from the original images; starting from the top layer of the pyramid (the downsampled images with smaller size and lower precision), the downsampled images are fed layer by layer into the corresponding optical flow calculation modules, so that the optical flow calculation is refined step by step.
  • Likewise, when feature maps are used as the input of the optical flow calculation modules, a feature pyramid can be constructed from the original images; starting from the top layer of the pyramid (the feature maps with smaller size and lower precision), the feature maps are fed layer by layer into the corresponding optical flow calculation modules, again refining the optical flow calculation step by step.
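
A feature pyramid can be sketched analogously with strided convolutions producing progressively smaller feature maps; the channel widths and depth below are illustrative, not taken from the patent:

```python
import torch.nn as nn

class FeaturePyramid(nn.Module):
    """A possible first feature extraction network: strided convolutions
    produce progressively smaller feature maps (built bottom layer first)."""
    def __init__(self, in_ch=3, chs=(16, 32, 64)):
        super().__init__()
        stages, prev = [], in_ch
        for c in chs:
            stages.append(nn.Sequential(nn.Conv2d(prev, c, 3, stride=2, padding=1),
                                        nn.PReLU(c)))
            prev = c
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:   # large maps are extracted first (bottom up)
            x = stage(x)
            feats.append(x)
        return feats[::-1]          # returned with the top (coarsest) layer first
```
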
  • In one implementation, correcting the optical flow input to the optical flow calculation module based on the first and second mapped images obtained by mapping, and outputting the corrected optical flow, includes: using a second neural network to predict an optical flow correction term based on the first mapped image, the second mapped image, and the optical flow input to the module; and then using the optical flow correction term to correct the optical flow input to the module and outputting the corrected optical flow.
  • In another implementation, the correction includes: using the descriptor matching unit, the sub-pixel refinement layer, and the regularization layer of the LiteFlowNet network to correct the optical flow input to the module, based on the first mapped image and the second mapped image, and outputting the corrected optical flow.
  • The above two implementations provide two schemes for correcting the intermediate-frame optical flow: one directly transplants the optical flow correction structure of the LiteFlowNet network, and the other designs a dedicated second neural network for optical flow correction.
  • The second neural network can adopt a simple codec (encoder-decoder) architecture with low computational complexity, which helps complete the optical flow correction quickly.
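
A sketch of one optical flow calculation module under the second-neural-network scheme, reusing `backward_warp` from above; the small convolutional `refine` network is a stand-in for the second neural network (an encoder-decoder as in FIG. 5 could be substituted):

```python
import torch
import torch.nn as nn

class FlowCalcModule(nn.Module):
    """One optical flow calculation module: a small second neural network
    predicts a correction term for the Flow_mid->1 input to the module."""
    def __init__(self, img_ch=3):
        super().__init__()
        # stand-in for the second neural network
        self.refine = nn.Sequential(
            nn.Conv2d(img_ch * 2 + 2, 32, 3, padding=1), nn.PReLU(32),
            nn.Conv2d(32, 2, 3, padding=1))

    def forward(self, j1, j2, flow_mid_to_1):
        w1 = backward_warp(j1, flow_mid_to_1)    # first mapped image
        w2 = backward_warp(j2, -flow_mid_to_1)   # second mapped image (opposite flow)
        flow_res = self.refine(torch.cat([w1, w2, flow_mid_to_1], dim=1))
        return flow_mid_to_1 + flow_res          # corrected optical flow
```
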
  • In one implementation, using the first neural network to calculate both optical flows includes: using the first neural network to calculate the optical flow from the first intermediate video frame to the first video frame, and calculating the optical flow from the first intermediate video frame to the second video frame from it; or using the first neural network to calculate the optical flow from the first intermediate video frame to the second video frame, and calculating the optical flow from the first intermediate video frame to the first video frame from it.
  • Calculating the optical flow from the first intermediate video frame to the second video frame according to the optical flow from the first intermediate video frame to the first video frame includes: taking the opposite of the optical flow from the first intermediate video frame to the first video frame as the optical flow from the first intermediate video frame to the second video frame; the conversion in the other direction is analogous.
  • This implementation assumes that the optical flow from the first intermediate video frame to the first video frame and the optical flow from the first intermediate video frame to the second video frame are mutually opposite optical flows (two optical flows of the same magnitude and opposite directions), which is simple and efficient to compute. If the first video frame and the second video frame are consecutive video frames, or the frame rate of the video is high, this assumption is easy to satisfy, since any motion of objects between the frames can be approximated as an accumulation of many small linear motions.
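
The opposite-flow relation follows from the linear-motion assumption; assuming, for concreteness, insertion at the temporal midpoint, a pixel at position x in the intermediate frame that moves with constant velocity v between the two frames satisfies:

```latex
x_1 = x - \tfrac{1}{2}v, \qquad x_2 = x + \tfrac{1}{2}v
\;\Rightarrow\;
\mathrm{Flow}_{mid\to1}(x) = x_1 - x = -\tfrac{1}{2}v, \qquad
\mathrm{Flow}_{mid\to2}(x) = x_2 - x = +\tfrac{1}{2}v,
```

so Flow_mid→2 = −Flow_mid→1 at every pixel.
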
  • In one implementation, determining the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame includes: modifying the first mapped video frame based on the optical flow from the first intermediate video frame to the first video frame to obtain the first intermediate video frame; or modifying the second mapped video frame based on the optical flow from the first intermediate video frame to the second video frame to obtain the first intermediate video frame; or, based on the optical flow from the first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, modifying a first fused video frame formed by fusing the first mapped video frame and the second mapped video frame, to obtain the first intermediate video frame.
  • In the above implementation, the initially calculated first intermediate video frame (i.e., the first mapped video frame, the second mapped video frame, or the first fused video frame) is modified to improve image quality and the interpolation effect.
  • In one implementation, modifying the first fused video frame formed by fusing the first mapped video frame and the second mapped video frame to obtain the first intermediate video frame includes: based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame, using a third neural network to predict a first image correction term and a first fusion mask; fusing the first mapped video frame and the second mapped video frame into the first fused video frame according to the indication of the pixel values in the first fusion mask; and modifying the first fused video frame with the first image correction term to obtain the first intermediate video frame.
  • a third neural network is designed to learn the method of fusion and correction of video frames, which is beneficial to improve the quality of the finally obtained first intermediate video frame.
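
A sketch of this fuse-then-correct step, with `third_nn` standing in for the third neural network; a soft mask in [0, 1] is used here as a common relaxation of the binary mask described later:

```python
def fuse_and_correct(third_nn, warped1, warped2, flow_mid_to_1):
    """Fuse the two mapped frames with a mask, then apply the correction term."""
    res1, mask1 = third_nn(warped1, warped2, flow_mid_to_1)
    # per-pixel selection/blending of the two mapped video frames
    fusion1 = mask1 * warped1 + (1.0 - mask1) * warped2   # first fused video frame
    syn1 = (fusion1 + res1).clamp(0.0, 1.0)               # corrected intermediate frame
    return syn1
```
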
  • In one implementation, the third neural network includes a second feature extraction network and a codec network, and the codec network includes an encoder and a decoder.
  • Predicting the first image correction term and the first fusion mask based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame then includes: using the second feature extraction network to perform feature extraction on the first video frame and the second video frame respectively; performing backward mapping on the feature maps extracted by the second feature extraction network using the optical flow from the first intermediate video frame to the first video frame; inputting the mapped feature maps, the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame into the encoder for feature extraction; and using the decoder to predict the first image correction term and the first fusion mask from the features extracted by the encoder.
  • the deep-level features (such as edges, textures, etc.) in the original image are extracted by designing the second feature extraction network, and these features are input into the codec network, which is beneficial to improve the effect of image correction.
  • In one implementation, determining the first intermediate video frame according to the first mapped video frame and the second mapped video frame includes: based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame, using a fourth neural network to predict a second fusion mask; and, according to the indication of the pixel values in the second fusion mask, fusing the first mapped video frame and the second mapped video frame into the first intermediate video frame.
  • designing a fourth neural network for learning a method for fusion of video frames is beneficial to improve the quality of the finally obtained first intermediate video frames.
  • An embodiment of the present application provides a model training method, including: acquiring a training sample, where the training sample includes a third video frame, a fourth video frame, and a reference video frame between the third video frame and the fourth video frame; based on the third video frame and the fourth video frame, using the first neural network to calculate the optical flow from a second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame, where the second intermediate video frame is a video frame to be inserted between the third video frame and the fourth video frame; using the optical flow from the second intermediate video frame to the third video frame to perform backward mapping on the third video frame to obtain a third mapped video frame, and/or using the optical flow from the second intermediate video frame to the fourth video frame to perform backward mapping on the fourth video frame to obtain a fourth mapped video frame; determining the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame; and calculating a prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network according to the prediction loss.
  • The above method trains the first neural network used in the video frame interpolation method, so that the network can accurately calculate the intermediate-frame optical flow and improve the interpolation effect.
  • the calculating the prediction loss according to the second intermediate video frame and the reference video frame includes: calculating a first loss according to a difference between the second intermediate video frame and the reference video frame; respectively calculating the image gradient of the second intermediate video frame and the image gradient of the reference video frame, and calculating a second loss according to the difference between the image gradient of the second intermediate video frame and the image gradient of the reference video frame; The predicted loss is calculated based on the first loss and the second loss.
  • Adding a second loss that represents the difference of image gradients to the prediction loss helps alleviate the problem of blurred object edges in the generated second intermediate video frame.
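
A sketch of this combined loss, using finite differences as the image gradient and L1 distances; the weighting is an illustrative hyperparameter:

```python
def image_gradients(img):
    gx = img[..., :, 1:] - img[..., :, :-1]   # horizontal finite differences
    gy = img[..., 1:, :] - img[..., :-1, :]   # vertical finite differences
    return gx, gy

def prediction_loss(pred, ref, w2=1.0):
    first_loss = (pred - ref).abs().mean()    # pixel-wise difference
    pgx, pgy = image_gradients(pred)
    rgx, rgy = image_gradients(ref)
    second_loss = (pgx - rgx).abs().mean() + (pgy - rgy).abs().mean()
    return first_loss + w2 * second_loss      # combined prediction loss
```
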
  • In one implementation, calculating the prediction loss according to the second intermediate video frame and the reference video frame includes: calculating a first loss according to the difference between the second intermediate video frame and the reference video frame; calculating the optical flow from the reference video frame to the third video frame and/or the optical flow from the reference video frame to the fourth video frame using a pre-trained fifth neural network; calculating a third loss according to the difference between the optical flow calculated by the first neural network and the corresponding optical flow calculated by the fifth neural network; and calculating the prediction loss according to the first loss and the third loss.
  • In the above implementation, the optical flow calculated by the pre-trained fifth neural network is used as a label for supervised training of the first neural network, realizing optical flow knowledge transfer (specifically, a third loss is added to the prediction loss). This helps improve the accuracy with which the first neural network predicts the intermediate-frame optical flow, and thereby the quality of the finally obtained first intermediate video frame.
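
A sketch of this knowledge-transfer (third) loss, with a frozen `teacher_net` standing in for the pre-trained fifth neural network and an L1 distance as the illustrative difference measure:

```python
import torch

def optical_flow_distillation_loss(student_flow, teacher_net, ref, i3):
    """Supervise the student's intermediate-frame flow with the flow a
    pre-trained teacher computes from the reference frame to frame 3."""
    with torch.no_grad():                    # fifth neural network stays frozen
        teacher_flow = teacher_net(ref, i3)  # optical flow: reference -> third frame
    return (student_flow - teacher_flow).abs().mean()
```
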
  • In one implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and each optical flow calculation module outputs the optical flow, corrected by that module, from the second intermediate video frame to the third video frame. Calculating the prediction loss according to the second intermediate video frame and the reference video frame then includes: calculating a first loss according to the difference between the second intermediate video frame and the reference video frame; calculating the optical flow from the reference video frame to the third video frame using the pre-trained fifth neural network; calculating a fourth loss according to the differences between the optical flow output by each optical flow calculation module and the optical flow calculated by the fifth neural network; and calculating the prediction loss according to the first loss and the fourth loss.
  • In the above implementation, the optical flow calculated by the pre-trained fifth neural network is likewise used as a label for supervised training of the first neural network, realizing optical flow knowledge transfer (specifically, a fourth loss is added to the prediction loss), which helps improve the accuracy with which the first neural network predicts the intermediate-frame optical flow and thereby the quality of the finally obtained first intermediate video frame.
  • Moreover, since the optical flow calculation result is generated gradually from coarse to fine, a loss can be computed on the output of each optical flow calculation module and accumulated to obtain the fourth loss, which helps adjust the parameters of each optical flow calculation module more precisely and improves the prediction ability of every module.
  • In one implementation, calculating the third loss according to the difference between the optical flow calculated by the first neural network and the corresponding optical flow calculated by the fifth neural network includes: performing backward mapping on the third video frame using the optical flow calculated by the fifth neural network to obtain a fifth mapped video frame; determining, according to the difference between the fifth mapped video frame and the reference video frame, whether the optical flow vector calculated by the fifth neural network at each pixel position is accurate; and calculating the third loss according to the differences between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the second effective optical flow vectors in the corresponding optical flow calculated by the first neural network, where the first effective optical flow vectors are the optical flow vectors calculated accurately by the fifth neural network, and the second effective optical flow vectors are the optical flow vectors, in the corresponding optical flow calculated by the first neural network, located at the pixel positions of the first effective optical flow vectors.
  • In one implementation, calculating the fourth loss according to the differences between the optical flow output by each optical flow calculation module and the optical flow calculated by the fifth neural network includes: performing backward mapping on the third video frame using the optical flow calculated by the fifth neural network to obtain a fifth mapped video frame; determining, according to the difference between the fifth mapped video frame and the reference video frame, whether the optical flow vector calculated by the fifth neural network at each pixel position is accurate; and calculating the fourth loss according to the differences between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the third effective optical flow vectors in the optical flow output by each optical flow calculation module, where the first effective optical flow vectors are the optical flow vectors calculated accurately by the fifth neural network, and the third effective optical flow vectors are the optical flow vectors, in the optical flow output by each optical flow calculation module, located at the pixel positions corresponding to the first effective optical flow vectors.
  • Since the optical flow vectors calculated at some pixel positions may be inaccurate due to boundary ambiguity and occluded areas, instead of using the entire optical flow calculated by the fifth neural network as a label for supervised learning of the first neural network, only the accurately calculated optical flow vectors are used as optical flow labels, as in the above two implementations.
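
A sketch of this per-pixel label filtering, reusing `backward_warp`; the photometric-error threshold is an illustrative hyperparameter:

```python
def masked_flow_loss(student_flow, teacher_flow, i3, ref, thresh=0.05):
    """Keep the teacher's flow as a label only where it is accurate."""
    recon = backward_warp(i3, teacher_flow)               # fifth mapped video frame
    err = (recon - ref).abs().mean(dim=1, keepdim=True)   # per-pixel photometric error
    valid = (err < thresh).float()        # 1 where the teacher's flow vector is accurate
    diff = (student_flow - teacher_flow).abs().sum(dim=1, keepdim=True)
    return (diff * valid).sum() / valid.sum().clamp(min=1.0)
```
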
  • In one implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and each optical flow calculation module corrects the optical flow input to it using the descriptor matching unit, sub-pixel refinement layer, and regularization layer of the LiteFlowNet network. In this case, before calculating the optical flow from the second intermediate video frame to the third video frame based on the third video frame and the fourth video frame using the first neural network, the method further includes: initializing the parameters of the first neural network with parameters obtained by pre-training the LiteFlowNet network.
  • Since the optical flow calculation modules in the first neural network are obtained by structure migration from the LiteFlowNet network, the parameters of the LiteFlowNet network can be loaded directly as their initial values when training the first neural network, and the parameters can then be fine-tuned on this basis. This not only speeds up the convergence of the first neural network but also helps improve its performance.
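
Parameter initialization of the migrated structure can be sketched generically as name-and-shape matching against a checkpoint; the checkpoint path and any naming correspondence are hypothetical:

```python
import torch

def init_from_liteflownet(first_nn, ckpt_path="liteflownet_pretrained.pth"):
    """Load pre-trained parameters whose names and shapes match, then fine-tune."""
    pretrained = torch.load(ckpt_path, map_location="cpu")
    own = first_nn.state_dict()
    # copy every tensor of the migrated structure that matches by name and shape
    matched = {k: v for k, v in pretrained.items()
               if k in own and own[k].shape == v.shape}
    own.update(matched)
    first_nn.load_state_dict(own)
    return first_nn   # fine-tune from these initial values
```
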
  • In one implementation, determining the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame includes: based on the third mapped video frame, the fourth mapped video frame, and the optical flow from the second intermediate video frame to the third video frame, using the third neural network to predict a second image correction term and a third fusion mask; fusing the third mapped video frame and the fourth mapped video frame into a second fused video frame according to the indication of the pixel values in the third fusion mask; and modifying the second fused video frame with the second image correction term to obtain the second intermediate video frame. Calculating the prediction loss according to the second intermediate video frame and the reference video frame and updating the parameters of the first neural network according to the prediction loss then includes: calculating the prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of both the first neural network and the third neural network according to the prediction loss.
  • Since the third neural network is used for image correction when the first neural network performs frame interpolation, the third neural network can be trained together with the first neural network in the model training stage, which helps simplify the training process.
  • In one implementation, determining the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame includes: based on the third mapped video frame, the fourth mapped video frame, and the optical flow from the second intermediate video frame to the third video frame, using the fourth neural network to predict a fourth fusion mask; and fusing the third mapped video frame and the fourth mapped video frame into the second intermediate video frame according to the indication of the pixel values in the fourth fusion mask. Calculating the prediction loss according to the second intermediate video frame and the reference video frame and updating the parameters of the first neural network according to the prediction loss then includes: calculating the prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of both the first neural network and the fourth neural network according to the prediction loss.
  • Since the fourth neural network is used for image fusion when the first neural network performs frame interpolation, the fourth neural network can be trained together with the first neural network in the model training stage, which helps simplify the training process.
  • An embodiment of the present application provides a video frame interpolation device, including: a first video frame obtaining unit, configured to obtain a first video frame and a second video frame; a first optical flow calculation unit, configured to calculate, based on the first video frame and the second video frame and using the first neural network, the optical flow from a first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, where the first intermediate video frame is a video frame to be inserted between the first video frame and the second video frame; a first backward mapping unit, configured to perform backward mapping on the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain a first mapped video frame, and/or to perform backward mapping on the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain a second mapped video frame; and a first intermediate frame determination unit, configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
  • An embodiment of the present application provides a model training apparatus, including: a second video frame obtaining unit, configured to obtain a training sample, where the training sample includes a third video frame, a fourth video frame, and a reference video frame between the third video frame and the fourth video frame; a second optical flow calculation unit, configured to calculate, based on the third video frame and the fourth video frame and using the first neural network, the optical flow from a second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame, where the second intermediate video frame is a video frame to be inserted between the third video frame and the fourth video frame; a second backward mapping unit, configured to perform backward mapping on the third video frame using the optical flow from the second intermediate video frame to the third video frame to obtain a third mapped video frame, and/or to perform backward mapping on the fourth video frame using the optical flow from the second intermediate video frame to the fourth video frame to obtain a fourth mapped video frame; and a second intermediate frame determination unit, configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame.
  • The embodiments of the present application provide a computer-readable storage medium on which computer program instructions are stored; when the computer program instructions are read and run by a processor, the video frame interpolation method provided in the embodiments of the present application is executed.
  • An embodiment of the present application provides an electronic device, including a memory and a processor, where computer program instructions are stored in the memory; when the computer program instructions are read and run by the processor, the video frame interpolation method provided in the embodiments of the present application is executed.
  • Fig. 1 shows a possible flow of the video frame insertion method provided by the embodiment of the present application
  • Fig. 2 shows a possible network architecture of the video frame insertion method provided by the embodiment of the present application
  • FIG. 3 shows a possible structure of the first neural network provided by the embodiment of the present application
  • FIG. 4 shows a method for constructing a first image and a second image by a feature pyramid
  • FIG. 5 shows a possible structure of the second neural network provided by the embodiment of the present application.
  • FIG. 6 shows a possible structure of the third neural network provided by the embodiment of the present application.
  • FIG. 7 shows a possible process of the model training method provided by the embodiment of the present application.
  • FIG. 8 shows a possible network architecture of the model training method provided by the embodiment of the present application.
  • FIG. 9 shows a possible structure of a video frame insertion device provided by an embodiment of the present application.
  • FIG. 10 shows another possible structure of the video frame insertion device provided by the embodiment of the present application.
  • FIG. 11 shows a possible structure of the electronic device provided by the embodiment of the present application.
  • FIG. 1 shows a possible flow of the video frame insertion method provided by the embodiment of the present application
  • FIG. 2 shows a network architecture that can be used in the method, for reference when describing the video frame insertion method.
  • the method in FIG. 1 may be performed by, but is not limited to, the electronic device shown in FIG. 11 .
  • the method includes:
  • Step S110 Acquire the first video frame and the second video frame.
  • The first video frame and the second video frame are two frames, one before and one after the position where a frame is to be inserted; they may or may not be consecutive. Apart from this temporal relationship, the present application does not limit the selection of the first video frame and the second video frame.
  • For convenience of description, the first video frame is denoted as I1 and the second video frame as I2.
  • Step S120 Based on the first video frame and the second video frame, use the first neural network to calculate the optical flow from the first intermediate video frame to the first video frame and the optical flow from the first intermediate video frame to the second video frame.
  • The first intermediate video frame is a video frame to be inserted between I1 and I2.
  • The application does not limit the insertion position of the first intermediate video frame; for example, it may or may not be inserted at the temporal midpoint between I1 and I2.
  • The first intermediate video frame is denoted as I_syn1.
  • The core of frame interpolation is obtaining I_syn1; once I_syn1 is obtained, inserting it into the video is straightforward.
  • The solution of the present application obtains I_syn1 based on the optical flow of the first intermediate video frame.
  • The optical flow of the first intermediate video frame includes the optical flow from the first intermediate video frame to the first video frame, denoted Flow_mid→1, and the optical flow from the first intermediate video frame to the second video frame, denoted Flow_mid→2.
  • In one implementation, I1 and I2 may be input to the first neural network, and Flow_mid→1 and Flow_mid→2 may each be predicted by the first neural network.
  • In another implementation, Flow_mid→1 may be calculated using the first neural network, and Flow_mid→2 converted from Flow_mid→1, as shown in FIG. 2 (Flow_mid→2 is not shown in the figure).
  • Likewise, it is also possible to use the first neural network to calculate Flow_mid→2 and to convert Flow_mid→1 from Flow_mid→2.
  • In the latter implementations, the two required optical flows can be obtained with only one optical flow calculation by the first neural network, which significantly improves the efficiency of the optical flow calculation.
  • For example, it may be considered that Flow_mid→1 and Flow_mid→2 are mutually opposite optical flows; once one of them is obtained, the other can be calculated by taking its opposite.
  • FIG. 3 shows the structure of a first neural network that can calculate Flow_mid→1.
  • The first neural network includes at least one optical flow calculation module connected in sequence (three optical flow calculation modules are shown in the figure). Each optical flow calculation module corrects the optical flow input to it and outputs the corrected optical flow.
  • The optical flow input to the first optical flow calculation module is a preset Flow_mid→1. Since no optical flow calculation has been performed at this point, the preset optical flow can take a default value, such as zero (meaning that all optical flow vectors contained in the optical flow are zero).
  • After the first optical flow calculation module corrects the preset Flow_mid→1, it outputs the correction result, which can be regarded as the Flow_mid→1 calculated by the first optical flow calculation module.
  • Each subsequent optical flow calculation module corrects the Flow_mid→1 output by the preceding module and outputs its own correction result, which can be regarded as the Flow_mid→1 calculated by that module.
  • For the last optical flow calculation module, the output Flow_mid→1 is the optical flow finally calculated by the first neural network. It can be seen that within the first neural network, the calculation result of Flow_mid→1 is continuously revised from coarse to fine, finally yielding a relatively accurate optical flow.
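
Putting the pieces together, the first neural network's forward pass can be sketched as a coarse-to-fine loop over pyramid levels, reusing `FlowCalcModule` and `build_image_pyramid` from the sketches above; resizing and doubling the flow between levels is a common pyramid convention, not specified by the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstNeuralNetwork(nn.Module):
    """Chains optical flow calculation modules over an image pyramid."""
    def __init__(self, num_levels=3):
        super().__init__()
        self.flow_modules = nn.ModuleList(FlowCalcModule()
                                          for _ in range(num_levels))

    def forward(self, i1, i2):
        pyr1 = build_image_pyramid(i1, len(self.flow_modules))  # top (coarsest) first
        pyr2 = build_image_pyramid(i2, len(self.flow_modules))
        # preset optical flow input to the first module: all zeros
        flow = torch.zeros_like(pyr1[0][:, :2])
        for k, module in enumerate(self.flow_modules):
            if k > 0:
                # carry the previous estimate to the finer level; doubling the
                # displacement magnitudes matches the factor-2 resolution step
                flow = 2.0 * F.interpolate(flow, size=pyr1[k].shape[-2:],
                                           mode="bilinear", align_corners=False)
            flow = module(pyr1[k], pyr2[k], flow)   # corrected Flow_mid->1
        return flow
```
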
  • Each optical flow calculation module has a similar structure, as shown on the left side of Figure 3.
  • Besides the optical flow, the input of an optical flow calculation module also includes a first image and a second image, denoted J1 and J2 respectively for convenience of explanation; the J1 and J2 input to different optical flow calculation modules are not necessarily the same.
  • J1 is determined according to I1 and J2 according to I2, which may specifically include, but is not limited to, one of the following ways:
  • Mode (1): take I1 directly as J1 and I2 directly as J2, and input I1 and I2 to each optical flow calculation module. Mode (1) requires no extra computation to prepare the module inputs, which helps improve the efficiency of the optical flow calculation.
  • Mode (2): take a feature map output after I1 is processed by convolution layers as J1, and a feature map output after I2 is processed by convolution layers as J2. Since I1 and I2 can yield multiple feature maps of different scales after being processed by multiple convolution layers, different optical flow calculation modules can receive feature maps of different scales, but the J1 and J2 input to the same optical flow calculation module have the same shape.
  • Mode (2) requires convolution calculations to prepare the module inputs, which increases the amount of computation; however, because deeper image features are taken into account in the optical flow calculation, the optical flow calculation result is also more accurate.
  • In one implementation, a first feature extraction network may be used to extract features from I1 and I2 respectively, forming a feature pyramid of I1 and a feature pyramid of I2, where the first feature extraction network is a convolutional neural network, each layer of the feature pyramids, starting from the top layer, corresponds to an optical flow calculation module of the first neural network, starting from the first module, and the feature maps in the same layer of the two pyramids have the same shape.
  • For example, in FIG. 4, the first (top) layer corresponds to optical flow calculation module 1, the second layer to optical flow calculation module 2, and the third layer to optical flow calculation module 3.
  • Each layer of a feature pyramid is a feature map; the feature maps of the i-th layers of the pyramids of I1 and I2 are denoted F1-i and F2-i respectively, and F1-i and F2-i have the same shape.
  • In the feature pyramid, the top layer corresponds to feature maps with smaller size and lower precision, and the bottom layer to feature maps with larger size and higher precision. Feeding the feature maps layer by layer, starting from the top, into the corresponding optical flow calculation modules allows the optical flow calculation to be refined gradually. Note that, owing to the nature of convolutional neural networks, large feature maps are extracted first and small ones later; that is, the feature pyramid is constructed from the bottom layer up.
  • Since I1 and I2 can themselves be regarded as special feature maps, it is not excluded in mode (2) to use I1 and I2 as the J1 and J2 of the first optical flow calculation module.
  • Mode (3): take an image obtained by downsampling I1 as J1 and an image obtained by downsampling I2 as J2, where the two downsampled images input to the same optical flow calculation module have the same shape.
  • Convolution can also be regarded as a form of downsampling to a certain extent, but the downsampling in mode (3) should be understood as excluding downsampling through convolution; for example, pixels may be extracted directly from the original image at intervals determined by the downsampling factor.
  • In one implementation, I1 and I2 may be downsampled to form an image pyramid of I1 and an image pyramid of I2, where each layer of the image pyramids, starting from the top layer, corresponds to an optical flow calculation module of the first neural network, starting from the first module.
  • The structure of an image pyramid is similar to that of a feature pyramid, except that the pyramid is built from downsampled original images (of I1 or I2) rather than from feature maps.
  • In the image pyramid, the top layer corresponds to a downsampled image with smaller size and lower precision, and the bottom layer to a downsampled image with larger size and higher precision.
  • The downsampled images are input layer by layer, starting from the top, into the corresponding optical flow calculation modules, which allows the optical flow calculation to be refined gradually.
  • Large downsampled images are generated first and small ones later; that is, the image pyramid is also constructed from the bottom layer up.
  • Since I1 and I2 can themselves be regarded as special downsampled images (with a downsampling factor of 1), it is not excluded in mode (3) to use I1 and I2 as the J1 and J2 of the first optical flow calculation module.
  • Within each optical flow calculation module, backward mapping (backward warp) is first performed on the J1 input to the module, based on the Flow_mid→1 input to the module, to obtain a first mapped image, denoted Ĵ1, i.e., Ĵ1 = warp(J1, Flow_mid→1); backward mapping is likewise performed on the J2 input to the module to obtain a second mapped image, denoted Ĵ2 (per the opposite-flow assumption above, Ĵ2 = warp(J2, −Flow_mid→1)).
  • The optical flow calculation module includes an optical flow correction module, which takes the Flow_mid→1 input to the optical flow calculation module together with Ĵ1 and Ĵ2 as input, corrects Flow_mid→1, and outputs the corrected Flow_mid→1; this is also the output of the optical flow calculation module.
  • Two implementations of the optical flow correction module are listed below; it can be understood that the optical flow correction module can also adopt other implementations:
  • Implementation 1: based on Ĵ1, Ĵ2 and the Flow_mid→1 input to the module, a second neural network is used to predict an optical flow correction term Flow_res; the Flow_mid→1 input to the optical flow calculation module is then added to Flow_res (either by direct addition or by weighted summation) to obtain the corrected Flow_mid→1.
  • the second neural network can adopt a relatively simple network structure, so as to reduce the amount of calculation and improve the efficiency of optical flow correction, thus speeding up the calculation speed of the optical flow by the optical flow calculation module.
  • the second neural network may employ an encoder-decoder network, and FIG. 5 shows a possible structure of the second neural network.
  • the left part of the network (R1 to R4) is the encoder and the right part (D1 to D4) is the decoder.
  • The three items of data (Ĵ1, Ĵ2 and Flow_mid→1) are concatenated and input to R1.
  • The features extracted by each encoding module except R4 are also fed into the decoder and added to the output of the corresponding decoding module.
  • The features extracted by R4 are output directly to D4, and D1 outputs the optical flow correction term Flow_res predicted by the second neural network.
  • In addition, the intermediate outputs of the second neural network (the outputs of the convolution and deconvolution layers) can be batch-normalized, with PReLU as the nonlinear activation function. It can be understood that FIG. 5 is only an example, and the second neural network can also adopt other structures.
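
In that spirit, here is a sketch of an encoder-decoder second neural network with four encoding and four decoding modules, additive skip connections, batch normalization, and PReLU; the channel width is illustrative, and the input height and width are assumed divisible by 16:

```python
import torch
import torch.nn as nn

class SecondNeuralNetwork(nn.Module):
    """Encoder-decoder predicting the optical flow correction term Flow_res."""
    def __init__(self, in_ch=3 * 2 + 2, ch=32):
        super().__init__()
        def enc(ci, co):   # encoding module (R1..R4): downsample by 2
            return nn.Sequential(nn.Conv2d(ci, co, 3, stride=2, padding=1),
                                 nn.BatchNorm2d(co), nn.PReLU(co))
        def dec(ci, co):   # decoding module (D4..D2): upsample by 2
            return nn.Sequential(nn.ConvTranspose2d(ci, co, 4, stride=2, padding=1),
                                 nn.BatchNorm2d(co), nn.PReLU(co))
        self.r1, self.r2 = enc(in_ch, ch), enc(ch, ch)
        self.r3, self.r4 = enc(ch, ch), enc(ch, ch)
        self.d4, self.d3, self.d2 = dec(ch, ch), dec(ch, ch), dec(ch, ch)
        self.d1 = nn.ConvTranspose2d(ch, 2, 4, stride=2, padding=1)  # outputs Flow_res

    def forward(self, w1, w2, flow):
        x = torch.cat([w1, w2, flow], dim=1)   # splice the three inputs for R1
        e1 = self.r1(x)
        e2 = self.r2(e1)
        e3 = self.r3(e2)
        e4 = self.r4(e3)
        # skip connections: encoder outputs added to matching decoder outputs
        y = self.d4(e4) + e3
        y = self.d3(y) + e2
        y = self.d2(y) + e1
        return self.d1(y)                      # optical flow correction term
```
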
  • The LiteFlowNet network is an existing network that can be used for optical flow calculation, but it can only calculate the optical flow between two given frames, such as the optical flow Flow_1→2 from the first video frame to the second video frame; it cannot be used directly to calculate the intermediate-frame optical flow Flow_mid→1.
  • In the NetE part of the LiteFlowNet network there is a structure similar to the optical flow correction module, called the optical flow inference module. This structure can be roughly divided into three parts: a descriptor matching unit, a sub-pixel refinement unit, and a regularization unit.
  • Implementation 2: the optical flow inference module can be migrated directly to serve as the optical flow correction module of the present application, but the input of each part needs to be modified to a certain extent:
  • The input of the descriptor matching unit becomes Ĵ1, Ĵ2 and the uncorrected Flow_mid→1, and a matching cost volume between Ĵ1 and Ĵ2 is calculated inside the descriptor matching unit.
  • The four items of information (Ĵ1, Ĵ2, the uncorrected Flow_mid→1, and the calculated matching cost volume) are input to the convolutional neural network inside the descriptor matching unit, which finally outputs the Flow_mid→1 calculated by the descriptor matching unit.
  • The matching cost volume is used to measure the degree of coincidence between the mapped images Ĵ1 and Ĵ2.
  • The input of the sub-pixel refinement layer becomes Ĵ1, Ĵ2 and the Flow_mid→1 output by the descriptor matching unit; the sub-pixel refinement layer corrects the input Flow_mid→1 at sub-pixel accuracy and outputs the corrected Flow_mid→1.
  • The input of the regularization layer becomes Ĵ1, Ĵ2 and the Flow_mid→1 output by the sub-pixel refinement layer; the regularization layer smooths the input Flow_mid→1 and outputs the corrected Flow_mid→1, which is the output of the optical flow correction module.
  • In addition, the NetC part of the LiteFlowNet network constructs a feature pyramid, so its convolution layers can also be migrated to the present solution as the first feature extraction network, used to extract the J1 and J2 that are input to the optical flow calculation modules.
  • This approach effectively reuses existing optical flow research results; however, since the LiteFlowNet network contains more operators, its operation is more complicated.
  • Step S130: performing backward mapping on the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain the first mapped video frame, and performing backward mapping on the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain the second mapped video frame.
  • Specifically, Flow_mid→1 may be used to perform backward mapping on I1 to obtain the first mapped video frame, denoted Î1, i.e., Î1 = warp(I1, Flow_mid→1); and Flow_mid→2 may be used to perform backward mapping on I2 to obtain the second mapped video frame, denoted Î2, i.e., Î2 = warp(I2, Flow_mid→2), as shown in FIG. 2.
  • Step S140 Determine the first intermediate video frame according to the first mapped video frame and the second mapped video frame.
  • In one implementation, Î1 and Î2 may be fused to obtain the first fused video frame, denoted I_fusion1; I_fusion1 is then corrected according to Flow_mid→1 and/or Flow_mid→2, and the corrected image is used as I_syn1, which helps improve the image quality of I_syn1 and the interpolation effect.
  • Of course, I_fusion1 may also be corrected according to only Flow_mid→1 or only Flow_mid→2.
  • The frame fusion and image correction processes above can be performed in sequence; for example, Î1 and Î2 may first be averaged to obtain I_fusion1, and a neural network may then be designed to correct I_fusion1.
  • the process of frame fusion and image correction can also be implemented based on a neural network, that is, using the neural network to learn the methods of video frame fusion and image correction at the same time, as shown in Figure 2.
  • Specifically, Î1, Î2 and Flow_mid→1 are first input to the third neural network, which predicts the first image correction term and the first fusion mask, denoted I_res1 and mask1 respectively.
  • For example, each pixel value in mask1 may take only 0 or 1: a pixel value of 0 at a position indicates that the pixel value of I_fusion1 at that position is taken from one of the mapped video frames (say Î1), and a pixel value of 1 indicates that it is taken from the other (Î2).
  • Finally, I_fusion1 is corrected with I_res1 to obtain I_syn1.
  • In one implementation, the third neural network includes a second feature extraction network and a codec network. Its working principle is as follows: first, the second feature extraction network performs feature extraction on I1 and I2 respectively; then Flow_mid→1 is used to perform backward mapping on the feature maps extracted by the second feature extraction network; the mapped feature maps, Î1, Î2, and Flow_mid→1 are then input to the encoder of the codec network for feature extraction; finally, the decoder of the codec network predicts I_res1 and mask1 from the features extracted by the encoder.
  • Figure 6 shows an implementation of the third neural network in accordance with the above description.
  • The left part of the network (C1 to C3) is the second feature extraction network, and the right part is the codec network; the main structure of the codec network is similar to that in FIG. 5 and is not repeated here.
  • Each Ci represents one or more convolution layers, so the second feature extraction network constructs two 3-layer feature pyramids.
  • Using Flow_mid→1, the feature maps F1-i and F2-i are respectively backward mapped, and the resulting mapped feature maps are denoted warp(F1-i) and warp(F2-i).
  • warp(F1-i) and warp(F2-i) are concatenated with the output of encoding module Ri to form the input of encoding module Ri+1.
  • FIG. 6 is only an example, and the third neural network can also adopt other structures.
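
A sketch of a third neural network in the spirit of FIG. 6, reusing `backward_warp`: C1 to C3 build two small feature pyramids, the features are backward mapped with a correspondingly resized Flow_mid→1, and each encoder stage consumes the warped features of one level; all widths, depths, and the soft (sigmoid) mask are illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThirdNeuralNetwork(nn.Module):
    """Predicts the image correction term I_res1 and the fusion mask mask1."""
    def __init__(self, ch=32):
        super().__init__()
        conv = lambda ci, co: nn.Sequential(
            nn.Conv2d(ci, co, 3, stride=2, padding=1), nn.PReLU(co))
        # C1..C3: second feature extraction network (two 3-layer feature pyramids)
        self.c1, self.c2, self.c3 = conv(3, ch), conv(ch, ch), conv(ch, ch)
        # encoder: later stages also receive the two warped feature maps
        self.r1 = conv(3 * 2 + 2, ch)
        self.r2, self.r3 = conv(ch * 3, ch), conv(ch * 3, ch)
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(ch * 3, 4, 3, padding=1))  # 3 ch for I_res1 + 1 ch for mask1

    def pyramid(self, img):
        f1 = self.c1(img)
        f2 = self.c2(f1)
        f3 = self.c3(f2)
        return [f1, f2, f3]

    @staticmethod
    def resize_flow(flow, like):
        scale = like.shape[-1] / flow.shape[-1]   # rescale displacement magnitudes
        return F.interpolate(flow, size=like.shape[-2:], mode="bilinear",
                             align_corners=False) * scale

    def forward(self, i1, i2, w1, w2, flow):
        fs1, fs2 = self.pyramid(i1), self.pyramid(i2)
        x = self.r1(torch.cat([w1, w2, flow], dim=1))   # 1/2 resolution
        for i, stage in ((0, self.r2), (1, self.r3)):
            f = self.resize_flow(flow, fs1[i])
            x = stage(torch.cat([x, backward_warp(fs1[i], f),
                                 backward_warp(fs2[i], f)], dim=1))
        f = self.resize_flow(flow, fs1[2])
        x = torch.cat([x, backward_warp(fs1[2], f), backward_warp(fs2[2], f)], dim=1)
        out = self.head(x)                              # back to full resolution
        res1 = out[:, :3]
        mask1 = torch.sigmoid(out[:, 3:])   # soft mask in [0, 1]; text's example is 0/1
        return res1, mask1
```
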
  • the deep-level features (such as edges, textures, etc.) in the original image are extracted by designing the second feature extraction network, and these features are input into the codec network, which is beneficial to improve the effect of image correction.
  • In the above scheme, I res1 and mask1 are predicted by the third neural network, but in other implementations the scheme can be simplified: first, warp(I 1 ), warp(I 2 ) and Flow mid→1 are input to the fourth neural network; the fourth neural network is then used to predict the second fusion mask, denoted mask2; finally, according to the indication of the pixel values in mask2, warp(I 1 ) and warp(I 2 ) are directly fused into I syn1 .
  • These implementations do not need to calculate I res1 , so the calculation process is simpler, and the fourth neural network can also focus on learning the fusion mask.
  • the design of the fourth neural network may refer to the third neural network, which will not be described in detail here.
  • Step A1: Obtain the first video frame and the second video frame;
  • Step A2: Based on the first video frame and the second video frame, use the first neural network to calculate the optical flow from the first intermediate video frame to the first video frame;
  • Step A3: Use the optical flow from the first intermediate video frame to the first video frame to perform backward mapping on the first video frame to obtain the first mapped video frame;
  • Step A4: Determine the first intermediate video frame according to the first mapped video frame.
  • the first mapped video frame may be directly used as the first intermediate video frame; the first mapped video frame may also be modified based on the optical flow from the first intermediate video frame to the first video frame , to obtain the first intermediate video frame.
  • a neural network can be designed to modify the first mapped video frame.
  • The structure of this neural network can refer to the third neural network, but since it does not involve video frame fusion, the neural network only needs to predict the image correction term.
  • For the details of steps A1 to A4, reference may be made to steps S110 to S140; they will not be described again.
  • Step B1: Obtain the first video frame and the second video frame;
  • Step B2: Based on the first video frame and the second video frame, use the first neural network to calculate the optical flow from the first intermediate video frame to the second video frame;
  • Step B3: Use the optical flow from the first intermediate video frame to the second video frame to perform backward mapping on the second video frame to obtain the second mapped video frame;
  • Step B4: Determine the first intermediate video frame according to the second mapped video frame. The second mapped video frame may be directly used as the first intermediate video frame; alternatively, the second mapped video frame may be modified based on the optical flow from the first intermediate video frame to the second video frame to obtain the first intermediate video frame.
  • For the details of steps B1 to B4, reference may be made to steps S110 to S140; they will not be described again.
  • When performing video frame insertion, the above frame insertion method directly calculates the intermediate-frame optical flow (referring to the optical flow from the first intermediate video frame to the first video frame and/or from the first intermediate video frame to the second video frame) based on the first video frame and the second video frame using the first neural network, without using the optical flow between the first video frame and the second video frame to synthesize the intermediate-frame optical flow.
  • The intermediate-frame optical flow thus obtained is highly accurate, the first intermediate video frame obtained on this basis has good image quality, and ghosting is not easily produced at the edges of moving objects.
  • the steps of the above method are simple, and the frame insertion efficiency is significantly improved, so that good results can also be achieved when applied to scenarios such as real-time frame insertion and high-definition video frame insertion.
  • Compared with backward mapping, forward mapping needs to solve the fusion problem when multiple points are mapped to the same position, and current hardware support for forward mapping is insufficient, so this application mainly takes backward mapping as an example; however, this is not intended to preclude forward-mapping schemes.
  • FIG. 7 shows a possible flow of the model training method provided by the embodiment of the present application; the method can be used to train the first neural network used in the video frame insertion method of FIG. 1.
  • Figure 8 shows a network architecture that can be used in the method for reference when describing the model training method.
  • the method in FIG. 7 may be performed by, but is not limited to, the electronic device shown in FIG. 11 , and for the structure of the electronic device, reference may be made to the following description about FIG. 11 .
  • the method includes:
  • Step S210: Obtain training samples. The training set consists of multiple training samples, and each training sample is used in a similar manner during training; therefore, any one training sample can be taken as an example to illustrate the training process.
  • Each training sample may include 3 video frames, namely the third video frame, the fourth video frame, and the reference video frame located between the third video frame and the fourth video frame, and these 3 video frames are denoted as I 3 respectively , I 4 and I mid , as shown in FIG. 8 .
  • the video frame to be inserted between I 3 and I 4 is the second intermediate video frame, denoted as I syn2
  • I mid corresponds to I syn2 , representing the real video frame at the position of I syn2 (that is, the ground truth of the intermediate frame) .
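Each training sample is therefore a frame triplet. As a concrete illustration (hypothetical, not from the patent), a minimal PyTorch-style dataset over a decoded frame sequence might look like this; the class name and the `frames` variable are assumptions:

```python
from torch.utils.data import Dataset

class TripletDataset(Dataset):
    """`frames` is assumed to be a list of decoded, equally spaced
    video frames stored as (C, H, W) tensors."""

    def __init__(self, frames):
        self.frames = frames

    def __len__(self):
        return max(0, len(self.frames) - 2)

    def __getitem__(self, idx):
        i3 = self.frames[idx]         # third video frame
        i_mid = self.frames[idx + 1]  # reference (ground-truth) frame
        i4 = self.frames[idx + 2]     # fourth video frame
        return i3, i4, i_mid
```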
  • Step S220 Based on the third video frame and the fourth video frame, use the first neural network to calculate the optical flow from the second intermediate video frame to the third video frame and the optical flow from the second intermediate video frame to the fourth video frame.
  • Similar to step S120, the optical flow from the second intermediate video frame to the third video frame and the optical flow from the second intermediate video frame to the fourth video frame are denoted as Flow mid→3 and Flow mid→4 , respectively.
  • Step S230 performing backward mapping on the third video frame by using the optical flow from the second intermediate video frame to the third video frame to obtain a third mapped video frame, and using the optical flow from the second intermediate video frame to the fourth video frame Perform backward mapping on the fourth video frame to obtain a fourth mapped video frame.
  • Flow mid→3 can be used to perform backward mapping on I 3 to obtain the third mapped video frame, denoted warp(I 3 ), and Flow mid→4 can be used to perform backward mapping on I 4 to obtain the fourth mapped video frame, denoted warp(I 4 ), as shown in Figure 8.
  • Step S240 Determine the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame.
  • The above scheme can also be simplified: first, warp(I 3 ), warp(I 4 ) and Flow mid→3 are input to the fourth neural network; the fourth neural network is then used to predict the fourth fusion mask, denoted mask4; finally, according to the indication of the pixel values in mask4, warp(I 3 ) and warp(I 4 ) are directly fused into I syn2 .
  • Alternatively, image correction may be omitted; for example, warp(I 3 ) and warp(I 4 ) may be directly averaged to obtain I syn2 .
  • Step S250 Calculate the prediction loss according to the second intermediate video frame and the reference video frame, and update the parameters of the first neural network according to the prediction loss.
  • In computing I syn2 , the first neural network is necessarily used; therefore, after the prediction loss is calculated, the parameters of the first neural network can be updated using the back-propagation algorithm.
  • If the third neural network is used in step S240, then in step S250 the parameters of the third neural network are also updated together; that is, the third neural network and the first neural network are trained together, which can simplify the training process.
  • Likewise, if the fourth neural network is used in step S240, then in step S250 the parameters of the fourth neural network are also updated together; that is, the fourth neural network and the first neural network are trained together.
  • steps S210 to S250 are iteratively executed, and the training is terminated when the training termination condition (eg, model convergence, etc.) is satisfied.
  • the prediction loss can be uniformly expressed by the following formula:
  • Loss sum = Loss l1 + α·Loss sobel + β·Loss epe + γ·Loss multiscale-epe
  • Loss sum is the total prediction loss; there are four losses on the right-hand side, namely the first loss Loss l1 , the second loss Loss sobel , the third loss Loss epe and the fourth loss Loss multiscale-epe . The first loss is the basic loss and must be included when calculating the prediction loss.
  • The other three losses are optional: depending on the implementation, one or more of them may be added, or none at all, noting that the third loss and the fourth loss cannot be added at the same time.
  • α, β and γ are weighting coefficients, used as hyperparameters of the network. It should be understood that other loss terms may also be added to the right-hand side of the equation. Each loss is described in detail below:
  • the first loss is calculated according to the difference between I syn2 and I mid , and the purpose of setting the first loss is to make I syn2 closer to I mid through learning, that is, to make the image quality of the intermediate frame better.
  • the difference between I syn2 and I mid can be defined as the pixel-by-pixel distance between the two, for example, when using the L1 distance:
  • Loss l1 = Σ i Σ j |I syn2 (i, j) − I mid (i, j)|
  • where i and j together represent a pixel position.
  • the second loss is calculated according to the difference between the image gradient of I syn2 and the image gradient of I mid .
  • the purpose of setting the second loss is to improve the problem of blurred object edges of the generated I syn2 through learning (the image gradient corresponds to the edge in the image information).
  • the image gradient can be calculated by applying a gradient operator to the image, such as Sobel operator, Roberts operator, Prewitt operator, etc.
  • The difference between the image gradient of I syn2 and the image gradient of I mid can be defined as the pixel-by-pixel distance between the two. For example, when using the Sobel operator and the L1 distance:
  • Loss sobel = Σ i Σ j |Sobel(I syn2 )(i, j) − Sobel(I mid )(i, j)|
  • where Sobel(·) represents using the Sobel operator to calculate the image gradient of an image.
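A sketch of the first and second losses under these definitions is given below, assuming single-channel images for brevity and treating the weight alpha as a hyperparameter; this is illustrative, not the patent's code:

```python
import torch
import torch.nn.functional as F

# 3x3 Sobel kernels, shaped (out_ch, in_ch, kH, kW).
SOBEL_X = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]])
SOBEL_Y = SOBEL_X.transpose(2, 3)

def sobel_grad(img):
    # Image gradient via the Sobel operator (one map per direction).
    gx = F.conv2d(img, SOBEL_X.to(img), padding=1)
    gy = F.conv2d(img, SOBEL_Y.to(img), padding=1)
    return torch.cat([gx, gy], dim=1)

def interpolation_loss(i_syn2, i_mid, alpha=1.0):
    loss_l1 = (i_syn2 - i_mid).abs().sum()                             # first loss
    loss_sobel = (sobel_grad(i_syn2) - sobel_grad(i_mid)).abs().sum()  # second loss
    return loss_l1 + alpha * loss_sobel
```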
  • optical flow labels can be set to perform supervised training of the first neural network.
  • The third loss is calculated from the difference between the Flow mid→3 calculated by the first neural network and the corresponding optical flow calculated by a pre-trained fifth neural network; the purpose of setting the third loss is to improve, through learning, the accuracy of the Flow mid→3 calculated by the first neural network.
  • This loss realizes optical flow knowledge transfer from the fifth neural network to the first neural network.
  • The difference between Flow mid→3 and the optical flow calculated by the fifth neural network (written Flow* mid→3 below for illustration) can be defined as the distance between the optical flow vectors contained in the two, for example the L2 distance, expressed as follows:
  • Loss epe = Σ i Σ j ‖Flow mid→3 (i, j) − Flow* mid→3 (i, j)‖ 2
  • The first neural network includes at least one optical flow calculation module (see FIG. 3 for its structure); each optical flow calculation module outputs a version of Flow mid→3 corrected by that module, so that Flow mid→3 is calculated from coarse to fine.
  • each optical flow calculation module can be supervised by using the optical flow label to improve the optical flow calculation capability of each optical flow calculation module.
  • Specifically, for each optical flow calculation module, a loss is calculated from the difference between the Flow mid→3 it outputs and the optical flow calculated by the fifth neural network (following the calculation of the third loss), and these losses are then accumulated to obtain the fourth loss.
  • The formula for calculating the fourth loss is as follows (with Flow k mid→3 denoting the optical flow output by the k-th optical flow calculation module, and the fifth network's flow resized to each module's resolution where necessary):
  • Loss multiscale-epe = Σ k=1..n Σ i Σ j ‖Flow k mid→3 (i, j) − Flow* mid→3 (i, j)‖ 2
  • where n represents the total number of optical flow calculation modules.
  • Like the third loss, the fourth loss realizes optical flow knowledge transfer from the fifth neural network to the first neural network, and calculating the fourth loss helps adjust the parameters of each optical flow calculation module more precisely, though the fourth loss is computationally more complex.
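Illustratively, the fourth loss can be sketched as an end-point-error term accumulated over all optical flow calculation modules; resizing the fifth network's flow to each module's resolution (and rescaling its magnitude accordingly) is an assumption about the multiscale handling:

```python
import torch.nn.functional as F

def multiscale_epe_loss(flows_per_module, flow_teacher):
    """`flows_per_module`: list of (N, 2, H_k, W_k) student flows,
    `flow_teacher`: (N, 2, H, W) flow from the fifth neural network."""
    loss = 0.0
    for flow_k in flows_per_module:
        target = flow_teacher
        if target.shape[-2:] != flow_k.shape[-2:]:
            # Match the teacher flow to this module's resolution, rescaling
            # displacement magnitudes along with the spatial size.
            scale = flow_k.shape[-1] / target.shape[-1]
            target = F.interpolate(target, size=flow_k.shape[-2:],
                                   mode="bilinear", align_corners=False) * scale
        # Per-pixel L2 norm of the flow difference, summed over all pixels.
        loss = loss + (flow_k - target).pow(2).sum(dim=1).sqrt().sum()
    return loss
```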
  • The inventor's long-term research has found that when the fifth neural network performs optical flow calculation, the optical flow vectors calculated at some pixel positions may be inaccurate due to boundary ambiguity and occluded areas.
  • In this case, not every optical flow vector is used as a label for supervised learning of the first neural network; only the more accurately calculated optical flow vectors are used as optical flow labels.
  • the specific method is as follows:
  • The optical flow calculated by the fifth neural network is first used to backward map the third video frame, obtaining a fifth mapped video frame; the difference between this mapped frame and I mid then determines whether the optical flow vector calculated by the fifth neural network at each pixel position is accurate. For example, the mean L1 distance between the fifth mapped video frame and I mid can be computed at each pixel (the mean is taken because a video frame may be a multi-channel image); if the mean L1 distance at a pixel is greater than a certain threshold, the optical flow vector calculated by the fifth neural network at that pixel position is considered inaccurate; otherwise it is considered accurate.
  • An accurately calculated optical flow vector may be called a first effective optical flow vector. Experience shows that first effective optical flow vectors account for the vast majority of the optical flow vectors calculated by the fifth neural network, because the fifth neural network is equivalent to calculating the intermediate-frame optical flow with the intermediate frame already known, so its accuracy is largely guaranteed.
  • the third loss or the fourth loss is calculated according to the first effective optical flow vector in the optical flow calculated by the fifth neural network:
  • When calculating the third loss, it is calculated from the difference between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the second effective optical flow vectors in the Flow mid→3 calculated by the first neural network; here, a second effective optical flow vector refers to the optical flow vector located, in the Flow mid→3 calculated by the first neural network, at the pixel position corresponding to a first effective optical flow vector.
  • For example, if the optical flow vector at position (1, 1) in the optical flow calculated by the fifth neural network is a first effective optical flow vector, then the optical flow vector at position (1, 1) in the Flow mid→3 calculated by the first neural network is a second effective optical flow vector.
  • Similarly, when calculating the fourth loss, it is calculated from the difference between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the third effective optical flow vectors in the Flow mid→3 output by each optical flow calculation module of the first neural network (the differences are calculated separately for each module and then accumulated).
  • Here, a third effective optical flow vector refers to the optical flow vector located, in the Flow mid→3 output by each optical flow calculation module, at the pixel position corresponding to a first effective optical flow vector.
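A sketch of this selection follows, reusing the backward_warp helper from the earlier sketch; the threshold value is an assumption:

```python
def effective_flow_mask(i3, i_mid, flow_teacher, threshold=0.04):
    # Fifth mapped video frame: I3 backward-mapped with the teacher flow.
    warped = backward_warp(i3, flow_teacher)
    # Per-pixel mean L1 distance across channels, shape (N, 1, H, W).
    err = (warped - i_mid).abs().mean(dim=1, keepdim=True)
    return (err <= threshold).float()  # 1 where the teacher flow is accurate

def masked_epe_loss(flow_student, flow_teacher, mask):
    # Supervise only the pixels whose teacher flow vectors are "effective".
    epe = (flow_student - flow_teacher).pow(2).sum(dim=1, keepdim=True).sqrt()
    return (epe * mask).sum()
```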
  • In some implementations, the optical flow calculation modules in the first neural network are obtained by structure migration from the LiteFlownet network (that is, in step S220, each optical flow calculation module uses the descriptor matching unit, the sub-pixel correction layer and the regularization layer migrated from the LiteFlownet network to correct the optical flow input to that module).
  • the parameters obtained by the pre-training of the LiteFlownet network can be directly loaded as the initial values of its parameters, and on this basis, the parameters can be fine-tuned.
  • This transfer learning approach can not only accelerate the convergence of the first neural network but also improve its performance.
  • For example, the LiteFlownet network may be pre-trained on the FlyingChairs dataset, although pre-training is not limited to that dataset.
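As a hedged illustration of such initialization in PyTorch, one might load only the parameter entries whose names and shapes match the migrated submodules before fine-tuning end to end; the checkpoint path and key layout below are placeholders, not the real LiteFlowNet release format:

```python
import torch

def load_pretrained_liteflownet(first_net, ckpt_path="liteflownet.pth"):
    # Assumes the checkpoint stores a plain state dict.
    pretrained = torch.load(ckpt_path, map_location="cpu")
    own = first_net.state_dict()
    # Keep only entries whose names and shapes match the migrated submodules.
    compatible = {k: v for k, v in pretrained.items()
                  if k in own and v.shape == own[k].shape}
    own.update(compatible)
    first_net.load_state_dict(own)
```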
  • In the scheme above, the second intermediate video frame is generated by fusing the third mapped video frame and the fourth mapped video frame (possibly with correction), but there are also schemes in which the second intermediate video frame is obtained directly from the third mapped video frame or the fourth mapped video frame alone (again possibly with correction).
  • The specific steps of these schemes are as follows:
  • Step C1 obtaining training samples, the training samples include the third video frame, the fourth video frame and the reference video frame;
  • Step C2 based on the third video frame and the fourth video frame, use the first neural network to calculate the optical flow from the second intermediate video frame to the third video frame;
  • Step C3 using the optical flow from the second intermediate video frame to the third video frame to perform backward mapping on the third video frame to obtain the third mapped video frame;
  • Step C4 determining the second intermediate video frame according to the third mapped video frame
  • Step C5 Calculate the prediction loss according to the second intermediate video frame and the reference video frame, and update the parameters of the first neural network according to the prediction loss.
  • Step D1 obtaining training samples, the training samples include a third video frame, a fourth video frame and a reference video frame;
  • Step D2 based on the third video frame and the fourth video frame, use the first neural network to calculate the optical flow from the second intermediate video frame to the fourth video frame;
  • Step D3 using the optical flow from the second intermediate video frame to the fourth video frame to perform backward mapping on the fourth video frame to obtain the fourth mapped video frame;
  • Step D4 determining the second intermediate video frame according to the fourth mapped video frame
  • Step D5 Calculate the prediction loss according to the second intermediate video frame and the reference video frame, and update the parameters of the first neural network according to the prediction loss.
  • If a neural network is used to modify the fourth mapped video frame in step D4, then in step D5 that neural network can have its parameters updated together with the first neural network.
  • For the details of steps D1 to D5, reference may be made to steps S210 to S250; they will not be described again.
  • When the third loss or the fourth loss is used, the calculation result of the fifth neural network should correspond to that of the first neural network. For example, if the first neural network calculates the optical flow from the second intermediate video frame to the third video frame (scheme C), the fifth neural network should calculate the optical flow between the third video frame and the reference video frame based on those two frames; if the first neural network calculates the optical flow from the second intermediate video frame to the fourth video frame (scheme D), the fifth neural network should calculate the optical flow between the fourth video frame and the reference video frame based on those two frames.
  • The video frame insertion apparatus 300 provided by the embodiment of the present application includes: a first video frame obtaining unit 310, configured to obtain the first video frame and the second video frame;
  • a first optical flow calculation unit 320 configured to use a first neural network to calculate an optical flow and/or an optical flow from a first intermediate video frame to the first video frame based on the first video frame and the second video frame The optical flow from the first intermediate video frame to the second video frame; wherein, the first intermediate video frame is a video frame to be inserted between the first video frame and the second video frame;
  • a first backward mapping unit 330 configured to perform backward mapping on the first video frame by using the optical flow from the first intermediate video frame to the first video frame to obtain a first mapped video frame, and/or , using the optical flow from the first intermediate video frame to the second video frame to perform backward mapping on the second video frame to obtain a second mapped video frame;
  • a first intermediate frame determining unit 340 configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
  • the first neural network includes at least one optical flow calculation module connected in sequence
  • In an implementation, the first optical flow calculation unit 320, based on the first video frame and the second video frame, uses the first neural network to calculate the optical flow from the first intermediate video frame to the first video frame by: determining the first image input to each optical flow calculation module according to the first video frame, and determining the second image input to each optical flow calculation module according to the second video frame; and using each optical flow calculation module to perform backward mapping on the first image and the second image input to that module based on the optical flow input to that module, to correct the optical flow input to that module based on the first mapped image and the second mapped image obtained by the mapping, and to output the corrected optical flow; wherein the optical flow input to the first optical flow calculation module is the preset optical flow between the first video frame and the first intermediate video frame, the optical flow input to each other optical flow calculation module is the optical flow output by the previous optical flow calculation module, and the optical flow output by the last optical flow calculation module is the optical flow from the first intermediate video frame to the first video frame calculated by the first neural network.
  • In an implementation, the first optical flow calculation unit 320 determines the first image input to each optical flow calculation module according to the first video frame, and determines the second image input to each optical flow calculation module according to the second video frame, in one of the following ways: using the first video frame as the first image input to each optical flow calculation module and the second video frame as the second image input to each optical flow calculation module; or using the image obtained by down-sampling the first video frame as the first image input to each optical flow calculation module and the image obtained by down-sampling the second video frame as the second image input to each optical flow calculation module, wherein the two down-sampled images input to the same optical flow calculation module have the same shape; or using the feature map output after the first video frame is processed by a convolutional layer as the first image input to each optical flow calculation module and the feature map output after the second video frame is processed by a convolutional layer as the second image input to each optical flow calculation module, wherein the two feature maps input to the same optical flow calculation module have the same shape.
  • In an implementation, the first optical flow calculation unit 320 uses the images obtained by down-sampling the first video frame and the second video frame as the first images and second images input to the optical flow calculation modules by: down-sampling the first video frame and the second video frame respectively to form an image pyramid of the first video frame and an image pyramid of the second video frame, where each layer of the image pyramids, starting from the top layer, corresponds to an optical flow calculation module of the first neural network starting from the first optical flow calculation module; and traversing the two image pyramids layer by layer from the top, using the two down-sampled images in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
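A sketch of building such matched image pyramids follows; the number of levels and the 0.5 scale factor are assumptions:

```python
import torch.nn.functional as F

def image_pyramid(frame, levels=3):
    """Return [coarsest, ..., finest], so index k feeds module k+1."""
    pyramid = [frame]
    for _ in range(levels - 1):
        pyramid.append(F.interpolate(pyramid[-1], scale_factor=0.5,
                                     mode="bilinear", align_corners=False))
    return pyramid[::-1]  # top (smallest) layer first

# p1, p2 = image_pyramid(i1), image_pyramid(i2)
# Module k receives (p1[k], p2[k]) together with the flow from module k-1,
# upsampled to the current resolution (an assumed detail of the coarse-to-fine scheme).
```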
  • In an implementation, the first optical flow calculation unit 320 uses the feature maps output after the first video frame and the second video frame are processed by convolutional layers as the first images and second images input to the optical flow calculation modules by: using the first feature extraction network to perform feature extraction on the first video frame and the second video frame respectively, forming a feature pyramid of the first video frame and a feature pyramid of the second video frame, where each layer starting from the top layer of the feature pyramids corresponds to an optical flow calculation module of the first neural network starting from the first optical flow calculation module, and the first feature extraction network is a convolutional neural network; and traversing the two feature pyramids layer by layer from the top, using the two feature maps in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
  • In an implementation, the first optical flow calculation unit 320 corrects the optical flow input to an optical flow calculation module based on the first mapped image and the second mapped image obtained by mapping, and outputs the corrected optical flow, by: using the second neural network to predict an optical flow correction term based on the first mapped image, the second mapped image and the optical flow input to that module; and using the optical flow correction term to correct the optical flow input to that module, outputting the corrected optical flow.
  • In an implementation, the first optical flow calculation unit 320 corrects the optical flow input to an optical flow calculation module and outputs the corrected optical flow by: using the descriptor matching unit, the sub-pixel correction layer and the regularization layer in the LiteFlownet network to correct the optical flow input to that module based on the first mapped image and the second mapped image obtained by mapping, and outputting the corrected optical flow.
  • In an implementation, the first optical flow calculation unit 320 uses the first neural network to calculate, based on the first video frame and the second video frame, the optical flow from the first intermediate video frame to the first video frame and the optical flow from the first intermediate video frame to the second video frame by: using the first neural network to calculate the optical flow from the first intermediate video frame to the first video frame, and calculating the optical flow from the first intermediate video frame to the second video frame from it; or using the first neural network to calculate the optical flow from the first intermediate video frame to the second video frame, and calculating the optical flow from the first intermediate video frame to the first video frame from it.
  • In an implementation, the first optical flow calculation unit 320 calculates the optical flow from the first intermediate video frame to the second video frame according to the optical flow from the first intermediate video frame to the first video frame by inverting the optical flow from the first intermediate video frame to the first video frame; likewise, it calculates the optical flow from the first intermediate video frame to the first video frame according to the optical flow from the first intermediate video frame to the second video frame by inverting the optical flow from the first intermediate video frame to the second video frame.
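In tensor terms this inversion is simply a negation, under the implicit assumption of approximately symmetric motion about the intermediate frame; the variable names are assumptions:

```python
flow_mid_to_2 = -flow_mid_to_1
# or, equivalently, in the opposite direction:
# flow_mid_to_1 = -flow_mid_to_2
```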
  • the first intermediate frame determining unit 340 determines the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame, including: based on The optical flow from the first intermediate video frame to the first video frame modifies the first mapped video frame to obtain the first intermediate video frame; or, based on the first intermediate video frame to the first video frame The optical flow of the second video frame modifies the second mapped video frame to obtain the first intermediate video frame; or, based on the optical flow from the first intermediate video frame to the first video frame and/or The optical flow from the first intermediate video frame to the second video frame is modified by modifying the first fused video frame formed after the fusion of the first mapped video frame and the second mapped video frame, to obtain the The first intermediate video frame.
  • In an implementation, the first intermediate frame determining unit 340 modifies, based on the optical flow from the first intermediate video frame to the first video frame, the first fused video frame formed after the fusion of the first mapped video frame and the second mapped video frame, to obtain the first intermediate video frame, by: using the third neural network to predict the first image correction term and the first fusion mask based on the first mapped video frame, the second mapped video frame and the optical flow from the first intermediate video frame to the first video frame; fusing, according to the indication of the pixel values in the first fusion mask, the first mapped video frame and the second mapped video frame into the first fused video frame; and modifying the first fused video frame with the first image correction term to obtain the first intermediate video frame.
  • the third neural network includes a second feature extraction network and a codec network
  • the codec network includes an encoder and a decoder
  • In an implementation, the first intermediate frame determining unit 340 uses the third neural network to predict the first image correction term and the first fusion mask by: using the second feature extraction network to perform feature extraction on the first video frame and the second video frame respectively; using the optical flow from the first intermediate video frame to the first video frame to perform backward mapping on the feature maps extracted by the second feature extraction network; inputting the mapped feature maps obtained by the mapping, the first mapped video frame, the second mapped video frame and the optical flow from the first intermediate video frame to the first video frame to the encoder for feature extraction; and using the decoder to predict the first image correction term and the first fusion mask according to the features extracted by the encoder.
  • For the video frame insertion apparatus 300 provided by the embodiments of the present application, the implementation principle and resulting technical effects have been introduced in the foregoing method embodiments; for parts not mentioned in the device embodiment, reference may be made to the corresponding content in the method embodiments.
  • FIG. 10 shows a functional block diagram of a model training apparatus 400 provided by an embodiment of the present application.
  • the model training device 400 includes:
  • a second video frame obtaining unit 410 configured to obtain training samples, where the training samples include a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
  • a second optical flow calculation unit 420, configured to use the first neural network to calculate, based on the third video frame and the fourth video frame, the optical flow from the second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame; wherein the second intermediate video frame is a video frame to be inserted between the third video frame and the fourth video frame;
  • a second backward mapping unit 430 configured to perform backward mapping on the third video frame by using the optical flow from the second intermediate video frame to the third video frame to obtain a third mapped video frame, and/or , using the optical flow from the second intermediate video frame to the fourth video frame to perform backward mapping on the fourth video frame to obtain a fourth mapped video frame;
  • a second intermediate frame determining unit 440 configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame;
  • a parameter updating unit 450 configured to calculate a prediction loss according to the second intermediate video frame and the reference video frame, and update parameters of the first neural network according to the prediction loss.
  • In an implementation, the parameter updating unit 450 calculates the prediction loss according to the second intermediate video frame and the reference video frame by: calculating the first loss according to the difference between the second intermediate video frame and the reference video frame; calculating the image gradient of the second intermediate video frame and the image gradient of the reference video frame respectively, and calculating the second loss according to the difference between the two image gradients; and calculating the prediction loss according to the first loss and the second loss.
  • In an implementation, the parameter updating unit 450 calculates the prediction loss according to the second intermediate video frame and the reference video frame by: calculating the first loss according to the difference between the second intermediate video frame and the reference video frame; using the pre-trained fifth neural network to calculate the optical flow from the reference video frame to the third video frame and/or the optical flow from the reference video frame to the fourth video frame; calculating the third loss according to the difference between the optical flow calculated by the first neural network and the corresponding optical flow calculated by the fifth neural network; and calculating the prediction loss according to the first loss and the third loss.
  • In an implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and each optical flow calculation module outputs the optical flow from the second intermediate video frame to the third video frame as corrected by that module; the parameter updating unit 450 calculates the prediction loss according to the second intermediate video frame and the reference video frame by: calculating the first loss according to the difference between the second intermediate video frame and the reference video frame; using the pre-trained fifth neural network to calculate the optical flow from the reference video frame to the third video frame; calculating the fourth loss according to the difference between the optical flow output by each optical flow calculation module and the optical flow calculated by the fifth neural network; and calculating the prediction loss according to the first loss and the fourth loss.
  • In an implementation, the parameter updating unit 450 calculates the third loss according to the difference between the optical flow calculated by the first neural network and the corresponding optical flow calculated by the fifth neural network by: using the optical flow calculated by the fifth neural network to perform backward mapping on the third video frame to obtain a fifth mapped video frame; determining, according to the difference between the fifth mapped video frame and the reference video frame, whether the optical flow vector at each pixel position calculated by the fifth neural network is accurate; and calculating the third loss according to the difference between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the corresponding second effective optical flow vectors in the optical flow calculated by the first neural network; wherein a first effective optical flow vector refers to an accurate optical flow vector calculated by the fifth neural network, and a second effective optical flow vector refers to the optical flow vector located, in the corresponding optical flow calculated by the first neural network, at the pixel position corresponding to a first effective optical flow vector.
  • In an implementation, the parameter updating unit 450 calculates the fourth loss according to the difference between the optical flow output by each optical flow calculation module and the optical flow calculated by the fifth neural network by: using the optical flow calculated by the fifth neural network to perform backward mapping on the third video frame to obtain a fifth mapped video frame; determining, according to the difference between the fifth mapped video frame and the reference video frame, whether the optical flow vector at each pixel position calculated by the fifth neural network is accurate; and calculating the fourth loss according to the difference between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the third effective optical flow vectors in the optical flow output by each optical flow calculation module; wherein a first effective optical flow vector refers to an accurate optical flow vector calculated by the fifth neural network, and a third effective optical flow vector refers to the optical flow vector located, in the optical flow output by each optical flow calculation module, at the pixel position corresponding to a first effective optical flow vector.
  • In an implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and each optical flow calculation module uses the descriptor matching unit, the sub-pixel correction layer and the regularization layer in the LiteFlownet network to correct the optical flow input to that module;
  • the device further includes a parameter initialization unit, configured to initialize the parameters of the first neural network with the parameters obtained by pre-training the LiteFlownet network, before the second optical flow calculation unit 420 uses the first neural network to calculate, based on the third video frame and the fourth video frame, the optical flow from the second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame.
  • In an implementation, the second intermediate frame determining unit 440 determines the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame by: using the third neural network to predict the second image correction term and the third fusion mask based on the third mapped video frame, the fourth mapped video frame and the optical flow from the second intermediate video frame to the third video frame; fusing, according to the indication of the pixel values in the third fusion mask, the third mapped video frame and the fourth mapped video frame into the second fused video frame; and modifying the second fused video frame with the second image correction term to obtain the second intermediate video frame.
  • Correspondingly, the parameter updating unit 450 calculates the prediction loss according to the second intermediate video frame and the reference video frame and updates, according to the prediction loss, the parameters of both the first neural network and the third neural network.
  • In an implementation, the second intermediate frame determining unit 440 determines the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame by: using the fourth neural network to predict the fourth fusion mask based on the third mapped video frame, the fourth mapped video frame and the optical flow from the second intermediate video frame to the third video frame; and fusing, according to the indication of the pixel values in the fourth fusion mask, the third mapped video frame and the fourth mapped video frame into the second intermediate video frame.
  • Correspondingly, the parameter updating unit 450 calculates the prediction loss according to the second intermediate video frame and the reference video frame and updates, according to the prediction loss, the parameters of both the first neural network and the fourth neural network.
  • For the model training apparatus 400 provided by the embodiments of the present application, the implementation principle and resulting technical effects have been introduced in the foregoing method embodiments; for parts not mentioned in the device embodiment, reference may be made to the corresponding content in the method embodiments.
  • the embodiment of the present application also provides a video frame insertion device, including:
  • a third video frame obtaining unit used for obtaining the first video frame and the second video frame
  • a third optical flow calculation unit, configured to use the first neural network to calculate, based on the first video frame and the second video frame, the optical flow from the first video frame to the first intermediate video frame and/or the optical flow from the second video frame to the first intermediate video frame;
  • a first forward mapping unit configured to perform forward mapping on the first video frame by using the optical flow from the first video frame to the first intermediate video frame to obtain a first mapped video frame, and/or, Using the optical flow from the second video frame to the first intermediate video frame to perform forward mapping on the second video frame to obtain a second mapped video frame;
  • a third intermediate frame determining unit configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
  • the above-mentioned video frame insertion device is similar to the video frame insertion device 300, and the difference mainly includes the use of forward mapping to replace the backward mapping in the video frame insertion device 300.
  • For the various possible implementations of this video frame insertion device, reference may also be made to the video frame insertion apparatus 300; they will not be repeated here.
  • the embodiment of the present application also provides a model training device, including:
  • a fourth video frame obtaining unit configured to obtain a training sample, the training sample includes a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
  • a fourth optical flow calculation unit configured to use the first neural network to calculate the optical flow from the third video frame to the second intermediate video frame and/or the optical flow based on the third video frame and the fourth video frame The optical flow from the fourth video frame to the second intermediate video frame; wherein, the second intermediate video frame is the video frame to be inserted between the third video frame and the fourth video frame;
  • a second forward mapping unit, configured to perform forward mapping on the third video frame using the optical flow from the third video frame to the second intermediate video frame to obtain a third mapped video frame, and/or perform forward mapping on the fourth video frame using the optical flow from the fourth video frame to the second intermediate video frame to obtain a fourth mapped video frame;
  • a third intermediate frame determining unit configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame
  • a second parameter updating unit configured to calculate a prediction loss according to the second intermediate video frame and the reference video frame, and update the parameters of the first neural network according to the prediction loss.
  • This model training device is similar to the model training device 400; the difference mainly lies in using forward mapping in place of the backward mapping in the model training device 400.
  • For its various possible implementations, reference may also be made to the model training device 400; they will not be repeated here.
  • FIG. 11 shows a possible structure of the electronic device 500 provided by the embodiment of the present application.
  • an electronic device 500 includes a processor 510, a memory 520, and a communication interface 530, and these components are interconnected and communicate with each other through a communication bus 540 and/or other forms of connection mechanisms (not shown).
  • The memory 520 includes one or more memories (only one is shown in the figure), which may be, but are not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and so on.
  • the processor 510 and possibly other components may access the memory 520, read and/or write data therein.
  • The processor 510 includes one or more processors (only one is shown in the figure), which may be an integrated circuit chip with signal processing capability.
  • The processor 510 may be a general-purpose processor, including a central processing unit (CPU), a micro controller unit (MCU), a network processor (NP) or another conventional processor; it may also be a special-purpose processor, including a graphics processing unit (GPU), a neural-network processing unit (NPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. Moreover, when there are multiple processors 510, some of them may be general-purpose processors and the others special-purpose processors.
  • the communication interface 530 includes one or more (only one is shown in the figure), which can be used to communicate directly or indirectly with other devices for data exchange.
  • Communication interface 530 may include an interface for wired and/or wireless communication.
  • One or more computer program instructions may be stored in the memory 520, and the processor 510 may read and execute these computer program instructions to implement the video frame insertion method and/or the model training method provided by the embodiments of the present application.
  • the structure shown in FIG. 11 is only for illustration, and the electronic device 500 may further include more or less components than those shown in FIG. 11 , or have different configurations from those shown in FIG. 11 .
  • Each component shown in FIG. 11 may be implemented in hardware, software, or a combination thereof.
  • the electronic device 500 may be a physical device, such as a PC, a notebook computer, a tablet computer, a mobile phone, a server, an embedded device, etc., or a virtual device, such as a virtual machine, a virtualized container, and the like.
  • the electronic device 500 is not limited to a single device, and may be a combination of a plurality of devices or a cluster composed of a large number of devices.
  • Embodiments of the present application further provide a computer-readable storage medium on which computer program instructions are stored; when the computer program instructions are read and executed by a processor of a computer, the video frame insertion method and/or the model training method provided by the embodiments of the present application are performed.
  • For example, the computer-readable storage medium may be implemented as the memory 520 in the electronic device 500 in FIG. 11.
  • In summary, the present application provides a video frame insertion method, a model training method and corresponding devices, which are applied in video processing to achieve a better frame interpolation effect with a less time-consuming interpolation process.

Abstract

A video frame interpolation method, a model training method, and a corresponding device. The video frame interpolation method comprises: acquiring a first video frame and a second video frame; utilizing, on the basis of the first video frame and the second video frame, a first neural network to calculate an optical flow between the first video frame and a first intermediate video frame and/or an optical flow between the second video frame and the first intermediate video frame; utilizing the optical flow between the first video frame and the first intermediate video frame to reverse map the first video frame to acquire a first mapped video frame, and/or, utilizing the optical flow between the second video frame and the first intermediate video frame to reverse map the second video frame to acquire a second mapped video frame; and determining the first intermediate video frame on the basis of the first mapped video frame and/or of the second mapped video frame. In the method, the accuracy of calculating the optical flow of an intermediate frame is increased; therefore, the image quality of the first intermediate video frame ultimately acquired is improved, and the efficiency of frame interpolation using the method is increased.

Description

Video frame insertion method, model training method and corresponding device
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 202010815538.3, entitled "Video Frame Insertion Method, Model Training Method and Corresponding Device", filed with the China Patent Office on August 13, 2020, the entire contents of which are incorporated by reference in this application.
TECHNICAL FIELD
The present application relates to the technical field of video processing, and in particular, to a video frame insertion method, a model training method, and a corresponding device.
BACKGROUND
Video frame interpolation is a classic task in video processing, which aims to synthesize a smoothly transitioning intermediate frame from two frames, one before and one after, in a video. Application scenarios of video frame interpolation include: first, increasing the video frame rate displayed by a device, making the video appear clearer and smoother; second, in video production and editing, helping to realize slow-motion effects, or adding intermediate frames between animation key frames to reduce the labor cost of animation production; third, intermediate-frame compression of video, or providing auxiliary data for other computer vision tasks.
Optical-flow-based video frame interpolation algorithms have been studied extensively in recent years. A typical way of interpolating with such algorithms is: first train an optical flow network and use it to calculate the optical flow between the preceding and following frames, then linearly interpolate this optical flow to obtain the intermediate-frame optical flow, and finally obtain the intermediate frame, i.e., the frame to be inserted between the two, from the intermediate-frame optical flow. However, because the intermediate-frame optical flow is synthesized from the optical flow between the preceding and following frames, the edges of moving objects in the resulting intermediate frame are prone to ghosting, the interpolation effect is poor, and the steps of existing algorithms are relatively complicated, making the interpolation process time-consuming.
SUMMARY OF THE INVENTION
The purposes of the embodiments of the present application include providing a video frame insertion method, a model training method and corresponding devices, so as to improve the above technical problems.
To achieve the above purpose, the present application provides the following technical solutions:
An embodiment of the present application provides a video frame insertion method, including: obtaining a first video frame and a second video frame; based on the first video frame and the second video frame, using a first neural network to calculate the optical flow from a first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, wherein the first intermediate video frame is a video frame to be inserted between the first video frame and the second video frame; performing backward mapping on the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain a first mapped video frame, and/or performing backward mapping on the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain a second mapped video frame; and determining the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
The first video frame and the second video frame are two frames, one before the other, in a video (they may or may not be consecutive). When inserting frames, the above method directly calculates the intermediate-frame optical flow (the optical flow from the first intermediate video frame to the first video frame and/or from the first intermediate video frame to the second video frame) based on the first video frame and the second video frame using the first neural network, without using the optical flow between the first video frame and the second video frame. The intermediate-frame optical flow thus obtained is more accurate, the first intermediate video frame obtained on this basis has better image quality, and ghosting is less likely at the edges of moving objects. In addition, the method's steps are simple and the frame insertion efficiency is significantly improved, so good results can also be achieved in scenarios such as real-time frame insertion and high-definition video frame insertion.
In one implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and computing the optical flow from the first intermediate video frame to the first video frame based on the first and second video frames includes: determining, according to the first video frame, a first image input to each optical flow calculation module, and determining, according to the second video frame, a second image input to each optical flow calculation module; using each optical flow calculation module to backward-map the first image and the second image input to it based on the optical flow input to it, to correct the input optical flow based on the first mapped image and the second mapped image thus obtained, and to output the corrected optical flow. The optical flow input to the first optical flow calculation module is a preset optical flow between the first video frame and the first intermediate video frame, the optical flow input to every other module is the optical flow output by the preceding module, and the optical flow output by the last module is the optical flow from the first intermediate video frame to the first video frame computed by the first neural network.

In the above implementation, by arranging at least one optical flow calculation module in the first neural network, the intermediate-frame optical flow estimate is corrected step by step, so that an accurate intermediate-frame optical flow is finally obtained.
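A minimal sketch of this refinement loop follows, reusing the `backward_warp` helper above; it assumes all inputs share one resolution (resizing of the flow between scales is omitted) and warps the second image with the negated flow, per the opposite-flow relation described below.

```python
import torch

def estimate_intermediate_flow(modules, images1, images2, flow_shape):
    """Run the cascade of optical flow calculation modules.

    modules: the flow calculation modules, first to last
    images1: the first image J_1 fed to each module
    images2: the second image J_2 fed to each module
    """
    flow = torch.zeros(flow_shape)                     # preset (zero) flow
    for module, j1, j2 in zip(modules, images1, images2):
        w1 = backward_warp(j1, flow)                   # first mapped image
        w2 = backward_warp(j2, -flow)                  # second mapped image
        flow = module(w1, w2, flow)                    # corrected flow out
    return flow                                        # last module's output
```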
In one implementation, determining the first image and the second image input to each optical flow calculation module according to the first and second video frames includes one of the following: using the first video frame as the first image and the second video frame as the second image of every module; or using images obtained by downsampling the first video frame as the first images and images obtained by downsampling the second video frame as the second images, where the two downsampled images input to the same module have the same shape; or using feature maps output after the first video frame is processed by convolutional layers as the first images and feature maps output after the second video frame is processed by convolutional layers as the second images, where the two feature maps input to the same module have the same shape.

In one implementation, using the downsampled images as inputs includes: downsampling the first video frame and the second video frame separately to form an image pyramid for each, where each level of a pyramid, counted from the top, corresponds to one optical flow calculation module of the first neural network, counted from the first; and traversing the two pyramids level by level from the top down, using the two downsampled images at the same level as the first image and the second image input to the module corresponding to that level.

In one implementation, using the feature maps as inputs includes: performing feature extraction on the first video frame and the second video frame separately with a first feature extraction network, which is a convolutional neural network, to form a feature pyramid for each, where each level of a pyramid, counted from the top, corresponds to one optical flow calculation module of the first neural network, counted from the first; and traversing the two pyramids level by level from the top down, using the two feature maps at the same level as the first image and the second image input to the module corresponding to that level.

In the above three implementations, the input of an optical flow calculation module can be the original images (the first and second video frames), downsampled versions of them, or feature maps, which is very flexible. Using feature maps as input requires convolution and is computationally more expensive, but because deeper features of the images are taken into account, the optical flow result is also more accurate. Using the original images or their downsampled versions requires no convolution, so the computation cost is lower and the optical flow is computed more efficiently.

When downsampled images are used as input, an image pyramid can first be built from the original frames, and the downsampled images are fed level by level into the corresponding modules starting from the top of the pyramid (the smaller, lower-precision images), so that the optical flow estimate is refined progressively. Similarly, when feature maps are used as input, a feature pyramid can first be built from the original frames, and the feature maps are fed level by level into the corresponding modules starting from the top of the pyramid (the smaller, lower-precision feature maps), again refining the estimate progressively.
In one implementation, correcting the input optical flow based on the first and second mapped images and outputting the corrected flow includes: predicting an optical flow correction term with a second neural network, based on the first mapped image, the second mapped image, and the optical flow input to the module; correcting the input optical flow with the correction term; and outputting the corrected optical flow.

In one implementation, correcting the input optical flow based on the first and second mapped images and outputting the corrected flow includes: using the descriptor matching unit, the sub-pixel refinement layer, and the regularization layer of the LiteFlownet network to correct the optical flow input to the module and output the corrected flow.

The above two implementations give two schemes for correcting the intermediate-frame optical flow: the first directly transplants the flow-correction structure of the LiteFlownet network, and the second designs a dedicated second neural network. For example, the second neural network can adopt a simple encoder-decoder architecture, whose small computation cost helps complete the flow correction quickly.
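For illustration, a second neural network of this kind might be a small encoder-decoder predicting a flow correction term; the sketch below is an assumption, not the patented architecture: the layer widths are arbitrary and the input height and width are assumed divisible by 4.

```python
import torch
import torch.nn as nn

class FlowCorrection(nn.Module):
    """Illustrative second neural network: a small encoder-decoder that
    predicts a flow correction term from the two mapped images and the
    current flow estimate (channel counts are arbitrary assumptions)."""

    def __init__(self, ch):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2 * ch + 2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),  # flow residual
        )

    def forward(self, w1, w2, flow):
        x = torch.cat([w1, w2, flow], dim=1)    # mapped images + current flow
        return flow + self.decoder(self.encoder(x))       # corrected flow
```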
In one implementation, computing both the optical flow from the first intermediate video frame to the first video frame and the optical flow from the first intermediate video frame to the second video frame with the first neural network includes: computing the flow to the first video frame with the first neural network and deriving the flow to the second video frame from it; or computing the flow to the second video frame with the first neural network and deriving the flow to the first video frame from it.

In the above implementation, a conversion relation holds between the two flows, so obtaining either one suffices to compute the other; there is no need to run the optical flow computation twice through the first neural network, which significantly improves the efficiency of the flow computation.

In one implementation, the derivation is by negation: the optical flow from the first intermediate video frame to the first video frame, negated, serves as the optical flow from the first intermediate video frame to the second video frame, and vice versa.

In the above implementation, assuming that objects move linearly between the first and second video frames (straight-line motion at uniform velocity), the two intermediate-frame flows are opposite optical flows, i.e. opposite in direction and equal in magnitude, and each is computed from the other simply and efficiently. If the first and second video frames are consecutive, or the frame rate of the video is high, this assumption is easy to satisfy, since any in-frame motion can then be approximated as the accumulation of many short linear motions.
In one implementation, determining the first intermediate video frame from the first mapped video frame and/or the second mapped video frame includes: correcting the first mapped video frame based on the optical flow from the first intermediate video frame to the first video frame to obtain the first intermediate video frame; or correcting the second mapped video frame based on the optical flow from the first intermediate video frame to the second video frame to obtain the first intermediate video frame; or correcting a first fused video frame, formed by fusing the first mapped video frame and the second mapped video frame, based on the optical flow from the first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, to obtain the first intermediate video frame.

In the above implementations, the preliminarily computed first intermediate video frame (the first mapped video frame, the second mapped video frame, or the first fused video frame) is corrected to improve image quality and the interpolation result.

In one implementation, correcting the first fused video frame based on the optical flow from the first intermediate video frame to the first video frame includes: predicting a first image correction term and a first fusion mask with a third neural network, based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame; fusing the first mapped video frame and the second mapped video frame into the first fused video frame as indicated by the pixel values of the first fusion mask; and correcting the first fused video frame with the first image correction term to obtain the first intermediate video frame.

In the above implementation, a third neural network is designed to learn how to fuse and correct the video frames, which helps improve the quality of the first intermediate video frame finally obtained.
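The fusion-and-correction step itself can be written compactly; the sketch below assumes the predicted mask takes values in [0, 1] (e.g. a sigmoid output) and that frames are normalized to [0, 1].

```python
def fuse_and_correct(warp1, warp2, mask, correction):
    """Blend the two mapped frames per the fusion mask, then apply the
    image correction term; output is clamped to the valid pixel range."""
    fused = mask * warp1 + (1.0 - mask) * warp2        # first fused video frame
    return (fused + correction).clamp(0.0, 1.0)        # first intermediate frame
```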
In one implementation, the third neural network includes a second feature extraction network and a codec network, the codec network including an encoder and a decoder, and predicting the first image correction term and the first fusion mask based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame includes: performing feature extraction on the first video frame and the second video frame separately with the second feature extraction network; backward-mapping the extracted feature maps with the optical flow from the first intermediate video frame to the first video frame; inputting the mapped feature maps, the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame into the encoder for feature extraction; and predicting the first image correction term and the first fusion mask with the decoder from the features extracted by the encoder.

In the above implementation, the second feature extraction network is designed to extract deep features of the original frames (such as edges and textures), and feeding these features into the codec network helps improve the image correction.

In one implementation, determining the first intermediate video frame from the first mapped video frame and the second mapped video frame includes: predicting a second fusion mask with a fourth neural network, based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame; and fusing the first mapped video frame and the second mapped video frame into the first intermediate video frame as indicated by the pixel values of the second fusion mask.

In the above implementation, a fourth neural network is designed to learn how to fuse the video frames, which helps improve the quality of the first intermediate video frame finally obtained.
An embodiment of the present application provides a model training method, including: acquiring a training sample that includes a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame; based on the third and fourth video frames, computing, with a first neural network, the optical flow from a second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame, where the second intermediate video frame is the video frame to be inserted between the third and fourth video frames; backward-mapping the third video frame with the flow to it to obtain a third mapped video frame, and/or backward-mapping the fourth video frame with the flow to it to obtain a fourth mapped video frame; determining the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame; and computing a prediction loss from the second intermediate video frame and the reference video frame and updating the parameters of the first neural network according to the prediction loss.

The above method trains the first neural network used in the video frame interpolation method; that network can compute the intermediate-frame optical flow accurately and improve the interpolation result.
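Schematically, one training step might look as follows, reusing the `interpolate` sketch above and assuming a plain L1 prediction loss (the loss choice is an assumption; variants appear below).

```python
def train_step(flow_net, fuse, optimizer, frame3, frame4, reference):
    pred_mid = interpolate(frame3, frame4, flow_net, fuse)  # second intermediate frame
    loss = (pred_mid - reference).abs().mean()              # prediction loss vs. label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```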
In one implementation, computing the prediction loss from the second intermediate video frame and the reference video frame includes: computing a first loss from the difference between the second intermediate video frame and the reference video frame; computing the image gradient of the second intermediate video frame and the image gradient of the reference video frame separately, and computing a second loss from the difference between the two image gradients; and computing the prediction loss from the first loss and the second loss.

In the above implementation, adding a second loss that characterizes the difference between gradient images to the prediction loss helps alleviate the blurring of object edges in the generated second intermediate video frame.
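A sketch of such a second loss, using finite-difference image gradients and an assumed L1 distance:

```python
def gradient_loss(pred, ref):
    """Second loss: difference between the image gradients of the predicted
    intermediate frame and of the reference frame (finite differences)."""
    def grads(img):
        gx = img[..., :, 1:] - img[..., :, :-1]        # horizontal gradient
        gy = img[..., 1:, :] - img[..., :-1, :]        # vertical gradient
        return gx, gy
    pgx, pgy = grads(pred)
    rgx, rgy = grads(ref)
    return (pgx - rgx).abs().mean() + (pgy - rgy).abs().mean()

# prediction_loss = first_loss + lambda_grad * gradient_loss(pred, ref),
# where lambda_grad is an assumed weighting hyperparameter.
```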
In one implementation, computing the prediction loss from the second intermediate video frame and the reference video frame includes: computing a first loss from the difference between the second intermediate video frame and the reference video frame; computing, with a pretrained fifth neural network, the optical flow from the reference video frame to the third video frame and/or the optical flow from the reference video frame to the fourth video frame; computing a third loss from the difference between the optical flow computed by the first neural network and the corresponding optical flow computed by the fifth neural network; and computing the prediction loss from the first loss and the third loss.

In the above implementation, optical flow computed by the pretrained fifth neural network is used as a label for supervised training of the first neural network, realizing optical-flow knowledge transfer (embodied by adding a third loss to the prediction loss). This helps improve the accuracy of the first network's intermediate-frame flow prediction and, in turn, the quality of the first intermediate video frame finally obtained.
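A sketch of this third loss, treating the fifth network as a frozen teacher (the L1 distance is an assumption):

```python
import torch

def distillation_loss(student_flow, teacher_net, reference, frame3):
    """Third loss: supervise the first network's intermediate-frame flow
    with flow computed by the pretrained fifth network (frozen teacher)."""
    with torch.no_grad():
        teacher_flow = teacher_net(reference, frame3)  # flow: reference -> frame3
    return (student_flow - teacher_flow).abs().mean()
```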
In one implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and each module outputs the optical flow from the second intermediate video frame to the third video frame as corrected by that module. Computing the prediction loss from the second intermediate video frame and the reference video frame then includes: computing a first loss from the difference between the second intermediate video frame and the reference video frame; computing, with a pretrained fifth neural network, the optical flow from the reference video frame to the third video frame; computing a fourth loss from the differences between the optical flow output by each module and the optical flow computed by the fifth neural network; and computing the prediction loss from the first loss and the fourth loss.

In the above implementation, optical flow computed by the pretrained fifth neural network is likewise used as a label for supervised training of the first neural network, realizing optical-flow knowledge transfer (embodied by adding a fourth loss to the prediction loss), which helps improve the accuracy of the first network's intermediate-frame flow prediction and, in turn, the quality of the first intermediate video frame finally obtained.

When the first neural network includes at least one optical flow calculation module, the flow estimate is produced progressively from coarse to fine, so a loss can be computed on the output of every module and accumulated into the fourth loss. Computing the fourth loss helps adjust the parameters of each module more precisely, so that the predictive ability of every module is improved.
In one implementation, computing the third loss from the difference between the flows computed by the first and fifth neural networks includes: backward-mapping the third video frame with the optical flow computed by the fifth neural network to obtain a fifth mapped video frame; determining, from the difference between the fifth mapped video frame and the reference video frame, whether the flow vector computed by the fifth neural network at each pixel position is accurate; and computing the third loss from the differences between the first effective flow vectors in the flow computed by the fifth neural network and the second effective flow vectors in the corresponding flow computed by the first neural network, where a first effective flow vector is a flow vector computed accurately by the fifth neural network, and a second effective flow vector is the flow vector, in the corresponding flow computed by the first neural network, located at the pixel position of a first effective flow vector.

In one implementation, computing the fourth loss from the differences between each module's output flow and the flow computed by the fifth neural network includes: backward-mapping the third video frame with the optical flow computed by the fifth neural network to obtain a fifth mapped video frame; determining, from the difference between the fifth mapped video frame and the reference video frame, whether the flow vector computed by the fifth neural network at each pixel position is accurate; and computing the fourth loss from the differences between the first effective flow vectors in the flow computed by the fifth neural network and the third effective flow vectors in the flow output by each module, where a first effective flow vector is a flow vector computed accurately by the fifth neural network, and a third effective flow vector is the flow vector, in the flow output by each module, located at the pixel position of a first effective flow vector.

Through long-term research, the inventors found that when the fifth neural network computes optical flow, the flow vectors at some pixel positions may be inaccurate, for example because of ambiguity at boundaries and in occluded regions. Such vectors can be excluded from the labels used for the supervised learning of the first neural network, and only the more accurately computed flow vectors are used as optical flow labels, which is the content of the above two implementations.
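The validity check can be sketched as follows, reusing `backward_warp` from above: the teacher flow is used to backward-map the third video frame, pixels where the result matches the reference frame within an assumed threshold are treated as accurate, and only those positions contribute to the loss.

```python
def masked_distillation_loss(student_flow, teacher_flow, frame3, reference,
                             threshold=0.05):
    """Keep only teacher flow vectors that reproduce the reference frame
    well when used to backward-map frame3 (threshold is an assumption)."""
    rewarped = backward_warp(frame3, teacher_flow)     # fifth mapped video frame
    err = (rewarped - reference).abs().mean(dim=1, keepdim=True)
    valid = (err < threshold).float()                  # first effective flow vectors
    diff = (student_flow - teacher_flow).abs().sum(dim=1, keepdim=True)
    return (valid * diff).sum() / valid.sum().clamp(min=1.0)
```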
In one implementation, the first neural network includes at least one optical flow calculation module connected in sequence, and each module corrects the optical flow input to it using the descriptor matching unit, the sub-pixel refinement layer, and the regularization layer of the LiteFlownet network. Before computing the optical flow from the second intermediate video frame to the third video frame and/or the fourth video frame with the first neural network based on the third and fourth video frames, the method further includes: initializing the parameters of the first neural network with parameters obtained by pretraining the LiteFlownet network.

If the optical flow calculation modules of the first neural network are obtained by structure transfer from the LiteFlownet network, then when training the first neural network, the LiteFlownet parameters can be loaded directly as initial values and fine-tuned, which not only speeds up the convergence of the first neural network but also helps improve its performance.
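Schematically (the checkpoint path and the name `first_network` are assumptions; non-strict loading is used so that layers without a pretrained counterpart keep their random initialization):

```python
import torch

# Assumes the correction submodules keep LiteFlownet's layer names, so that
# matching weights are loaded and unmatched layers stay randomly initialized.
state = torch.load("liteflownet_pretrained.pth", map_location="cpu")  # assumed path
first_network.load_state_dict(state, strict=False)
# ... then fine-tune first_network end to end as described above.
```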
In one implementation, determining the second intermediate video frame from the third and fourth mapped video frames includes: predicting a second image correction term and a third fusion mask with the third neural network, based on the third mapped video frame, the fourth mapped video frame, and the optical flow from the second intermediate video frame to the third video frame; fusing the third and fourth mapped video frames into a second fused video frame as indicated by the pixel values of the third fusion mask; and correcting the second fused video frame with the second image correction term to obtain the second intermediate video frame. Computing the prediction loss and updating parameters then includes: computing the prediction loss from the second intermediate video frame and the reference video frame, and updating the parameters of both the first neural network and the third neural network according to the prediction loss.

If the third neural network is used for image correction when interpolating with the first neural network, the third neural network can be trained together with the first neural network in the model training stage, which helps simplify the training process.

In one implementation, determining the second intermediate video frame from the third and fourth mapped video frames includes: predicting a second image correction term and a fourth fusion mask with the fourth neural network, based on the third mapped video frame, the fourth mapped video frame, and the optical flow from the second intermediate video frame to the third video frame; and fusing the third and fourth mapped video frames into the second intermediate video frame as indicated by the pixel values of the fourth fusion mask. Computing the prediction loss and updating parameters then includes: computing the prediction loss from the second intermediate video frame and the reference video frame, and updating the parameters of both the first neural network and the fourth neural network according to the prediction loss.

If the fourth neural network is used for image correction when interpolating with the first neural network, the fourth neural network can be trained together with the first neural network in the model training stage, which helps simplify the training process.
An embodiment of the present application provides a video frame interpolation apparatus, including: a first video frame acquisition unit configured to acquire a first video frame and a second video frame; a first optical flow calculation unit configured to compute, with a first neural network and based on the first and second video frames, the optical flow from a first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, where the first intermediate video frame is the video frame to be inserted between the first and second video frames; a first backward mapping unit configured to backward-map the first video frame with the flow to it to obtain a first mapped video frame, and/or to backward-map the second video frame with the flow to it to obtain a second mapped video frame; and a first intermediate frame determination unit configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.

An embodiment of the present application provides a model training apparatus, including: a second video frame acquisition unit configured to acquire a training sample that includes a third video frame, a fourth video frame, and a reference video frame located between them; a second optical flow calculation unit configured to compute, with a first neural network and based on the third and fourth video frames, the optical flow from a second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame, where the second intermediate video frame is the video frame to be inserted between the third and fourth video frames; a second backward mapping unit configured to backward-map the third video frame with the flow to it to obtain a third mapped video frame, and/or to backward-map the fourth video frame with the flow to it to obtain a fourth mapped video frame; a second intermediate frame determination unit configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame; and a parameter update unit configured to compute a prediction loss from the second intermediate video frame and the reference video frame and to update the parameters of the first neural network according to the prediction loss.

An embodiment of the present application provides a computer-readable storage medium storing computer program instructions that, when read and executed by a processor, perform the video frame interpolation method provided in the embodiments of the present application.

An embodiment of the present application provides an electronic device, including a memory and a processor, the memory storing computer program instructions that, when read and executed by the processor, perform the video frame interpolation method provided in the embodiments of the present application.
DESCRIPTION OF DRAWINGS
To explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present application and therefore should not be regarded as limiting the scope; those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 shows a possible flow of the video frame interpolation method provided by an embodiment of the present application;
Fig. 2 shows a possible network architecture of the video frame interpolation method provided by an embodiment of the present application;
Fig. 3 shows a possible structure of the first neural network provided by an embodiment of the present application;
Fig. 4 shows a method of constructing the first image and the second image by means of feature pyramids;
Fig. 5 shows a possible structure of the second neural network provided by an embodiment of the present application;
Fig. 6 shows a possible structure of the third neural network provided by an embodiment of the present application;
Fig. 7 shows a possible flow of the model training method provided by an embodiment of the present application;
Fig. 8 shows a possible network architecture of the model training method provided by an embodiment of the present application;
Fig. 9 shows a possible structure of the video frame interpolation apparatus provided by an embodiment of the present application;
Fig. 10 shows another possible structure of the video frame interpolation apparatus provided by an embodiment of the present application;
Fig. 11 shows a possible structure of the electronic device provided by an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings. It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.

The terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes it.

The terms "first", "second", and the like are used only to distinguish one entity or operation from another, and should not be understood as indicating or implying relative importance, or as requiring or implying any such actual relationship or order between these entities or operations.
Fig. 1 shows a possible flow of the video frame interpolation method provided by an embodiment of the present application, and Fig. 2 shows a network architecture that can be used in the method, for reference in the following description. The method in Fig. 1 can be executed by, but is not limited to, the electronic device shown in Fig. 11; for the structure of that device, refer to the description of Fig. 11 below. Referring to Fig. 1, the method includes:
Step S110: acquire a first video frame and a second video frame.

The first video frame and the second video frame are two frames, one preceding the other, of the video to be interpolated; they may or may not be consecutive. Apart from this temporal relationship, the present application does not limit the choice of the two frames. For convenience, the first video frame is denoted I_1 and the second video frame is denoted I_2.
Step S120: based on the first video frame and the second video frame, compute, with a first neural network, the optical flow from a first intermediate video frame to the first video frame and the optical flow from the first intermediate video frame to the second video frame.

The first intermediate video frame is the video frame to be inserted between I_1 and I_2. The present application does not limit the insertion position: it may or may not be exactly midway between I_1 and I_2. For convenience, the first intermediate video frame is denoted I_syn1.

So-called frame interpolation mainly amounts to obtaining I_syn1; inserting I_syn1 into the video is then straightforward. The solution of the present application obtains I_syn1 based on the optical flow of the first intermediate video frame, which includes the flow from the first intermediate video frame to the first video frame, denoted Flow_mid→1, and the flow from the first intermediate video frame to the second video frame, denoted Flow_mid→2.
In some implementations, I_1 and I_2 can be input to the first neural network, which predicts Flow_mid→1 and Flow_mid→2 separately.

If the motion of objects within I_1 and I_2 follows certain laws of motion, a corresponding conversion relation also holds between Flow_mid→1 and Flow_mid→2. Therefore, in other implementations, the first neural network can compute Flow_mid→1 and Flow_mid→2 can be derived from it, as shown in Fig. 2 (Flow_mid→2 not shown); equally, the network can compute Flow_mid→2 and Flow_mid→1 can be derived from it. In these implementations, a single optical flow computation with the first neural network yields both required flows, which significantly improves the efficiency of the flow computation.

Optionally, assuming that objects move linearly between I_1 and I_2 (straight-line motion at uniform velocity), Flow_mid→1 and Flow_mid→2 are opposite optical flows: once one is obtained, the other is computed by negation. "Opposite optical flows" can be written as Flow_mid→1 = -Flow_mid→2, meaning the two flows are opposite in direction and equal in magnitude. Since an optical flow can be regarded as the set of flow vectors at every pixel position of an image, taking the opposite of a flow only requires reversing all the flow vectors it contains, which is simple and efficient. Because any long-duration motion of in-frame objects can be approximated as the accumulation of many short linear motions, the linear-motion assumption is easy to satisfy when I_1 and I_2 are consecutive video frames or the frame rate of the video is high, so this flow conversion is highly practical.
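The opposite-flow relation follows from a one-line derivation; the sketch below assumes the intermediate frame sits at the temporal midpoint of I_1 and I_2 (for other insertion positions the two flows scale with the respective time offsets instead of being exactly opposite).

```latex
% Under linear motion, a point located at x_mid in the intermediate frame
% lies midway between its positions x_1 in I_1 and x_2 in I_2:
x_{mid} = \tfrac{1}{2}\,(x_1 + x_2)
\;\Longrightarrow\;
\mathrm{Flow}_{mid\to 1} = x_1 - x_{mid} = \tfrac{1}{2}\,(x_1 - x_2)
  = -(x_2 - x_{mid}) = -\,\mathrm{Flow}_{mid\to 2}.
```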
Taking the case Flow_mid→1 = -Flow_mid→2 as an example, Fig. 3 shows the structure of a first neural network that can compute Flow_mid→1. Referring to Fig. 3, the first neural network includes at least one optical flow calculation module connected in sequence (three modules are shown in the figure). Each module corrects the optical flow input to it and outputs the corrected flow.

The optical flow input to the first module (optical flow calculation module 1 in Fig. 3) is a preset Flow_mid→1; since no flow computation has been performed yet, this preset flow can take a default value, for example zero (meaning all flow vectors it contains are zero). After the first module corrects the preset Flow_mid→1, it outputs the correction result, which can be regarded as the Flow_mid→1 computed by the first module. Every module after the first corrects the Flow_mid→1 output by the preceding module and outputs its correction result, which can be regarded as the Flow_mid→1 computed by that module. The Flow_mid→1 output by the last module (optical flow calculation module 3 in Fig. 3) is the flow finally computed by the first neural network. It can be seen that, within the first neural network, the estimate of Flow_mid→1 is corrected continually from coarse to fine, finally yielding a fairly accurate optical flow result.

Each optical flow calculation module has a similar structure, as shown on the left of Fig. 3. Besides Flow_mid→1, the input of a module also includes a first image and a second image, denoted J_1 and J_2 for convenience, although different modules do not necessarily receive the same J_1 and J_2. J_1 is determined from I_1 and J_2 from I_2, specifically including, but not limited to, one of the following ways:
(1) Directly use I_1 as J_1 and I_2 as J_2, feeding I_1 and I_2 to every module. Way (1) requires no computation to produce the module inputs and therefore helps improve the efficiency of the optical flow computation.

(2) Use feature maps output after I_1 is processed by convolutional layers as J_1, and feature maps output after I_2 is processed by convolutional layers as J_2. Since processing I_1 and I_2 with several convolutional layers yields multiple feature maps of different scales, different modules can receive feature maps of different scales, but the J_1 and J_2 input to the same module have the same shape. Way (2) requires convolution to produce the module inputs, which is computationally more expensive, but because deeper features of the images are taken into account, the optical flow result is also more accurate.

In some implementations, a first feature extraction network, which is a convolutional neural network, can extract features from I_1 and I_2 separately, forming a feature pyramid for each, where each level of a pyramid, counted from the top, corresponds to one optical flow calculation module of the first neural network, counted from the first, and the feature maps at the same level of the two pyramids have the same shape.
For example, referring to Fig. 4, the first feature extraction network (not shown) extracts features from I_1 and I_2 separately, yielding two 3-level feature pyramids that correspond to the 3 optical flow calculation modules in Fig. 3: level 1 (the top, i.e. the level closest to I_1 and I_2 in the figure) corresponds to optical flow calculation module 1, level 2 to module 2, and level 3 (the bottom, i.e. the level farthest from I_1 and I_2) to module 3. Each level of a feature pyramid is a feature map: in the pyramid of I_1, the feature map at level i (i = 1, 2, 3) is denoted F_i^1; in the pyramid of I_2, the feature map at level i is denoted F_i^2; and F_i^1 and F_i^2 have the same shape.
After the two feature pyramids are constructed, they are traversed layer by layer from the top layer downward, and the two feature maps on the same layer are taken as the J 1 and J 2 of the optical flow calculation module corresponding to that layer. For example, in FIG. 4, P 1-i and P 2-i are taken as the J 1 and J 2 of the i-th optical flow calculation module in FIG. 3, respectively.
Since the size of the feature maps in a feature pyramid increases gradually from the top layer to the bottom layer, the top layer corresponds to smaller, coarser feature maps and the bottom layer to larger, finer ones. Feeding the feature maps layer by layer, starting from the top of the pyramid, into the corresponding optical flow calculation modules is therefore conducive to a progressive refinement of the optical flow calculation. Generally speaking, however, according to the characteristics of convolutional neural networks, large feature maps are extracted first and small feature maps later, i.e., the feature pyramid is constructed from the bottom layer to the top layer.
It should be pointed out that, since I 1 and I 2 can themselves be regarded as a special kind of feature map, way (2) does not exclude taking I 1 and I 2 as the J 1 and J 2 of the first optical flow calculation module.
(3) Take the image obtained by downsampling I 1 as J 1, and the image obtained by downsampling I 2 as J 2. Since I 1 and I 2 can yield multiple downsampled images of different scales after being downsampled multiple times, each optical flow calculation module can receive downsampled images of a different scale, but the J 1 and J 2 input to the same module have the same shape. Way (3) only needs simple downsampling to prepare the module inputs, which costs little computation and is therefore beneficial to the computational efficiency of the optical flow calculation modules. Note that a convolution operation can, to some extent, also be regarded as downsampling, but the downsampling in way (3) should be understood as excluding downsampling by convolution; for example, it may be performed by directly extracting pixels of the original image at intervals given by the downsampling factor.
In some implementations, I 1 and I 2 may be downsampled separately to form an image pyramid of I 1 and an image pyramid of I 2. Each layer of the image pyramids, starting from the top layer, corresponds to one optical flow calculation module of the first neural network, starting from the first module, and images on the same layer of the two pyramids have the same shape. The structure of an image pyramid is similar to that of a feature pyramid, except that the pyramid is made of downsampled original images (i.e., of I 1 or I 2) rather than feature maps.
After the two image pyramids are constructed, they are traversed layer by layer from the top layer downward, and the two downsampled images on the same layer are taken as the J 1 and J 2 of the optical flow calculation module corresponding to that layer.
Since the size of the downsampled images in an image pyramid increases gradually from the top layer to the bottom layer, the top layer corresponds to smaller, coarser downsampled images and the bottom layer to larger, finer ones. Feeding the downsampled images layer by layer, starting from the top of the pyramid, into the corresponding optical flow calculation modules is therefore conducive to a progressive refinement of the optical flow calculation. Generally speaking, however, according to the characteristics of the downsampling operation, large downsampled images are produced first and small downsampled images later, i.e., the image pyramid is constructed from the bottom layer to the top layer.
It should be pointed out that, since I 1 and I 2 can themselves be regarded as a special kind of downsampled image (with a downsampling factor of 1), way (3) does not exclude taking I 1 and I 2 as the J 1 and J 2 of the first optical flow calculation module.
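For illustration only, the following Python (PyTorch) sketch shows one way to realize the interval-based downsampling of way (3) and build the two image pyramids; the (N, C, H, W) tensor layout, the 3-level setup and all names are assumptions made for this example, not part of the disclosure:

import torch

def build_image_pyramid(frame, num_levels=3):
    # Top layer (index 0) is the smallest image; bottom layer is the largest.
    levels = []
    for i in range(num_levels):
        step = 2 ** (num_levels - 1 - i)             # sampling interval for this layer
        levels.append(frame[:, :, ::step, ::step])   # direct pixel extraction, no convolution
    return levels

# The i-th levels of the two pyramids serve as the J 1 and J 2 of the i-th
# optical flow calculation module; same-layer images have the same shape.
I1 = torch.rand(1, 3, 256, 256)
I2 = torch.rand(1, 3, 256, 256)
pyramid1, pyramid2 = build_image_pyramid(I1), build_image_pyramid(I2)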
Continuing to refer to FIG. 3, in an optical flow calculation module, backward warping is performed on the J 1 input to the module based on the Flow mid→1 input to the module, obtaining a first mapped image, denoted warp(J 1); backward warping is likewise performed on the J 2 input to the module, obtaining a second mapped image, denoted warp(J 2).
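As a minimal sketch (an assumption of this example, not a prescribed implementation), backward warping such as warp(J 1) can be realized in Python (PyTorch) with grid_sample; the flow layout (N, 2, H, W), with channel 0 the horizontal and channel 1 the vertical displacement, is assumed:

import torch
import torch.nn.functional as F

def backward_warp(image, flow):
    n, _, h, w = image.shape
    # Base sampling grid of absolute pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=image.dtype, device=image.device),
        torch.arange(w, dtype=image.dtype, device=image.device),
        indexing="ij",
    )
    # For each target pixel, sample the source image at (x + u, y + v).
    x = xs.unsqueeze(0) + flow[:, 0]
    y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack((2.0 * x / (w - 1) - 1.0, 2.0 * y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(image, grid, align_corners=True)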
The optical flow calculation module includes an optical flow correction module, which takes the Flow mid→1 input to the optical flow calculation module together with the warp(J 1) and warp(J 2) above as its input, corrects Flow mid→1, and outputs the corrected Flow mid→1, which is also the output of the optical flow calculation module.
Two implementations of the optical flow correction module are listed below; it can be understood that the optical flow correction module may also adopt other implementations:
(1) Design a second neural network: input the Flow mid→1, warp(J 1) and warp(J 2) of the optical flow calculation module to the second neural network, use the second neural network to predict an optical flow correction term Flow res, and then correct the module's input Flow mid→1 with Flow res to obtain the corrected Flow mid→1. For example, in an optional solution, the input Flow mid→1 and Flow res are added (either directly or as a weighted sum) to obtain the corrected Flow mid→1. The second neural network can adopt a relatively simple network structure so as to reduce the amount of computation and improve the efficiency of optical flow correction, thereby speeding up the optical flow calculation of the module.
The second neural network may adopt an encoder-decoder network; FIG. 5 shows one possible structure. In FIG. 5, the left part of the network (R1 to R4) is the encoder and the right part (D1 to D4) is the decoder, where Ri (i=1, 2, 3, 4) denotes an encoding module, for example a residual block (ResBlock), and Di (i=1, 2, 3, 4) denotes a decoding module, for example a deconvolution layer. Flow mid→1, warp(J 1) and warp(J 2) are concatenated and input to R1. Every encoding module except R4, besides passing its extracted features to the next encoding module, also feeds them into the decoder, where they are added to the output of the corresponding decoding module so as to fuse features at different scales; the features extracted by R4 are output directly to D4, and D1 outputs the optical flow correction term Flow res predicted by the second neural network. The intermediate outputs of the second neural network (i.e., the outputs of the convolutional and deconvolutional layers) may be batch-normalized, with PReLU used as the nonlinear activation function. It can be understood that FIG. 5 is only an example, and the second neural network may also adopt other structures.
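A minimal sketch of implementation (1) follows. The plain convolutional stack below stands in for the encoder-decoder of FIG. 5 (whose layer sizes are not fully specified above, so they are assumptions here), while the interface, predicting Flow res from the concatenated [Flow mid→1, warp(J 1), warp(J 2)] and adding it to the input flow, matches the description:

import torch
import torch.nn as nn

class FlowCorrection(nn.Module):
    def __init__(self, image_channels=3):
        super().__init__()
        in_ch = 2 + 2 * image_channels            # flow (2) + two warped images
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.PReLU(32),
            nn.Conv2d(32, 32, 3, padding=1), nn.PReLU(32),
            nn.Conv2d(32, 2, 3, padding=1),       # predicts Flow_res (2 channels)
        )

    def forward(self, flow, warped_j1, warped_j2):
        flow_res = self.net(torch.cat([flow, warped_j1, warped_j2], dim=1))
        return flow + flow_res                    # direct (unweighted) addition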
(2) Directly migrate the optical flow correction structure from the LiteFlownet network. LiteFlownet is an existing network usable for optical flow calculation, but it can only compute the optical flow between two given frames, e.g., the optical flow Flow 1→2 from the first video frame to the second video frame; it cannot compute the intermediate-frame optical flow Flow mid→1.
The NetE part of the LiteFlownet network contains a structure that is functionally similar to the optical flow correction module, called the flow inference module, which can be roughly divided into three parts: a descriptor matching unit, a sub-pixel refinement unit, and a regularization module.
The above flow inference module can be migrated directly into the optical flow correction module of the present application, but the inputs of each part need to be adapted to a certain extent:
The input of the descriptor matching unit is adapted to warp(J 1), warp(J 2) and the pre-correction Flow mid→1. The matching cost volume between warp(J 1) and warp(J 2) is computed in the descriptor matching unit, and the four items of information, namely warp(J 1), warp(J 2), the pre-correction Flow mid→1 and the computed matching cost volume, are input to the convolutional neural network in the descriptor matching unit for calculation, which finally outputs the Flow mid→1 computed by the descriptor matching unit. The matching cost volume measures the degree of coincidence between the mapped images warp(J 1) and warp(J 2).
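For reference, a naive local-correlation sketch of such a matching cost volume between warp(J 1) and warp(J 2) is given below; the search radius and the mean-over-channels normalization are assumptions, since the application reuses LiteFlownet's descriptor matching unit rather than defining its own:

import torch
import torch.nn.functional as F

def cost_volume(f1, f2, max_disp=3):
    """f1, f2: (N, C, H, W). Returns (N, (2*max_disp+1)**2, H, W)."""
    n, c, h, w = f1.shape
    f2_pad = F.pad(f2, [max_disp] * 4)            # pad left/right/top/bottom
    costs = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2_pad[:, :, dy:dy + h, dx:dx + w]
            # Correlation between f1 and each shifted window of f2.
            costs.append((f1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)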
The input of the sub-pixel refinement unit is adapted to warp(J 1), warp(J 2) and the Flow mid→1 output by the descriptor matching unit; the sub-pixel refinement unit corrects the input Flow mid→1 at sub-pixel accuracy and outputs the corrected Flow mid→1.
The input of the regularization module is adapted to warp(J 1), warp(J 2) and the Flow mid→1 output by the sub-pixel refinement unit; the regularization module smooths the input Flow mid→1 and outputs the corrected Flow mid→1, which is the output of the optical flow correction module.
In addition, the NetC part of the LiteFlownet network constructs feature pyramids, so this part of the convolutional layers can also be migrated into the solution of the present application as the first feature extraction network, used to extract P 1-i and P 2-i as the J 1 and J 2 input to the optical flow calculation modules.
Compared with implementation (1), implementation (2) effectively migrates existing optical flow calculation results, but because the LiteFlownet network contains more operators, its computation is somewhat more complicated.
Step S130: performing backward mapping on the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain a first mapped video frame, and performing backward mapping on the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain a second mapped video frame.
After Flow mid→1 is calculated in step S120, it can be used in step S130 to perform backward mapping on I 1, obtaining the first mapped video frame, denoted warp(I 1); similarly, backward mapping is performed on I 2 (using Flow mid→2), obtaining the second mapped video frame, denoted warp(I 2), as shown in FIG. 2.
Step S140: determining the first intermediate video frame according to the first mapped video frame and the second mapped video frame.
In some implementations, warp(I 1) and warp(I 2) may first be fused to obtain a first fused video frame, denoted I fusion1, after which I fusion1 is corrected according to Flow mid→1 and/or Flow mid→2, and the corrected image is taken as I syn1; this is beneficial to the image quality of I syn1 and improves the frame interpolation effect. If a conversion relationship exists between Flow mid→1 and Flow mid→2, I fusion1 may be corrected according to only one of them. The frame fusion and the image correction may be performed successively; for example, warp(I 1) and warp(I 2) are first averaged to obtain I fusion1, and a neural network is then designed to correct I fusion1. Alternatively, frame fusion and image correction may both be realized by a single neural network, i.e., the neural network simultaneously learns how to fuse the video frames and how to correct the image, as shown in FIG. 2.
In FIG. 2, first, warp(I 1), warp(I 2) and Flow mid→1 are input to a third neural network, which predicts a first image correction term and a first fusion mask, denoted I res1 and mask1 respectively.
Then, warp(I 1) and warp(I 2) are fused into I fusion1 as indicated by the pixel values in mask1. For example, each pixel value in mask1 can only take 0 or 1: if the pixel value at a position is 0, the pixel value of I fusion1 at that position is taken from warp(I 1) at that position; if the pixel value at a position is 1, the pixel value of I fusion1 at that position is taken from warp(I 2) at that position.
Finally, I fusion1 is corrected with I res1 to obtain I syn1. For example, in an optional solution, I fusion1 and I res1 are added (either directly or as a weighted sum) to obtain I syn1; with direct addition, I syn1 = I fusion1 + I res1.
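A minimal sketch of this fusion-and-correction step follows, using the 0 → warp(I 1), 1 → warp(I 2) mask convention described above; allowing soft mask values in [0, 1] is an assumption made here for differentiability, whereas the text above describes the binary case:

def fuse_and_correct(warp_i1, warp_i2, mask1, i_res1):
    # mask1 value 0 selects warp(I 1), value 1 selects warp(I 2) at each pixel;
    # all arguments are tensors of the same spatial size.
    i_fusion1 = (1.0 - mask1) * warp_i1 + mask1 * warp_i2
    return i_fusion1 + i_res1   # I syn1 = I fusion1 + I res1 (direct addition)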
The structure of the third neural network is exemplified below. In some implementations, the third neural network includes a second feature extraction network and an encoder-decoder network, and works as follows: first, the second feature extraction network performs feature extraction on I 1 and I 2 respectively; then the feature maps extracted by the second feature extraction network are backward-warped with Flow mid→1; the resulting mapped feature maps, together with warp(I 1), warp(I 2) and Flow mid→1, are input to the encoder of the encoder-decoder network for feature extraction; finally, the decoder of the encoder-decoder network predicts I res1 and mask1 from the features extracted by the encoder.
FIG. 6 shows an implementation of the third neural network consistent with the above description. Referring to FIG. 6, the left part of the network (C1 to C3) is the second feature extraction network, and the right part is the encoder-decoder network, whose main structure is similar to that of FIG. 5 and is not elaborated again. In the second feature extraction network, Ci (i=1, 2, 3) denotes one or more convolutional layers, so that the second feature extraction network builds two 3-layer feature pyramids. In the feature pyramid built from I 1, the feature map of the i-th (i=1, 2, 3) layer is denoted F 1-i (F 1-1 being the bottom layer and F 1-3 the top layer); in the feature pyramid built from I 2, the feature map of the i-th layer is denoted F 2-i (F 2-1 being the bottom layer and F 2-3 the top layer); F 1-i and F 2-i have the same shape. In FIG. 6, the feature maps F 1-i and F 2-i are each backward-warped based on Flow mid→1, and the resulting mapped feature maps are denoted warp(F 1-i) and warp(F 2-i). Then warp(F 1-i) and warp(F 2-i) are concatenated with the output of encoding module Ri to serve as the input of encoding module Ri+1. It can be understood that FIG. 6 is only an example, and the third neural network may also adopt other structures.
In the above implementation, the second feature extraction network is designed to extract deep features of the original images (such as edges and textures), and feeding these features into the encoder-decoder network is beneficial to the image correction effect.
In the solution shown in FIG. 2, I res1 and mask1 are predicted by the third neural network, but in some other implementations this solution can optionally be simplified: first, warp(I 1), warp(I 2) and Flow mid→1 are input to a fourth neural network; the fourth neural network then predicts a second fusion mask, denoted mask2; finally, warp(I 1) and warp(I 2) are fused directly into I syn1 as indicated by the pixel values in mask2. Since these implementations do not need to compute I res1, the calculation process is simpler, and the fourth neural network can focus on learning the fusion mask. The design of the fourth neural network may refer to the third neural network and is not detailed here.
In still other implementations, warp(I 1) and warp(I 2) may be fused directly, for example by simply averaging the two to obtain I syn1. The calculation process of these implementations is extremely simple, but the quality of the obtained intermediate frame is somewhat poorer.
In the solution shown in FIG. 2, the first intermediate video frame is produced by fusing the first mapped video frame and the second mapped video frame (possibly with further correction), but there are also solutions in which the first intermediate video frame is produced directly from the first mapped video frame or the second mapped video frame alone (possibly with further correction). The specific steps of these solutions are as follows:
Scheme A

Step A1: obtaining the first video frame and the second video frame;

Step A2: based on the first video frame and the second video frame, using the first neural network to calculate the optical flow from the first intermediate video frame to the first video frame;

Step A3: performing backward mapping on the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain the first mapped video frame;

Step A4: determining the first intermediate video frame according to the first mapped video frame.

For step A4, in different implementations, the first mapped video frame may be taken directly as the first intermediate video frame, or the first mapped video frame may be corrected based on the optical flow from the first intermediate video frame to the first video frame to obtain the first intermediate video frame. For example, a neural network may be designed to correct the first mapped video frame; its structure may refer to the third neural network, but since no video frame fusion is involved, this network only needs to predict the image correction term. For other contents of steps A1 to A4, reference may be made to steps S110 to S140, which are not detailed again.
Scheme B

Step B1: obtaining the first video frame and the second video frame;

Step B2: based on the first video frame and the second video frame, using the first neural network to calculate the optical flow from the first intermediate video frame to the second video frame;

Step B3: performing backward mapping on the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain the second mapped video frame;

Step B4: determining the first intermediate video frame according to the second mapped video frame.

For step B4, in different implementations, the second mapped video frame may be taken directly as the first intermediate video frame, or the second mapped video frame may be corrected based on the optical flow from the first intermediate video frame to the second video frame to obtain the first intermediate video frame. For other contents of steps B1 to B4, reference may be made to steps S110 to S140, which are not detailed again.
To sum up, when performing video frame interpolation, the frame interpolation method provided by the embodiments of the present application uses the first neural network to calculate the intermediate-frame optical flow (i.e., the optical flow from the first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame) directly from the first video frame and the second video frame, without deriving the intermediate-frame optical flow from the optical flow between the first video frame and the second video frame. The intermediate-frame optical flow thus obtained is more accurate, the first intermediate video frame obtained on this basis has better image quality, and ghosting is less likely to occur at the edges of moving objects. In addition, the steps of the above method are simple and significantly improve the frame interpolation efficiency, so that good results can also be achieved in scenarios such as real-time frame interpolation and high-definition video frame interpolation.
It should be pointed out that, in the various possible implementations of the video frame interpolation method, every use of backward mapping may also be replaced with forward mapping (forward warp), in which case the optical flow used for the mapping needs to be adjusted accordingly. For example, if Flow mid→1 is used to backward-map the first video frame, then after replacement Flow 1→mid (the optical flow from the first video frame to the first intermediate video frame) should be used to forward-map the first video frame, and the first neural network should be changed to output Flow 1→mid; likewise, if Flow mid→2 is used to backward-map the second video frame, then after replacement Flow 2→mid (the optical flow from the second video frame to the first intermediate video frame) should be used to forward-map the second video frame, and the first neural network should be changed to output Flow 2→mid.
It should further be pointed out that, in some implementations of the video frame interpolation method, more than one step maps video frames (for example, backward mapping is performed in step S130, and also in step S120 if the implementation of FIG. 3 is adopted). These steps should either all use backward mapping or all use forward mapping, i.e., the mapping type should be kept consistent throughout the frame interpolation process.
By comparison, forward mapping needs to resolve the fusion problem that arises when multiple points are mapped to the same position, and current hardware support for forward mapping is insufficient; for these reasons, this application mainly takes backward mapping as an example, without thereby excluding solutions that adopt forward mapping.
FIG. 7 shows one possible flow of the model training method provided by the embodiments of the present application, which can be used to train the first neural network used by the frame interpolation method of FIG. 1. FIG. 8 shows a network architecture that can be adopted in this method, for reference when the model training method is described. The method in FIG. 7 may be performed by, but is not limited to, the electronic device shown in FIG. 11; for the structure of this electronic device, reference may be made to the later description of FIG. 11. Referring to FIG. 7, the method includes:
Step S210: obtaining a training sample.
The training set consists of multiple training samples, and each training sample is used in a similar manner during training, so the training process can be illustrated with any one of them. Each training sample may include 3 video frames: a third video frame, a fourth video frame, and a reference video frame located between the third and fourth video frames, denoted I 3, I 4 and I mid respectively, as shown in FIG. 8. The video frame to be inserted between I 3 and I 4 is the second intermediate video frame, denoted I syn2; I mid corresponds to I syn2 and represents the real video frame at the position of I syn2 (i.e., the ground truth of the intermediate frame). When selecting training samples, 3 consecutive frames may be taken from a video as one sample, with the first of the 3 frames taken as I 3, the second as I mid, and the third as I 4.
Step S220: based on the third video frame and the fourth video frame, using the first neural network to calculate the optical flow from the second intermediate video frame to the third video frame and the optical flow from the second intermediate video frame to the fourth video frame.
For this step, reference may be made to step S120; it is not elaborated again. For convenience, the optical flow from the second intermediate video frame to the third video frame and the optical flow from the second intermediate video frame to the fourth video frame are denoted Flow mid→3 and Flow mid→4 respectively. In FIG. 8, assuming that objects move linearly between I 3 and I 4, Flow mid→3 = -Flow mid→4, so in FIG. 8 the first neural network only needs to calculate Flow mid→3.
Step S230: performing backward mapping on the third video frame using the optical flow from the second intermediate video frame to the third video frame to obtain a third mapped video frame, and performing backward mapping on the fourth video frame using the optical flow from the second intermediate video frame to the fourth video frame to obtain a fourth mapped video frame.
After Flow mid→3 is calculated in step S220, it can be used in step S230 to perform backward mapping on I 3, obtaining the third mapped video frame, denoted warp(I 3); similarly, backward mapping is performed on I 4 (using Flow mid→4), obtaining the fourth mapped video frame, denoted warp(I 4), as shown in FIG. 8.
Step S240: determining the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame.
Step S240 may refer to step S140. In some implementations, a third neural network is used in step S240 for image correction; referring to FIG. 8, the process is as follows:
First, warp(I 3), warp(I 4) and Flow mid→3 are input to the third neural network, which predicts a second image correction term and a third fusion mask, denoted I res2 and mask3 respectively. Then, warp(I 3) and warp(I 4) are fused into I fusion2 as indicated by the pixel values in mask3; for the specific method, refer to the earlier description of mask1. Finally, I fusion2 is corrected with I res2 to obtain I syn2.
In other implementations, the above solution can be simplified: first, warp(I 3), warp(I 4) and Flow mid→3 are input to the fourth neural network; the fourth neural network then predicts a fourth fusion mask, denoted mask4; finally, warp(I 3) and warp(I 4) are fused directly into I syn2 as indicated by the pixel values in mask4.
Of course, in some implementations image correction may be omitted; for example, warp(I 3) and warp(I 4) are simply averaged to obtain I syn2.
Step S250: calculating a prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network according to the prediction loss.
The loss calculation is explained later. First, since the first neural network is necessarily used in the solution of the present application, its parameters can be updated with the back-propagation algorithm once the prediction loss has been calculated. Second, if the third neural network is used in step S240, its parameters are also updated in step S250, i.e., the third neural network is trained together with the first neural network, which simplifies the training process. Similarly, if the fourth neural network is used in step S240, its parameters are also updated in step S250, i.e., the fourth neural network is trained together with the first neural network. During training, steps S210 to S250 are executed iteratively, and training ends when a training termination condition (for example, model convergence) is satisfied.
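For illustration, a single joint training iteration might look like the following Python (PyTorch) sketch, reusing backward_warp from the earlier sketch; the stand-in single-convolution networks, the use of -Flow mid→3 for warping I 4 (the linear-motion assumption), and restricting the loss to the first (L1) term are simplifying assumptions of this example:

import torch
import torch.nn as nn

flow_net = nn.Conv2d(6, 2, 3, padding=1)    # stand-in for the first neural network
fusion_net = nn.Conv2d(8, 3, 3, padding=1)  # stand-in for the third neural network
l1 = nn.L1Loss()
optimizer = torch.optim.Adam(
    list(flow_net.parameters()) + list(fusion_net.parameters()), lr=1e-4)

def train_step(i3, i4, i_mid):
    flow_mid_3 = flow_net(torch.cat([i3, i4], dim=1))            # step S220
    w3 = backward_warp(i3, flow_mid_3)                           # step S230
    w4 = backward_warp(i4, -flow_mid_3)                          # linear-motion assumption
    i_syn2 = fusion_net(torch.cat([w3, w4, flow_mid_3], dim=1))  # step S240 (simplified)
    loss = l1(i_syn2, i_mid)                                     # step S250, first loss only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()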
The prediction loss can be expressed uniformly by the following formula:

Loss sum = Loss l1 + α·Loss sobel + β·Loss epe + γ·Loss multiscale-epe

where Loss sum is the total prediction loss, and the right-hand side contains four losses: the first loss Loss l1, the second loss Loss sobel, the third loss Loss epe, and the fourth loss Loss multiscale-epe. The first loss is the basic loss and is always included when calculating the prediction loss; the other three losses are optional, and depending on the implementation one or more of them may be added, or none at all, with the restriction that the third loss and the fourth loss cannot be added at the same time. α, β and γ are weighting coefficients serving as hyperparameters of the network. It should be understood that other loss terms may also be added to the right-hand side of the equation. Each loss is described in detail below:
The first loss is calculated from the difference between I syn2 and I mid; its purpose is to make I syn2 closer to I mid through learning, i.e., to make the image quality of the intermediate frame better. In some implementations, the difference between I syn2 and I mid can be defined as their pixel-wise distance, for example, with the L1 distance:

Loss l1 = ∑i∑j |I syn2(i,j) - I mid(i,j)|

where i and j together denote a pixel position.
The second loss is calculated from the difference between the image gradient of I syn2 and the image gradient of I mid; its purpose is to alleviate, through learning, the blurring of object edges in the generated I syn2 (the image gradient corresponds to the edge information in an image). The image gradient can be computed by applying a gradient operator to the image, such as the Sobel, Roberts or Prewitt operator, and the difference between the image gradients of I syn2 and I mid can be defined as their pixel-wise distance. For example, with the Sobel operator and the L1 distance:

Loss sobel = ∑i∑j |Sobel(I syn2)(i,j) - Sobel(I mid)(i,j)|

where Sobel(·) denotes computing the image gradient of an image with the Sobel operator.
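A sketch of the first and second losses in Python (PyTorch) follows; the 3×3 Sobel kernels are standard, while the grouped per-channel convolution is an implementation assumption of this example:

import torch
import torch.nn.functional as F

def sobel(img):
    # Standard 3x3 Sobel kernels for horizontal and vertical gradients.
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t()
    c = img.shape[1]
    # One (kx, ky) pair per input channel, applied as a grouped convolution.
    k = torch.stack([kx, ky]).unsqueeze(1).repeat(c, 1, 1, 1).to(img)  # (2c, 1, 3, 3)
    return F.conv2d(img, k, padding=1, groups=c)

def loss_l1_and_sobel(i_syn2, i_mid, alpha=1.0):
    loss_l1 = (i_syn2 - i_mid).abs().sum()                  # Loss l1
    loss_sobel = (sobel(i_syn2) - sobel(i_mid)).abs().sum() # Loss sobel
    return loss_l1 + alpha * loss_sobel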
The calculations of the first and second losses above are both directly related to I syn2, but I syn2 is computed from Flow mid→3, so the accuracy of the optical flow computed by the first neural network is also very important. Therefore, in some implementations, optical flow labels can be set up so that the first neural network is trained with supervision.
For example, referring to FIG. 8, a fifth neural network with an optical flow calculation capability (for example, a LiteFlownet network) is pre-trained (i.e., trained before the steps of FIG. 7 are executed); I 3 and I mid are input to the fifth neural network, and the optical flow from the reference video frame to the third video frame computed by the fifth neural network (denoted Flow* mid→3) is taken as the optical flow label (i.e., the ground truth of the intermediate-frame optical flow). Computing the optical flow between two video frames (rather than the optical flow of an intermediate frame between them) is something an existing optical flow calculation network can do.
The third loss is calculated from the difference between the Flow mid→3 computed by the first neural network and Flow* mid→3; its purpose is to improve, through learning, the accuracy of the Flow mid→3 computed by the first neural network, and this loss embodies the transfer of optical flow knowledge from the fifth neural network to the first neural network. In some implementations, the difference between Flow mid→3 and Flow* mid→3 can be defined as the distance (L2 distance) between the optical flow vectors they contain, expressed as:

Loss epe = ∑i∑j ||Flow mid→3(i,j) - Flow* mid→3(i,j)||2

where Flow mid→3(i,j) and Flow* mid→3(i,j) both denote the optical flow vector at pixel position (i,j).
Optionally, if the first neural network includes at least one optical flow calculation module (see FIG. 3 for its structure), and each optical flow calculation module outputs the Flow mid→3 as corrected by that module, computing Flow mid→3 from coarse to fine, then the optical flow label can be used to supervise every optical flow calculation module and improve each module's optical flow calculation capability. Specifically, for each optical flow calculation module, a loss is computed from the difference between the Flow mid→3 it outputs and the Flow* mid→3 computed by the fifth neural network (the calculation may refer to that of the third loss), and these losses are accumulated to obtain the fourth loss. The calculation of the fourth loss is expressed as:

Loss multiscale-epe = ∑ k=1..n ∑i∑j ||Flow k mid→3(i,j) - Flow* mid→3(i,j)||2

where n denotes the total number of optical flow calculation modules and Flow k mid→3 denotes the Flow mid→3 output by the k-th optical flow calculation module.
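A sketch of the third and fourth losses follows; it assumes the label flow Flow* mid→3 has already been resized to each module's output resolution, which the text above does not specify:

import torch

def loss_epe(flow_pred, flow_label):
    # Per-pixel L2 distance between predicted and label flow vectors, summed
    # over all pixel positions; flow tensors have shape (N, 2, H, W).
    return torch.sqrt(((flow_pred - flow_label) ** 2).sum(dim=1)).sum()

def loss_multiscale_epe(flows_per_module, flow_labels):
    # flows_per_module[k]: the Flow mid→3 output by the (k+1)-th flow module;
    # flow_labels[k]: the label flow resized to the matching resolution.
    return sum(loss_epe(f, g) for f, g in zip(flows_per_module, flow_labels))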
Compared with the third loss, the fourth loss likewise realizes the transfer of optical flow knowledge from the fifth neural network to the first neural network, and computing the fourth loss is conducive to adjusting the parameters of every optical flow calculation module more precisely, though it is somewhat more complicated to compute.
Optionally, the inventors found through long-term research that, when the fifth neural network performs optical flow calculation, the optical flow vectors computed at some pixel positions may be inaccurate, for example because of ambiguity at boundaries and in occluded regions. Such optical flow vectors need not be used as labels for the supervised learning of the first neural network; only the more accurately computed optical flow vectors are used as optical flow labels. The specific procedure is as follows:
First, backward mapping is performed on I 3 with the Flow* mid→3 computed by the fifth neural network (forward mapping may of course be used in the forward-mapping variant), obtaining a fifth mapped video frame, denoted warp*(I 3).
Then, whether the optical flow vector computed by the fifth neural network at each pixel position is accurate is determined from the difference between warp*(I 3) and I mid. For example, the mean of the L1 distance between warp*(I 3) and I mid can be computed at each pixel (since the video frames may be multi-channel images, a mean over channels can be taken at each pixel). If the mean L1 distance is greater than a certain threshold, the optical flow vector computed by the fifth neural network at that pixel position is deemed inaccurate; otherwise it is deemed accurate. The accurately computed optical flow vectors may be called first valid optical flow vectors. Experiments show that the first valid optical flow vectors account for the vast majority of the optical flow vectors computed by the fifth neural network, because the fifth neural network essentially computes the intermediate-frame optical flow with the intermediate frame known, so its accuracy can still be guaranteed.
Finally, the third loss or the fourth loss is calculated from the first valid optical flow vectors in the optical flow computed by the fifth neural network:
When calculating the third loss, the calculation is based on the difference between the first valid optical flow vectors in the Flow* mid→3 computed by the fifth neural network and the second valid optical flow vectors in the Flow mid→3 computed by the first neural network, where a second valid optical flow vector is the optical flow vector in the first neural network's Flow mid→3 located at the pixel position corresponding to a first valid optical flow vector. For example, if the optical flow vector at (1,1) in the Flow* mid→3 computed by the fifth neural network is a first valid optical flow vector, then the optical flow vector at (1,1) in the Flow mid→3 computed by the first neural network is a second valid optical flow vector.
When calculating the fourth loss, the calculation is based on the difference between the first valid optical flow vectors in the Flow* mid→3 computed by the fifth neural network and the third valid optical flow vectors in the Flow mid→3 output by each optical flow calculation module of the first neural network (the differences are computed separately and then accumulated), where a third valid optical flow vector is the optical flow vector in each module's output Flow mid→3 located at the pixel position corresponding to a first valid optical flow vector.
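Finally, a sketch of selecting the first valid optical flow vectors and masking the EPE loss accordingly; it reuses backward_warp from the earlier sketch, and the threshold value is an assumption of this example:

import torch

def valid_flow_mask(i3, i_mid, flow_label, threshold=0.05):
    w3_star = backward_warp(i3, flow_label)      # the fifth mapped video frame
    err = (w3_star - i_mid).abs().mean(dim=1)    # per-pixel mean L1 over channels
    return err < threshold                       # True where the label is accurate

def masked_loss_epe(flow_pred, flow_label, mask):
    epe = torch.sqrt(((flow_pred - flow_label) ** 2).sum(dim=1))
    return epe[mask].sum()                       # supervise only the valid positions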
As mentioned above, in some implementations the optical flow calculation modules in the first neural network are obtained by structural migration from the LiteFlownet network (i.e., in step S220, each optical flow calculation module corrects its input optical flow with the descriptor matching unit, sub-pixel refinement unit and regularization module migrated from LiteFlownet). For these implementations, when training the first neural network, the parameters obtained by pre-training the LiteFlownet network can be loaded directly as the initial values of its parameters and then fine-tuned. This kind of transfer learning not only speeds up the convergence of the first neural network but also helps improve its performance. The LiteFlownet network may be pre-trained on, but is not limited to, the FlyingChairs dataset.
In the solution shown in FIG. 8, the second intermediate video frame is produced by fusing the third mapped video frame and the fourth mapped video frame (possibly with further correction), but there are also solutions in which the second intermediate video frame is produced directly from the third mapped video frame or the fourth mapped video frame alone (possibly with further correction). The specific steps of these solutions are as follows:
Scheme C

Step C1: obtaining a training sample including the third video frame, the fourth video frame and the reference video frame;

Step C2: based on the third video frame and the fourth video frame, using the first neural network to calculate the optical flow from the second intermediate video frame to the third video frame;

Step C3: performing backward mapping on the third video frame using the optical flow from the second intermediate video frame to the third video frame to obtain the third mapped video frame;

Step C4: determining the second intermediate video frame according to the third mapped video frame;

Step C5: calculating the prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network according to the prediction loss.

If a neural network (whose structure may refer to the third neural network) is used in step C4 to correct the third mapped video frame, its parameters can be updated together with those of the first neural network in step C5. For other contents of steps C1 to C5, reference may be made to steps S210 to S250, which are not detailed again.
Scheme D

Step D1: obtaining a training sample including the third video frame, the fourth video frame and the reference video frame;

Step D2: based on the third video frame and the fourth video frame, using the first neural network to calculate the optical flow from the second intermediate video frame to the fourth video frame;

Step D3: performing backward mapping on the fourth video frame using the optical flow from the second intermediate video frame to the fourth video frame to obtain the fourth mapped video frame;

Step D4: determining the second intermediate video frame according to the fourth mapped video frame;

Step D5: calculating the prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network according to the prediction loss.

If a neural network (whose structure may refer to the third neural network) is used in step D4 to correct the fourth mapped video frame, its parameters can be updated together with those of the first neural network in step D5. For other contents of steps D1 to D5, reference may be made to steps S210 to S250, which are not detailed again.
It should be pointed out that, if a fifth neural network is provided to supply the optical flow labels, its calculation result should correspond to that of the first neural network. For example, if the first neural network calculates the optical flow from the second intermediate video frame to the third video frame (Scheme C), the fifth neural network should calculate the optical flow between the third video frame and the reference video frame; if the first neural network calculates the optical flow from the second intermediate video frame to the fourth video frame (Scheme D), the fifth neural network should calculate the optical flow between the fourth video frame and the reference video frame; and if the first neural network calculates both the optical flow from the second intermediate video frame to the third video frame and the optical flow from the second intermediate video frame to the fourth video frame (the scheme of FIG. 7), the fifth neural network should calculate the optical flow between the third video frame and the reference video frame as well as the optical flow between the fourth video frame and the reference video frame.
It should also be pointed out that, in the various possible implementations of the model training method, every use of backward mapping may be replaced with forward mapping, in which case the optical flow used for the mapping must be adjusted accordingly. For example, if Flow_mid→3 is used to backward-map the third video frame, then after the replacement Flow_3→mid (the optical flow from the third video frame to the second intermediate video frame) should be used to forward-map the third video frame, and the first neural network should be changed to output Flow_3→mid. Likewise, if Flow_mid→4 is used to backward-map the fourth video frame, then after the replacement Flow_4→mid (the optical flow from the fourth video frame to the second intermediate video frame) should be used to forward-map the fourth video frame, and the first neural network should be changed to output Flow_4→mid.
In addition, in some implementations of the model training method more than one step maps video frames; these steps should either all use backward mapping or all use forward mapping, i.e. the mapping type should be kept consistent throughout the training pipeline.
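To make the distinction between the two mapping types concrete, the following is a minimal, non-limiting sketch of backward mapping as a bilinear warp. The (N, C, H, W) tensor layout, the function name, and the use of `torch.nn.functional.grid_sample` are illustrative assumptions, not a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Sample `frame` at positions displaced by `flow`.

    frame: (N, C, H, W) source video frame.
    flow:  (N, 2, H, W) optical flow from the target (intermediate) frame to `frame`.
    """
    n, _, h, w = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=frame.dtype, device=frame.device),
        torch.arange(w, dtype=frame.dtype, device=frame.device),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # displaced x coordinates
    grid_y = ys.unsqueeze(0) + flow[:, 1]  # displaced y coordinates
    # Normalize to [-1, 1], the range grid_sample expects, x before y.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(frame, grid, align_corners=True)
```

Forward mapping would instead scatter (splat) each source pixel to its flow-displaced target position. Because the two interpret the flow in opposite directions, mixing them within one pipeline would be inconsistent, which is why the mapping type must be kept uniform.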
FIG. 9 shows a functional block diagram of a video frame interpolation apparatus 300 provided by an embodiment of the present application. Referring to FIG. 9, the video frame interpolation apparatus 300 includes:
a first video frame acquisition unit 310, configured to acquire a first video frame and a second video frame;
a first optical flow calculation unit 320, configured to calculate, based on the first video frame and the second video frame and using a first neural network, the optical flow from a first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, where the first intermediate video frame is a video frame to be inserted between the first video frame and the second video frame;
a first backward mapping unit 330, configured to perform backward mapping on the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain a first mapped video frame, and/or perform backward mapping on the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain a second mapped video frame; and
a first intermediate frame determination unit 340, configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
In one implementation of the video frame interpolation apparatus 300, the first neural network includes at least one optical flow calculation module connected in sequence, and the first optical flow calculation unit 320 calculating, based on the first video frame and the second video frame and using the first neural network, the optical flow from the first intermediate video frame to the first video frame includes: determining, according to the first video frame, a first image input to each optical flow calculation module, and determining, according to the second video frame, a second image input to each optical flow calculation module; and using each optical flow calculation module to backward-map the first image and the second image input to that module based on the optical flow input to that module, correct the optical flow input to that module based on the first mapped image and the second mapped image obtained by the mapping, and output the corrected optical flow; where the optical flow input to the first optical flow calculation module is a preset optical flow between the first video frame and the first intermediate video frame, the optical flow input to each other optical flow calculation module is the optical flow output by the preceding module, and the optical flow output by the last optical flow calculation module is the optical flow from the first intermediate video frame to the first video frame calculated by the first neural network.
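By way of a non-limiting illustration, the following sketch shows one possible cascade of such modules, assuming all inputs share a single resolution (the per-level resizing of a real multi-scale design is omitted). Here `backward_warp` is the routine sketched above, each `module` is a hypothetical residual flow predictor rather than a prescribed architecture, and warping the second input with the negated flow follows the midpoint assumption discussed further below.

```python
def cascaded_flow(first_images, second_images, modules, init_flow):
    """Iteratively refine the flow from the intermediate frame to the first frame.

    first_images/second_images: per-module inputs (see the variants below).
    init_flow: preset flow for the first module, e.g. all zeros.
    """
    flow = init_flow
    for img1, img2, module in zip(first_images, second_images, modules):
        warp1 = backward_warp(img1, flow)        # map the first input with the current flow
        warp2 = backward_warp(img2, -flow)       # negated flow for the second input (assumption)
        flow = flow + module(warp1, warp2, flow) # residual correction of the incoming flow
    return flow  # output of the last module: flow mid -> first frame
```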
In one implementation of the video frame interpolation apparatus 300, the first optical flow calculation unit 320 determining, according to the first video frame, the first image input to each optical flow calculation module, and determining, according to the second video frame, the second image input to each optical flow calculation module includes: taking the first video frame as the first image input to each optical flow calculation module and the second video frame as the second image input to each optical flow calculation module; or taking images obtained by downsampling the first video frame as the first images input to the optical flow calculation modules and images obtained by downsampling the second video frame as the second images input to the optical flow calculation modules, where the two downsampled images input to the same optical flow calculation module have the same shape; or taking feature maps output after the first video frame is processed by convolutional layers as the first images input to the optical flow calculation modules and feature maps output after the second video frame is processed by convolutional layers as the second images input to the optical flow calculation modules, where the two feature maps input to the same optical flow calculation module have the same shape.
In one implementation of the video frame interpolation apparatus 300, the first optical flow calculation unit 320 taking the images obtained by downsampling the first video frame as the first images input to the optical flow calculation modules, and the images obtained by downsampling the second video frame as the second images, includes: downsampling the first video frame and the second video frame respectively to form an image pyramid of the first video frame and an image pyramid of the second video frame, where each layer of the image pyramids, starting from the top layer, corresponds to one optical flow calculation module of the first neural network, starting from the first optical flow calculation module; and traversing the two image pyramids layer by layer from the top layer downward, taking the two downsampled images located in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
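A non-limiting sketch of the pyramid construction, assuming bilinear downsampling by a factor of 2 per level; the number of levels is an illustrative choice:

```python
import torch.nn.functional as F

def build_pyramid(frame, num_levels=3):
    """Return pyramid levels coarsest-first, one level per flow calculation module."""
    levels = [frame]
    for _ in range(num_levels - 1):
        levels.append(F.interpolate(levels[-1], scale_factor=0.5,
                                    mode="bilinear", align_corners=False))
    return levels[::-1]  # top (coarsest) layer first, matching module order
```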
In one implementation of the video frame interpolation apparatus 300, the first optical flow calculation unit 320 taking the feature maps output after the first video frame is processed by convolutional layers as the first images, and the feature maps output after the second video frame is processed by convolutional layers as the second images, includes: performing feature extraction on the first video frame and the second video frame respectively using a first feature extraction network, to form a feature pyramid of the first video frame and a feature pyramid of the second video frame, where each layer of the feature pyramids, starting from the top layer, corresponds to one optical flow calculation module of the first neural network, starting from the first optical flow calculation module, and the first feature extraction network is a convolutional neural network; and traversing the two feature pyramids layer by layer from the top layer downward, taking the two feature maps located in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
In one implementation of the video frame interpolation apparatus 300, the first optical flow calculation unit 320 correcting the optical flow input to the optical flow calculation module based on the first mapped image and the second mapped image obtained by the mapping, and outputting the corrected optical flow, includes: predicting an optical flow correction term using a second neural network based on the first mapped image, the second mapped image, and the optical flow input to the module; and correcting the optical flow input to the module using the optical flow correction term and outputting the corrected optical flow.
In another implementation of the video frame interpolation apparatus 300, the correction includes: based on the first mapped image and the second mapped image obtained by the mapping, correcting the optical flow input to the module using the descriptor matching unit, the sub-pixel refinement layer, and the regularization layer of the LiteFlownet network, and outputting the corrected optical flow.
In one implementation of the video frame interpolation apparatus 300, the first optical flow calculation unit 320 calculating, based on the first video frame and the second video frame and using the first neural network, both the optical flow from the first intermediate video frame to the first video frame and the optical flow from the first intermediate video frame to the second video frame includes: calculating the optical flow from the first intermediate video frame to the first video frame using the first neural network, and calculating the optical flow from the first intermediate video frame to the second video frame from it; or calculating the optical flow from the first intermediate video frame to the second video frame using the first neural network, and calculating the optical flow from the first intermediate video frame to the first video frame from it.
In one implementation of the video frame interpolation apparatus 300, the first optical flow calculation unit 320 calculating the optical flow from the first intermediate video frame to the second video frame according to the optical flow from the first intermediate video frame to the first video frame includes: negating the optical flow from the first intermediate video frame to the first video frame and taking the result as the optical flow from the first intermediate video frame to the second video frame; and calculating the optical flow from the first intermediate video frame to the first video frame according to the optical flow from the first intermediate video frame to the second video frame includes: negating the optical flow from the first intermediate video frame to the second video frame and taking the result as the optical flow from the first intermediate video frame to the first video frame.
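The negation rule amounts to a single tensor operation; the one-line sketch below assumes the intermediate frame lies at the temporal midpoint, so motion toward the two inputs is approximately equal and opposite:

```python
# Midpoint assumption: locally linear motion makes the two flows opposite.
flow_mid_to_2 = -flow_mid_to_1   # flow_mid_to_1: (N, 2, H, W) tensor from the first network
```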
In one implementation of the video frame interpolation apparatus 300, the first intermediate frame determination unit 340 determining the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame includes: correcting the first mapped video frame based on the optical flow from the first intermediate video frame to the first video frame to obtain the first intermediate video frame; or correcting the second mapped video frame based on the optical flow from the first intermediate video frame to the second video frame to obtain the first intermediate video frame; or correcting a first fused video frame, formed by fusing the first mapped video frame and the second mapped video frame, based on the optical flow from the first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, to obtain the first intermediate video frame.
In one implementation of the video frame interpolation apparatus 300, the first intermediate frame determination unit 340 correcting the first fused video frame, formed by fusing the first mapped video frame and the second mapped video frame, based on the optical flow from the first intermediate video frame to the first video frame to obtain the first intermediate video frame includes: predicting a first image correction term and a first fusion mask using a third neural network based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame; fusing the first mapped video frame and the second mapped video frame into the first fused video frame as indicated by the pixel values in the first fusion mask; and correcting the first fused video frame using the first image correction term to obtain the first intermediate video frame.
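By way of a non-limiting illustration, the fusion and correction may be sketched as follows; `third_net` is a placeholder for the third neural network, and the sigmoid on the mask and the final clamp are assumptions rather than prescribed details:

```python
import torch

def fuse_and_correct(mapped1, mapped2, flow_mid_to_1, third_net):
    # third_net is assumed to return an image correction term and a
    # single-channel mask logit from its three inputs.
    correction, mask_logit = third_net(mapped1, mapped2, flow_mid_to_1)
    mask = torch.sigmoid(mask_logit)                   # per-pixel fusion weights in [0, 1]
    fused = mask * mapped1 + (1.0 - mask) * mapped2    # first fused video frame
    return (fused + correction).clamp(0.0, 1.0)        # corrected intermediate frame
```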
In one implementation of the video frame interpolation apparatus 300, the third neural network includes a second feature extraction network and an encoder-decoder network, the encoder-decoder network including an encoder and a decoder, and the first intermediate frame determination unit 340 predicting the first image correction term and the first fusion mask using the third neural network based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame includes: performing feature extraction on the first video frame and the second video frame respectively using the second feature extraction network; backward-mapping the feature maps extracted by the second feature extraction network using the optical flow from the first intermediate video frame to the first video frame; inputting the mapped feature maps obtained by the mapping, the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame into the encoder for feature extraction; and predicting the first image correction term and the first fusion mask using the decoder according to the features extracted by the encoder.
In one implementation of the video frame interpolation apparatus 300, the first intermediate frame determination unit 340 determining the first intermediate video frame according to the first mapped video frame and the second mapped video frame includes: predicting a second fusion mask using a fourth neural network based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame; and fusing the first mapped video frame and the second mapped video frame into the first intermediate video frame as indicated by the pixel values in the second fusion mask.
The implementation principle and technical effects of the video frame interpolation apparatus 300 provided by the embodiments of the present application have been described in the foregoing method embodiments; for brevity, where the apparatus embodiments are silent, reference may be made to the corresponding content of the method embodiments.
FIG. 10 shows a functional block diagram of a model training apparatus 400 provided by an embodiment of the present application. Referring to FIG. 10, the model training apparatus 400 includes:
a second video frame acquisition unit 410, configured to acquire training samples, the training samples including a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
a second optical flow calculation unit 420, configured to calculate, based on the third video frame and the fourth video frame and using the first neural network, the optical flow from a second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame, where the second intermediate video frame is a video frame to be inserted between the third video frame and the fourth video frame;
a second backward mapping unit 430, configured to perform backward mapping on the third video frame using the optical flow from the second intermediate video frame to the third video frame to obtain a third mapped video frame, and/or perform backward mapping on the fourth video frame using the optical flow from the second intermediate video frame to the fourth video frame to obtain a fourth mapped video frame;
a second intermediate frame determination unit 440, configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame; and
a parameter update unit 450, configured to calculate a prediction loss according to the second intermediate video frame and the reference video frame, and update the parameters of the first neural network according to the prediction loss.
In one implementation of the model training apparatus 400, the parameter update unit 450 calculating the prediction loss according to the second intermediate video frame and the reference video frame includes: calculating a first loss according to the difference between the second intermediate video frame and the reference video frame; calculating the image gradient of the second intermediate video frame and the image gradient of the reference video frame respectively, and calculating a second loss according to the difference between the two image gradients; and calculating the prediction loss according to the first loss and the second loss.
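A non-limiting sketch of this combined loss, assuming an L1 frame-difference term, finite-difference image gradients, and an illustrative weight `lam`:

```python
import torch

def image_gradients(x):
    gx = x[..., :, 1:] - x[..., :, :-1]   # horizontal finite differences
    gy = x[..., 1:, :] - x[..., :-1, :]   # vertical finite differences
    return gx, gy

def prediction_loss(pred, ref, lam=1.0):
    first_loss = torch.abs(pred - ref).mean()           # frame-difference term
    pgx, pgy = image_gradients(pred)
    rgx, rgy = image_gradients(ref)
    second_loss = (torch.abs(pgx - rgx).mean()
                   + torch.abs(pgy - rgy).mean())       # gradient-difference term
    return first_loss + lam * second_loss               # combined prediction loss
```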
In one implementation of the model training apparatus 400, the parameter update unit 450 calculating the prediction loss according to the second intermediate video frame and the reference video frame includes: calculating a first loss according to the difference between the second intermediate video frame and the reference video frame; calculating, using a pre-trained fifth neural network, the optical flow from the reference video frame to the third video frame and/or the optical flow from the reference video frame to the fourth video frame; calculating a third loss according to the difference between the optical flow calculated by the first neural network and the corresponding optical flow calculated by the fifth neural network; and calculating the prediction loss according to the first loss and the third loss.
In one implementation of the model training apparatus 400, the first neural network includes at least one optical flow calculation module connected in sequence, each optical flow calculation module outputting the optical flow from the second intermediate video frame to the third video frame as corrected by that module; and the parameter update unit 450 calculating the prediction loss according to the second intermediate video frame and the reference video frame includes: calculating a first loss according to the difference between the second intermediate video frame and the reference video frame; calculating, using a pre-trained fifth neural network, the optical flow from the reference video frame to the third video frame; calculating a fourth loss according to the difference between the optical flow output by each optical flow calculation module and the optical flow calculated by the fifth neural network; and calculating the prediction loss according to the first loss and the fourth loss.
In one implementation of the model training apparatus 400, the parameter update unit 450 calculating the third loss according to the difference between the optical flow calculated by the first neural network and the corresponding optical flow calculated by the fifth neural network includes: performing backward mapping on the third video frame using the optical flow calculated by the fifth neural network to obtain a fifth mapped video frame; determining, according to the difference between the fifth mapped video frame and the reference video frame, whether the optical flow vector calculated by the fifth neural network at each pixel position is accurate; and calculating the third loss according to the difference between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the second effective optical flow vectors in the corresponding optical flow calculated by the first neural network, where a first effective optical flow vector is an optical flow vector accurately calculated by the fifth neural network, and a second effective optical flow vector is the optical flow vector, in the corresponding optical flow calculated by the first neural network, located at the pixel position corresponding to a first effective optical flow vector.
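A non-limiting sketch of this masked flow supervision, reusing the `backward_warp` routine sketched earlier; the error threshold is an illustrative assumption:

```python
import torch

def flow_supervision_loss(teacher_flow, student_flow, frame3, ref, thresh=0.02):
    # Warp the third frame with the teacher (fifth network) flow and keep
    # only pixels where the warp reproduces the reference frame well,
    # i.e. the "effective" optical flow vectors.
    warped = backward_warp(frame3, teacher_flow)
    err = torch.abs(warped - ref).mean(dim=1, keepdim=True)   # per-pixel warp error
    valid = (err < thresh).float()                             # validity mask
    diff = torch.abs(teacher_flow - student_flow).sum(dim=1, keepdim=True)
    return (valid * diff).sum() / valid.sum().clamp(min=1.0)   # masked mean difference
```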
In one implementation of the model training apparatus 400, the parameter update unit 450 calculating the fourth loss according to the difference between the optical flow output by each optical flow calculation module and the optical flow calculated by the fifth neural network includes: performing backward mapping on the third video frame using the optical flow calculated by the fifth neural network to obtain a fifth mapped video frame; determining, according to the difference between the fifth mapped video frame and the reference video frame, whether the optical flow vector calculated by the fifth neural network at each pixel position is accurate; and calculating the fourth loss according to the difference between the first effective optical flow vectors in the optical flow calculated by the fifth neural network and the third effective optical flow vectors in the optical flow output by each optical flow calculation module, where a first effective optical flow vector is an optical flow vector accurately calculated by the fifth neural network, and a third effective optical flow vector is the optical flow vector, in the output of each optical flow calculation module, located at the pixel position corresponding to a first effective optical flow vector.
In one implementation of the model training apparatus 400, the first neural network includes at least one optical flow calculation module connected in sequence, each of which corrects the optical flow input to it using the descriptor matching unit, the sub-pixel refinement layer, and the regularization layer of the LiteFlownet network; the apparatus further includes a parameter initialization unit, configured to initialize the parameters of the first neural network with parameters obtained by pre-training the LiteFlownet network, before the second optical flow calculation unit 420 calculates, based on the third video frame and the fourth video frame and using the first neural network, the optical flow from the second intermediate video frame to the third video frame and/or the optical flow from the second intermediate video frame to the fourth video frame.
In one implementation of the model training apparatus 400, the second intermediate frame determination unit 440 determining the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame includes: predicting a second image correction term and a third fusion mask using the third neural network based on the third mapped video frame, the fourth mapped video frame, and the optical flow from the second intermediate video frame to the third video frame; fusing the third mapped video frame and the fourth mapped video frame into a second fused video frame as indicated by the pixel values in the third fusion mask; and correcting the second fused video frame using the second image correction term to obtain the second intermediate video frame. Accordingly, the parameter update unit 450 calculating the prediction loss according to the second intermediate video frame and the reference video frame and updating the parameters of the first neural network according to the prediction loss includes: calculating the prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network and the third neural network according to the prediction loss.
In one implementation of the model training apparatus 400, the second intermediate frame determination unit 440 determining the second intermediate video frame according to the third mapped video frame and the fourth mapped video frame includes: predicting a fourth fusion mask using the fourth neural network based on the third mapped video frame, the fourth mapped video frame, and the optical flow from the second intermediate video frame to the third video frame; and fusing the third mapped video frame and the fourth mapped video frame into the second intermediate video frame as indicated by the pixel values in the fourth fusion mask. Accordingly, the parameter update unit 450 calculating the prediction loss and updating the parameters includes: calculating the prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network and the fourth neural network according to the prediction loss.
The implementation principle and technical effects of the model training apparatus 400 provided by the embodiments of the present application have been described in the foregoing method embodiments; for brevity, where the apparatus embodiments are silent, reference may be made to the corresponding content of the method embodiments.
An embodiment of the present application further provides a video frame interpolation apparatus, including:
a third video frame acquisition unit, configured to acquire a first video frame and a second video frame;
a third optical flow calculation unit, configured to calculate, based on the first video frame and the second video frame and using the first neural network, the optical flow from the first video frame to a first intermediate video frame and/or the optical flow from the second video frame to the first intermediate video frame, where the first intermediate video frame is a video frame to be inserted between the first video frame and the second video frame;
a first forward mapping unit, configured to perform forward mapping on the first video frame using the optical flow from the first video frame to the first intermediate video frame to obtain a first mapped video frame, and/or perform forward mapping on the second video frame using the optical flow from the second video frame to the first intermediate video frame to obtain a second mapped video frame; and
a third intermediate frame determination unit, configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
The above video frame interpolation apparatus is similar to the video frame interpolation apparatus 300; the main difference is that forward mapping replaces the backward mapping of the video frame interpolation apparatus 300. For the various possible implementations of this apparatus, reference may also be made to the video frame interpolation apparatus 300, which are not repeated here.
An embodiment of the present application further provides a model training apparatus, including:
a fourth video frame acquisition unit, configured to acquire training samples, the training samples including a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
a fourth optical flow calculation unit, configured to calculate, based on the third video frame and the fourth video frame and using the first neural network, the optical flow from the third video frame to a second intermediate video frame and/or the optical flow from the fourth video frame to the second intermediate video frame, where the second intermediate video frame is a video frame to be inserted between the third video frame and the fourth video frame;
a second forward mapping unit, configured to perform forward mapping on the third video frame using the optical flow from the third video frame to the second intermediate video frame to obtain a third mapped video frame, and/or perform forward mapping on the fourth video frame using the optical flow from the fourth video frame to the second intermediate video frame to obtain a fourth mapped video frame;
a fourth intermediate frame determination unit, configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame; and
a second parameter update unit, configured to calculate a prediction loss according to the second intermediate video frame and the reference video frame, and update the parameters of the first neural network according to the prediction loss.
The above model training apparatus is similar to the model training apparatus 400; the main difference is that forward mapping replaces the backward mapping of the model training apparatus 400. For the various possible implementations of this apparatus, reference may also be made to the model training apparatus 400, which are not repeated here.
FIG. 11 shows a possible structure of an electronic device 500 provided by an embodiment of the present application. Referring to FIG. 11, the electronic device 500 includes a processor 510, a memory 520, and a communication interface 530, which are interconnected and communicate with one another through a communication bus 540 and/or other forms of connection mechanism (not shown).
The memory 520 includes one or more memories (only one is shown in the figure), which may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or the like. The processor 510 and possibly other components may access the memory 520 to read and/or write data therein.
The processor 510 includes one or more processors (only one is shown in the figure), which may be an integrated circuit chip with signal processing capability. The processor 510 may be a general-purpose processor, including a central processing unit (CPU), a micro controller unit (MCU), a network processor (NP), or another conventional processor; it may also be a special-purpose processor, including a graphics processing unit (GPU), a neural-network processing unit (NPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Moreover, when there are multiple processors 510, some of them may be general-purpose processors and the others special-purpose processors.
The communication interface 530 includes one or more interfaces (only one is shown in the figure), which may be used to communicate directly or indirectly with other devices for data exchange. The communication interface 530 may include interfaces for wired and/or wireless communication.
One or more computer program instructions may be stored in the memory 520, and the processor 510 may read and run these computer program instructions to implement the video frame interpolation method and/or the model training method provided by the embodiments of the present application.
It can be understood that the structure shown in FIG. 11 is merely illustrative; the electronic device 500 may include more or fewer components than those shown in FIG. 11, or have a configuration different from that shown in FIG. 11. The components shown in FIG. 11 may be implemented in hardware, software, or a combination thereof. The electronic device 500 may be a physical device, such as a PC, a laptop, a tablet, a mobile phone, a server, or an embedded device, or a virtual device, such as a virtual machine or a virtualized container. Moreover, the electronic device 500 is not limited to a single device and may also be a combination of multiple devices or a cluster composed of a large number of devices.
An embodiment of the present application further provides a computer-readable storage medium storing computer program instructions that, when read and run by a processor of a computer, execute the video frame interpolation method provided by the embodiments of the present application. For example, the computer-readable storage medium may be implemented as the memory 520 in the electronic device 500 of FIG. 11.
The foregoing descriptions are merely embodiments of the present application and are not intended to limit its scope of protection. Various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall fall within its scope of protection.
Industrial Applicability
The present application provides a video frame interpolation method, a model training method, and corresponding apparatuses, which are applied to video processing and achieve a good frame interpolation effect while keeping the interpolation process fast.

Claims (20)

  1. A video frame interpolation method, characterized by comprising:
    acquiring a first video frame and a second video frame;
    calculating, based on the first video frame and the second video frame and using a first neural network, an optical flow from a first intermediate video frame to the first video frame and/or an optical flow from the first intermediate video frame to the second video frame, wherein the first intermediate video frame is a video frame to be inserted between the first video frame and the second video frame;
    performing backward mapping on the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain a first mapped video frame, and/or performing backward mapping on the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain a second mapped video frame; and
    determining the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
  2. The video frame interpolation method according to claim 1, characterized in that the first neural network comprises at least one optical flow calculation module connected in sequence, and calculating, based on the first video frame and the second video frame and using the first neural network, the optical flow from the first intermediate video frame to the first video frame comprises:
    determining, according to the first video frame, a first image input to each optical flow calculation module, and determining, according to the second video frame, a second image input to each optical flow calculation module; and
    using each optical flow calculation module to backward-map the first image and the second image input to that module based on the optical flow input to that module, correct the optical flow input to that module based on the first mapped image and the second mapped image obtained by the mapping, and output the corrected optical flow,
    wherein the optical flow input to the first optical flow calculation module is a preset optical flow between the first video frame and the first intermediate video frame, the optical flow input to each other optical flow calculation module is the optical flow output by the preceding optical flow calculation module, and the optical flow output by the last optical flow calculation module is the optical flow from the first intermediate video frame to the first video frame calculated by the first neural network.
  3. The video frame interpolation method according to claim 2, characterized in that determining, according to the first video frame, the first image input to each optical flow calculation module, and determining, according to the second video frame, the second image input to each optical flow calculation module comprises:
    taking the first video frame as the first image input to each optical flow calculation module, and taking the second video frame as the second image input to each optical flow calculation module; or
    taking images obtained by downsampling the first video frame as the first images input to the optical flow calculation modules, and taking images obtained by downsampling the second video frame as the second images input to the optical flow calculation modules, wherein the two downsampled images input to the same optical flow calculation module have the same shape; or
    taking feature maps output after the first video frame is processed by convolutional layers as the first images input to the optical flow calculation modules, and taking feature maps output after the second video frame is processed by convolutional layers as the second images input to the optical flow calculation modules, wherein the two feature maps input to the same optical flow calculation module have the same shape.
  4. The video frame interpolation method according to claim 3, characterized in that taking the images obtained by downsampling the first video frame as the first images input to the optical flow calculation modules, and taking the images obtained by downsampling the second video frame as the second images input to the optical flow calculation modules comprises:
    downsampling the first video frame and the second video frame respectively to form an image pyramid of the first video frame and an image pyramid of the second video frame, each layer of the image pyramids, starting from the top layer, corresponding to one optical flow calculation module of the first neural network, starting from the first optical flow calculation module; and
    traversing the two image pyramids layer by layer from the top layer downward, and taking the two downsampled images located in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
  5. The video frame interpolation method according to claim 3, characterized in that taking the feature maps output after the first video frame is processed by convolutional layers as the first images input to the optical flow calculation modules, and taking the feature maps output after the second video frame is processed by convolutional layers as the second images input to the optical flow calculation modules comprises:
    performing feature extraction on the first video frame and the second video frame respectively using a first feature extraction network to form a feature pyramid of the first video frame and a feature pyramid of the second video frame, each layer of the feature pyramids, starting from the top layer, corresponding to one optical flow calculation module of the first neural network, starting from the first optical flow calculation module, wherein the first feature extraction network is a convolutional neural network; and
    traversing the two feature pyramids layer by layer from the top layer downward, and taking the two feature maps located in the same layer as the first image and the second image input to the optical flow calculation module corresponding to that layer.
  6. The video frame interpolation method according to any one of claims 2-5, characterized in that correcting the optical flow input to the optical flow calculation module based on the first mapped image and the second mapped image obtained by the mapping, and outputting the corrected optical flow comprises:
    based on the first mapped image, the second mapped image, and the optical flow input to the optical flow calculation module, correcting the optical flow input to the module using an optical flow correction term predicted by a second neural network, or using the descriptor matching unit, the sub-pixel refinement layer, and the regularization layer of the LiteFlownet network, and outputting the corrected optical flow.
  7. The video frame interpolation method according to any one of claims 1-6, characterized in that calculating, based on the first video frame and the second video frame and using the first neural network, the optical flow from the first intermediate video frame to the first video frame and the optical flow from the first intermediate video frame to the second video frame comprises:
    calculating the optical flow from the first intermediate video frame to the first video frame using the first neural network, and calculating the optical flow from the first intermediate video frame to the second video frame according to the optical flow from the first intermediate video frame to the first video frame; or
    calculating the optical flow from the first intermediate video frame to the second video frame using the first neural network, and calculating the optical flow from the first intermediate video frame to the first video frame according to the optical flow from the first intermediate video frame to the second video frame.
  8. The video frame interpolation method according to claim 7, characterized in that calculating the optical flow from the first intermediate video frame to the second video frame according to the optical flow from the first intermediate video frame to the first video frame comprises:
    negating the optical flow from the first intermediate video frame to the first video frame and taking the result as the optical flow from the first intermediate video frame to the second video frame; and
    calculating the optical flow from the first intermediate video frame to the first video frame according to the optical flow from the first intermediate video frame to the second video frame comprises:
    negating the optical flow from the first intermediate video frame to the second video frame and taking the result as the optical flow from the first intermediate video frame to the first video frame.
  9. The video frame interpolation method according to any one of claims 1-8, characterized in that determining the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame comprises:
    correcting the first mapped video frame based on the optical flow from the first intermediate video frame to the first video frame to obtain the first intermediate video frame; or
    correcting the second mapped video frame based on the optical flow from the first intermediate video frame to the second video frame to obtain the first intermediate video frame; or
    correcting a first fused video frame, formed by fusing the first mapped video frame and the second mapped video frame, based on the optical flow from the first intermediate video frame to the first video frame and/or the optical flow from the first intermediate video frame to the second video frame, to obtain the first intermediate video frame.
10. The video frame interpolation method according to claim 9, wherein the correcting, based on the optical flow from the first intermediate video frame to the first video frame, the first fused video frame formed by fusing the first mapped video frame and the second mapped video frame to obtain the first intermediate video frame comprises:
    predicting, by a third neural network, a first image correction term and a first fusion mask based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame;
    fusing the first mapped video frame and the second mapped video frame into the first fused video frame as indicated by the pixel values of the first fusion mask;
    correcting the first fused video frame with the first image correction term to obtain the first intermediate video frame.
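A minimal sketch of the fusion-and-correction step of claim 10, assuming the first fusion mask takes values in [0, 1] (e.g. via a sigmoid) and the first image correction term is an additive residual; the function and tensor names are hypothetical.

    import torch

    def fuse_and_correct(mapped_first, mapped_second, mask, correction):
        # Pixels where the mask is near 1 are drawn from the first mapped video
        # frame, pixels where it is near 0 from the second mapped video frame.
        fused = mask * mapped_first + (1.0 - mask) * mapped_second
        # The image correction term then repairs occlusions and blending artifacts.
        return (fused + correction).clamp(0.0, 1.0)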
11. The video frame interpolation method according to claim 10, wherein the third neural network comprises a second feature extraction network and an encoder-decoder network, the encoder-decoder network comprising an encoder and a decoder, and wherein the predicting, by the third neural network, the first image correction term and the first fusion mask based on the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame comprises:
    extracting features from the first video frame and the second video frame respectively with the second feature extraction network;
    backward-mapping the feature maps extracted by the second feature extraction network using the optical flow from the first intermediate video frame to the first video frame;
    inputting the mapped feature maps, the first mapped video frame, the second mapped video frame, and the optical flow from the first intermediate video frame to the first video frame into the encoder for feature extraction;
    predicting, by the decoder, the first image correction term and the first fusion mask from the features extracted by the encoder.
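Backward mapping, as used throughout these claims, is commonly realized with bilinear sampling. The sketch below uses torch.nn.functional.grid_sample and assumes the flow is expressed in pixel units and points from the intermediate frame towards the source frame; it is an illustrative implementation, not necessarily the patented one.

    import torch
    import torch.nn.functional as F

    def backward_warp(source, flow):
        # source: (N, C, H, W) image or feature map; flow: (N, 2, H, W) in pixels,
        # channel 0 holding the horizontal and channel 1 the vertical displacement.
        n, _, h, w = source.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys), dim=0).unsqueeze(0).to(source)  # (1, 2, H, W)
        coords = base + flow
        # grid_sample expects sampling positions normalized to [-1, 1].
        gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
        gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)  # (N, H, W, 2)
        return F.grid_sample(source, grid, mode="bilinear", align_corners=True)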
12. A model training method, comprising:
    acquiring a training sample, the training sample comprising a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
    calculating, by a first neural network based on the third video frame and the fourth video frame, an optical flow from a second intermediate video frame to the third video frame and/or an optical flow from the second intermediate video frame to the fourth video frame, the second intermediate video frame being a video frame to be inserted between the third video frame and the fourth video frame;
    backward-mapping the third video frame using the optical flow from the second intermediate video frame to the third video frame to obtain a third mapped video frame, and/or backward-mapping the fourth video frame using the optical flow from the second intermediate video frame to the fourth video frame to obtain a fourth mapped video frame;
    determining the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame;
    calculating a prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network according to the prediction loss.
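A minimal training step with the structure of claim 12, assuming an L1 prediction loss and an Adam optimizer; flow_net and synthesize are hypothetical stand-ins for the first neural network and the frame-synthesis stage of claims 9 and 10, and backward_warp is the sketch given after claim 11.

    import torch

    def train_step(flow_net, synthesize, optimizer, third_frame, fourth_frame, reference):
        # First neural network: flows from the second intermediate video frame
        # to the third and fourth video frames.
        flow_t_to_third, flow_t_to_fourth = flow_net(third_frame, fourth_frame)
        # Backward-map both input frames to the intermediate time instant.
        mapped_third = backward_warp(third_frame, flow_t_to_third)
        mapped_fourth = backward_warp(fourth_frame, flow_t_to_fourth)
        # Determine the second intermediate video frame from the mapped frames.
        prediction = synthesize(mapped_third, mapped_fourth)
        # Prediction loss against the reference frame (here a plain L1 distance).
        loss = (prediction - reference).abs().mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()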
13. The model training method according to claim 12, wherein the calculating the prediction loss according to the second intermediate video frame and the reference video frame comprises:
    calculating a first loss according to the difference between the second intermediate video frame and the reference video frame;
    calculating the image gradient of the second intermediate video frame and the image gradient of the reference video frame respectively, and calculating a second loss according to the difference between the two image gradients;
    calculating the prediction loss according to the first loss and the second loss.
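One plausible reading of claim 13's second loss is an L1 distance between finite-difference image gradients; the weight combining it with the first loss below is illustrative only, not taken from the patent.

    import torch

    def gradient_loss(prediction, reference):
        # Horizontal and vertical finite differences approximate the image gradient.
        def grads(img):
            return img[..., :, 1:] - img[..., :, :-1], img[..., 1:, :] - img[..., :-1, :]
        pgx, pgy = grads(prediction)
        rgx, rgy = grads(reference)
        return (pgx - rgx).abs().mean() + (pgy - rgy).abs().mean()

    # Hypothetical combination of the two losses:
    # prediction_loss = first_loss + 0.1 * gradient_loss(prediction, reference)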
14. The model training method according to claim 12, wherein the calculating the prediction loss according to the second intermediate video frame and the reference video frame comprises:
    calculating a first loss according to the difference between the second intermediate video frame and the reference video frame;
    calculating, by a pre-trained fifth neural network, an optical flow from the reference video frame to the third video frame and/or an optical flow from the reference video frame to the fourth video frame;
    calculating a third loss according to the difference between the optical flow calculated by the first neural network and the corresponding optical flow calculated by the fifth neural network;
    calculating the prediction loss according to the first loss and the third loss.
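Claim 14's third loss can be read as knowledge distillation: the trainable first neural network is pulled towards the flow produced by the pre-trained fifth neural network. A sketch, assuming both networks emit flows of the same resolution and sign convention; the names are hypothetical.

    import torch

    def distillation_loss(student_flow, teacher_flow):
        # Detach the pre-trained (fifth) network's output so that gradients
        # update only the first neural network.
        return (student_flow - teacher_flow.detach()).abs().mean()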
15. A video frame interpolation method, comprising:
    acquiring a first video frame and a second video frame;
    calculating, by a first neural network based on the first video frame and the second video frame, an optical flow from the first video frame to a first intermediate video frame and/or an optical flow from the second video frame to the first intermediate video frame, the first intermediate video frame being a video frame to be inserted between the first video frame and the second video frame;
    forward-mapping the first video frame using the optical flow from the first video frame to the first intermediate video frame to obtain a first mapped video frame, and/or forward-mapping the second video frame using the optical flow from the second video frame to the first intermediate video frame to obtain a second mapped video frame;
    determining the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
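Forward mapping (claims 15 and 16) scatters source pixels to their flow-displaced positions instead of gathering from them. Below is a deliberately naive nearest-neighbor splatting sketch; practical systems usually use bilinear or softmax splatting to handle collisions and holes, and none of these names come from the patent.

    import torch

    def forward_warp_nearest(source, flow):
        # source: (N, C, H, W); flow: (N, 2, H, W) in pixels, from the source frame
        # towards the intermediate frame. Colliding pixels resolve last-write-wins;
        # unreached target pixels stay zero (holes).
        n, c, h, w = source.shape
        out = torch.zeros_like(source)
        ys, xs = torch.meshgrid(
            torch.arange(h, device=source.device),
            torch.arange(w, device=source.device),
            indexing="ij",
        )
        tx = (xs.unsqueeze(0) + flow[:, 0]).round().long().clamp(0, w - 1)
        ty = (ys.unsqueeze(0) + flow[:, 1]).round().long().clamp(0, h - 1)
        for b in range(n):
            out[b, :, ty[b].reshape(-1), tx[b].reshape(-1)] = source[b].reshape(c, -1)
        return out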
16. A model training method, comprising:
    acquiring a training sample, the training sample comprising a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
    calculating, by a first neural network based on the third video frame and the fourth video frame, an optical flow from the third video frame to a second intermediate video frame and/or an optical flow from the fourth video frame to the second intermediate video frame, the second intermediate video frame being a video frame to be inserted between the third video frame and the fourth video frame;
    forward-mapping the third video frame using the optical flow from the third video frame to the second intermediate video frame to obtain a third mapped video frame, and/or forward-mapping the fourth video frame using the optical flow from the fourth video frame to the second intermediate video frame to obtain a fourth mapped video frame;
    determining the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame;
    calculating a prediction loss according to the second intermediate video frame and the reference video frame, and updating the parameters of the first neural network according to the prediction loss.
17. A video frame interpolation apparatus, comprising:
    a first video frame acquisition unit configured to acquire a first video frame and a second video frame;
    a first optical flow calculation unit configured to calculate, by a first neural network based on the first video frame and the second video frame, an optical flow from a first intermediate video frame to the first video frame and/or an optical flow from the first intermediate video frame to the second video frame, the first intermediate video frame being a video frame to be inserted between the first video frame and the second video frame;
    a first backward mapping unit configured to backward-map the first video frame using the optical flow from the first intermediate video frame to the first video frame to obtain a first mapped video frame, and/or backward-map the second video frame using the optical flow from the first intermediate video frame to the second video frame to obtain a second mapped video frame;
    a first intermediate frame determination unit configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
18. A model training apparatus, comprising:
    a second video frame acquisition unit configured to acquire a training sample, the training sample comprising a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
    a second optical flow calculation unit configured to calculate, by a first neural network based on the third video frame and the fourth video frame, an optical flow from a second intermediate video frame to the third video frame and/or an optical flow from the second intermediate video frame to the fourth video frame, the second intermediate video frame being a video frame to be inserted between the third video frame and the fourth video frame;
    a second backward mapping unit configured to backward-map the third video frame using the optical flow from the second intermediate video frame to the third video frame to obtain a third mapped video frame, and/or backward-map the fourth video frame using the optical flow from the second intermediate video frame to the fourth video frame to obtain a fourth mapped video frame;
    a second intermediate frame determination unit configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame;
    a first parameter update unit configured to calculate a prediction loss according to the second intermediate video frame and the reference video frame, and to update the parameters of the first neural network according to the prediction loss.
19. A video frame interpolation apparatus, comprising:
    a third video frame acquisition unit configured to acquire a first video frame and a second video frame;
    a third optical flow calculation unit configured to calculate, by a first neural network based on the first video frame and the second video frame, an optical flow from the first video frame to a first intermediate video frame and/or an optical flow from the second video frame to the first intermediate video frame, the first intermediate video frame being a video frame to be inserted between the first video frame and the second video frame;
    a first forward mapping unit configured to forward-map the first video frame using the optical flow from the first video frame to the first intermediate video frame to obtain a first mapped video frame, and/or forward-map the second video frame using the optical flow from the second video frame to the first intermediate video frame to obtain a second mapped video frame;
    a third intermediate frame determination unit configured to determine the first intermediate video frame according to the first mapped video frame and/or the second mapped video frame.
20. A model training apparatus, comprising:
    a fourth video frame acquisition unit configured to acquire a training sample, the training sample comprising a third video frame, a fourth video frame, and a reference video frame located between the third video frame and the fourth video frame;
    a fourth optical flow calculation unit configured to calculate, by a first neural network based on the third video frame and the fourth video frame, an optical flow from the third video frame to a second intermediate video frame and/or an optical flow from the fourth video frame to the second intermediate video frame, the second intermediate video frame being a video frame to be inserted between the third video frame and the fourth video frame;
    a second forward mapping unit configured to forward-map the third video frame using the optical flow from the third video frame to the second intermediate video frame to obtain a third mapped video frame, and/or forward-map the fourth video frame using the optical flow from the fourth video frame to the second intermediate video frame to obtain a fourth mapped video frame;
    a fourth intermediate frame determination unit configured to determine the second intermediate video frame according to the third mapped video frame and/or the fourth mapped video frame;
    a second parameter update unit configured to calculate a prediction loss according to the second intermediate video frame and the reference video frame, and to update the parameters of the first neural network according to the prediction loss.
PCT/CN2021/085220 2020-08-13 2021-04-02 Video frame interpolation method, model training method, and corresponding device WO2022033048A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010815538.3A CN112104830B (en) 2020-08-13 2020-08-13 Video frame insertion method, model training method and corresponding device
CN202010815538.3 2020-08-13

Publications (1)

Publication Number Publication Date
WO2022033048A1 2022-02-17

Family

ID=73753716

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/085220 WO2022033048A1 (en) 2020-08-13 2021-04-02 Video frame interpolation method, model training method, and corresponding device

Country Status (2)

Country Link
CN (1) CN112104830B (en)
WO (1) WO2022033048A1 (en)


Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11689693B2 (en) * 2020-04-30 2023-06-27 Boe Technology Group Co., Ltd. Video frame interpolation method and device, computer readable storage medium
CN112104830B (en) * 2020-08-13 2022-09-27 北京迈格威科技有限公司 Video frame insertion method, model training method and corresponding device
CN112954395B (en) * 2021-02-03 2022-05-17 南开大学 Video frame interpolation method and system capable of inserting any frame rate
CN113132664B (en) * 2021-04-19 2022-10-04 科大讯飞股份有限公司 Frame interpolation generation model construction method and video frame interpolation method
CN112995715B (en) * 2021-04-20 2021-09-03 腾讯科技(深圳)有限公司 Video frame insertion processing method and device, electronic equipment and storage medium
CN113298728B (en) * 2021-05-21 2023-01-24 中国科学院深圳先进技术研究院 Video optimization method and device, terminal equipment and storage medium
CN113542651B (en) * 2021-05-28 2023-10-27 爱芯元智半导体(宁波)有限公司 Model training method, video frame inserting method and corresponding devices
CN113469880A (en) * 2021-05-28 2021-10-01 北京迈格威科技有限公司 Image splicing method and device, storage medium and electronic equipment
CN113382247B (en) * 2021-06-09 2022-10-18 西安电子科技大学 Video compression sensing system and method based on interval observation, equipment and storage medium
CN113556582A (en) * 2021-07-30 2021-10-26 海宁奕斯伟集成电路设计有限公司 Video data processing method, device, equipment and storage medium
CN115706810A (en) * 2021-08-16 2023-02-17 北京字跳网络技术有限公司 Video frame adjusting method and device, electronic equipment and storage medium
CN113469930B (en) * 2021-09-06 2021-12-07 腾讯科技(深圳)有限公司 Image processing method and device and computer equipment
CN113837136B (en) * 2021-09-29 2022-12-23 深圳市慧鲤科技有限公司 Video frame insertion method and device, electronic equipment and storage medium
CN113935537A (en) * 2021-10-22 2022-01-14 北京华云星地通科技有限公司 Cloud image interpolation prediction method and system based on deep learning
CN114007135B (en) * 2021-10-29 2023-04-18 广州华多网络科技有限公司 Video frame insertion method and device, equipment, medium and product thereof
CN113891027B (en) * 2021-12-06 2022-03-15 深圳思谋信息科技有限公司 Video frame insertion model training method and device, computer equipment and storage medium
CN114339409B (en) * 2021-12-09 2023-06-20 腾讯科技(上海)有限公司 Video processing method, device, computer equipment and storage medium
CN114422852A (en) * 2021-12-16 2022-04-29 阿里巴巴(中国)有限公司 Video playing method, storage medium, processor and system
CN116684662A (en) * 2022-02-22 2023-09-01 北京字跳网络技术有限公司 Video processing method, device, equipment and medium
CN114640885B (en) * 2022-02-24 2023-12-22 影石创新科技股份有限公司 Video frame inserting method, training device and electronic equipment
CN114862688A (en) * 2022-03-14 2022-08-05 杭州群核信息技术有限公司 Video frame insertion method, device and system based on deep learning
CN115103147A (en) * 2022-06-24 2022-09-23 马上消费金融股份有限公司 Intermediate frame image generation method, model training method and device


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101123008B1 (en) * 2010-11-16 2012-03-16 알피니언메디칼시스템 주식회사 Method for imaging color flow images, ultrasound apparatus therefor
CN111405316A (en) * 2020-03-12 2020-07-10 北京奇艺世纪科技有限公司 Frame insertion method, electronic device and readable storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304755A (en) * 2017-03-08 2018-07-20 腾讯科技(深圳)有限公司 The training method and device of neural network model for image procossing
US20190138889A1 (en) * 2017-11-06 2019-05-09 Nvidia Corporation Multi-frame video interpolation using optical flow
CN109068174A (en) * 2018-09-12 2018-12-21 上海交通大学 Video frame rate upconversion method and system based on cyclic convolution neural network
CN109379550A (en) * 2018-09-12 2019-02-22 上海交通大学 Video frame rate upconversion method and system based on convolutional neural networks
CN109922231A (en) * 2019-02-01 2019-06-21 重庆爱奇艺智能科技有限公司 A kind of method and apparatus for generating the interleave image of video
CN109905624A (en) * 2019-03-01 2019-06-18 北京大学深圳研究生院 A kind of video frame interpolation method, device and equipment
CN110798630A (en) * 2019-10-30 2020-02-14 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN112104830A (en) * 2020-08-13 2020-12-18 北京迈格威科技有限公司 Video frame insertion method, model training method and corresponding device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115883869A (en) * 2022-11-28 2023-03-31 江汉大学 Swin Transformer-based video frame interpolation model processing method, device and equipment
CN115883869B (en) * 2022-11-28 2024-04-19 江汉大学 Processing method, device and processing equipment for a video frame interpolation model based on the Swin Transformer
CN117241065B (en) * 2023-11-14 2024-03-08 腾讯科技(深圳)有限公司 Video frame interpolation image generation method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112104830A (en) 2020-12-18
CN112104830B (en) 2022-09-27

Similar Documents

Publication Publication Date Title
WO2022033048A1 (en) Video frame interpolation method, model training method, and corresponding device
CN110324664B (en) Video frame supplementing method based on neural network and training method of model thereof
JP7177062B2 (en) Depth Prediction from Image Data Using Statistical Model
WO2021073493A1 (en) Image processing method and device, neural network training method, image processing method of combined neural network model, construction method of combined neural network model, neural network processor and storage medium
CN113542651B (en) Model training method, video frame inserting method and corresponding devices
EP3613018A1 (en) Visual style transfer of images
CN110782490A (en) Video depth map estimation method and device with space-time consistency
CN113034380A (en) Video space-time super-resolution method and device based on improved deformable convolution correction
CN113066017B (en) Image enhancement method, model training method and equipment
CN112561978B (en) Training method of depth estimation network, depth estimation method of image and equipment
WO2023103576A1 (en) Video processing method and apparatus, and computer device and storage medium
WO2023193474A1 (en) Information processing method and apparatus, computer device, and storage medium
CN115578515B (en) Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device
WO2023160426A1 (en) Video frame interpolation method and apparatus, training method and apparatus, and electronic device
CN115002379B (en) Video frame inserting method, training device, electronic equipment and storage medium
CN114073071A (en) Video frame insertion method and device and computer readable storage medium
CN113538525B (en) Optical flow estimation method, model training method and corresponding devices
CN116071412A (en) Unsupervised monocular depth estimation method integrating full-scale and adjacent frame characteristic information
CN111968208A (en) Human body animation synthesis method based on human body soft tissue grid model
WO2023193491A1 (en) Information processing method and apparatus, and computer device and storage medium
CN114119678A (en) Optical flow estimation method, computer program product, storage medium, and electronic device
CN113469880A (en) Image splicing method and device, storage medium and electronic equipment
Zhu et al. Fused network for view synthesis
CN116012230B (en) Space-time video super-resolution method, device, equipment and storage medium
CN116246026B (en) Training method of three-dimensional reconstruction model, three-dimensional scene rendering method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21855104

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13-07-2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21855104

Country of ref document: EP

Kind code of ref document: A1