CN116193161A - Video frame inserting method, device and storage medium

Video frame inserting method, device and storage medium

Info

Publication number
CN116193161A
Authority
CN
China
Prior art keywords
image
optical flow
frame
video
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310196193.1A
Other languages
Chinese (zh)
Inventor
王利媛
陶子豪
徐伟
林昊
张文锋
林华春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Merchants Union Consumer Finance Co Ltd
Original Assignee
Merchants Union Consumer Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Merchants Union Consumer Finance Co Ltd
Priority to CN202310196193.1A
Publication of CN116193161A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/234381 Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements, by altering the temporal resolution, e.g. decreasing the frame rate by frame skipping
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/440281 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display, by altering the temporal resolution, e.g. by frame skipping

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The embodiment of the application provides a video frame inserting method, a device and a storage medium. The method adopts real-time intermediate flow estimation: for different target moments, the time information is used as a network input, so that the network learns the intermediate-frame optical flow directly end to end, reducing the extra overhead caused by bidirectional flow estimation. During model prediction, pyramid feature images of different sizes are extracted and features at different scales are decoded, refining step by step from a coarse optical flow at the lowest resolution to a fine optical flow at the highest resolution, which improves the quality of the intermediate-frame optical flow. Finally, the optical flow estimated from each layer of feature images is used as the input of the next-layer decoding network to provide anchor information for that network, promoting the estimation of the intermediate optical flow at each layer and improving the final frame interpolation effect.

Description

Video frame inserting method, device and storage medium
Technical Field
The application relates to the field of multimedia technology, and further relates to the application of artificial intelligence (Artificial Intelligence, AI) in the field of multimedia technology, in particular to a video frame inserting method, a video frame inserting device and a storage medium.
Background
With the rise of mobile internet technology, video software in various forms has opened up a video era for audiences and given video new meaning. Two important attributes of a video are its resolution and its frame rate, which represent the spatial resolution and the temporal resolution, respectively. The frame rate represents the number of frames the video plays per second. Low-frame-rate video can stutter and jump, which strains the eyes and harms the viewing experience, whereas high-frame-rate video contains more information per second and looks smoother. With the continuous spread of high-definition, high-refresh-rate display devices, the demand for video quality keeps rising, frame-rate improvement has become a research hotspot at home and abroad, and video frame generation is a key technology for achieving it.
In the related art, frame interpolation technology is generally adopted to realize splicing and transition of video materials. Video frame interpolation (Video Frame Interpolation, VFI), also called frame rate conversion (Frame Rate Conversion), adds one or more frames (intermediate frame images) between every two consecutive frames of the original video, shortening the display time between frames, thereby improving the smoothness of the video and achieving better visual and perceptual effects.
In the process of interpolating frames into a video, optical flow estimation needs to be performed for the intermediate frames. Most existing video frame interpolation algorithms use bidirectional flow estimation: the bidirectional optical flows between the two frames to be interpolated are estimated first, then the optical flows are reversed, and the intermediate-frame optical flow is obtained by linear combination. However, these methods are time-consuming and prone to artifacts in motion boundary areas. In addition, existing video frame interpolation algorithms usually use a single encoding-decoding pass and often fail to handle large motions well.
Therefore, how to improve the effect of video interpolation is a problem to be solved.
Disclosure of Invention
The embodiment of the application provides a video frame inserting method, a video frame inserting device and a storage medium, which can improve the video frame inserting effect.
In a first aspect, an embodiment of the present application provides a video frame inserting method, where the method includes:
acquiring a first video frame and a second video frame in a target video, wherein the first video frame is a forward frame of the second video frame;
inputting the first video frame and the second video frame into a pre-trained first intermediate frame image generation model to perform feature fusion to obtain a first intermediate frame image between the first video frame and the second video frame, wherein the first intermediate frame image generation model comprises a linear combination network, an N-level feature extraction network and an N-level decoding network, and N is an integer greater than or equal to 2, wherein:
The i-th level feature extraction network is used for encoding an input image to obtain an image with a transformed size, and performing warp operation on the image with the transformed size and the image containing optical flow information output by the i+1st level decoding network, wherein the image with the transformed size is used as the image input into the i+1st level feature extraction network, and the image input into the 1 st level feature extraction network in the N-level feature extraction network is the first video frame and the second video frame;
the method comprises the steps that an i-th level decoding network is used for carrying out inverse transformation on a first image containing optical flow information and a second image not containing optical flow information, which are input, respectively obtaining a third image containing optical flow information after the first image is subjected to inverse transformation and a fourth image not containing optical flow information after the second image is subjected to inverse transformation, wherein the first image is an image obtained by carrying out warping operation on the optical flow information output by the i-th level feature extraction network and a fifth image containing optical flow information output by the i+1-th level decoding network, the second image is a sixth image not containing optical flow information output by the i+1-th level decoding network, the N-th level decoding network in the N-level decoding network is used for carrying out inverse transformation on the image with the transformed size output by the N-th level feature extraction network, and obtaining a seventh image and an eighth image after inverse transformation, wherein the seventh image is used as the image containing optical flow information output by the N-th level decoding network, and the eighth image is used as the image not containing optical flow information output by the N-th level decoding network;
The linear combination network is used for linearly combining the image containing optical flow information output by the 1 st stage decoding network in the N stage decoding network and the image not containing optical flow information to obtain the first intermediate frame image.
In computer vision, optical flow is used to characterize pixel-level motion in an image, either caused by camera movement or by movement of the object itself. Optical flow (also called optical flow field) refers to a set of pixel displacements between two adjacent frames of pictures, that is, a set of displacement vectors generated in the process of moving each pixel in a previous picture to a corresponding pixel position in a subsequent picture. Optical flow estimation is a classical problem in computer vision, and is a key step of many video understanding algorithms, such as video frame insertion, moving object detection, video content understanding, and the like, and often relies on accurate optical flow information.
Optical flow estimation can be divided into sparse optical flow and dense optical flow, depending on whether the flow is estimated only at selected sparse image points. Dense optical flow describes the motion of every pixel of the image to the next frame. Unless stated otherwise, the optical flow discussed here refers to dense optical flow, and the technical scheme of the embodiments of this application targets dense optical flow.
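For illustration, a dense flow field can be applied to a frame by backward warping, as in the following sketch; the helper name and the use of OpenCV's remap are illustrative choices, not part of this text.

```python
import cv2
import numpy as np

def backward_warp(image: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp `image` with a dense flow field.

    image: H x W x C array (the frame the flow points into).
    flow:  H x W x 2 array; flow[y, x] = (dx, dy) displacement of pixel (x, y).
    """
    h, w = flow.shape[:2]
    # Build the sampling grid: each output pixel reads from (x + dx, y + dy).
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```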
In practical application, the optical flow estimation method is applied to an optical flow estimation model, but the optical flow estimation model generally adopts a bidirectional flow estimation method, namely, optical flows of front and rear two frames of images are respectively reversed and then linearly combined to obtain the optical flows of middle frames of the front and rear two frames of images.
However, the optical flow estimation model applied in this optical flow estimation method has certain problems. First, its parameter scale is large, so the training cost of the model is high and its operation efficiency is low. Second, the robustness of the optical flow estimation model is limited by the picture scale of the target data set used for training; when a downstream task requires optical flow at a scale larger than the picture scale of the training data set, the corresponding optical flow cannot be estimated with this model. The robustness of the method to optical flows of different scales is therefore limited by the picture scale of the training data set, and good generalisation cannot be obtained in practical applications. Third, the method can only obtain the unidirectional optical flow between adjacent frames in one pass, so the bidirectional optical flow requires two passes; estimating the bidirectional optical flow is therefore inefficient and cannot meet real-time requirements.
Unlike the bidirectional optical flow estimation described above, the method provided in this embodiment of the present application fuses the features of the two input frames through the first intermediate frame image generation model to obtain the first intermediate frame image, where the features include optical flow features. The embodiment of the present application uses intermediate flow estimation, that is, the optical flow is estimated directly at the intermediate moment for which the intermediate frame image is desired, so only one warp operation is needed.
Secondly, the first intermediate frame image generation model comprises a linear combination network, an N-level feature extraction network and an N-level decoding network, where N is an integer greater than or equal to 2. When the model is applied, feature images of different sizes are first generated from the input video frames. After a feature image of one size is decoded or warped, the resulting image containing optical flow information is warped together with the feature image of the next size, so that the decoded features at that size carry the warped optical flow information. Each optical flow estimation can therefore share parameters, which greatly compresses the number of model parameters and also facilitates multi-scale optical flow training: the model can be trained or fine-tuned on lower-resolution data yet obtain good generalisation on high-resolution pictures, effectively improving the robustness of the model to optical flows of different scales.
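A minimal PyTorch-style sketch of this coarse-to-fine scheme follows. It assumes each decoder outputs a 4-channel intermediate flow (two channels toward each input frame), that the same warp operation is used at every level, and that the decoder modules are supplied from outside; these interfaces are assumptions for illustration, not details taken from this text.

```python
import torch
import torch.nn.functional as F

def warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a B x C x H x W feature map with a 2-channel flow field."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)   # 2 x H x W
    coords = grid.unsqueeze(0) + flow                             # B x 2 x H x W
    # Normalise sampling coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_n = torch.stack((coords_x, coords_y), dim=-1)            # B x H x W x 2
    return F.grid_sample(feat, grid_n, mode='bilinear', align_corners=True)

def coarse_to_fine_flow(feats0, feats1, decoders):
    """Refine the intermediate flow from the coarsest pyramid level to the finest.

    feats0, feats1: lists of feature maps of the two frames, index 0 = finest.
    decoders:       one decoder per level; assumed to output a 4-channel flow.
    """
    flow = None
    for level in reversed(range(len(feats0))):        # start at the smallest map
        f0, f1 = feats0[level], feats1[level]
        if flow is None:
            inp = torch.cat([f0, f1], dim=1)          # coarsest level: no prior flow
        else:
            # Upsample the coarser flow to this level and scale its magnitude.
            flow = 2.0 * F.interpolate(flow, scale_factor=2,
                                       mode='bilinear', align_corners=False)
            # Warp both feature maps with the coarser estimate (anchor information).
            inp = torch.cat([warp(f0, flow[:, :2]),
                             warp(f1, flow[:, 2:4]), flow], dim=1)
        flow = decoders[level](inp)                   # refine the flow at this scale
    return flow
```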
In a further possible implementation manner of the first aspect, before acquiring the first video frame and the second video frame in the target video, the method further includes:
constructing a first training data set and a first verification data set of the first intermediate frame image generation model, wherein the first verification data set comprises a plurality of groups of image combinations comprising continuous three-frame images in the same video, the continuous three-frame images in the image combinations comprise a front frame image, an intermediate frame image and a rear frame image, and the first training data set comprises a plurality of groups of image combinations comprising the front frame image, the intermediate frame image and the rear frame image in the same video and an initial intermediate frame image;
taking a front frame image, a rear frame image and the initial intermediate frame image of the multi-group image combination in the first training data set as input of an initial first intermediate frame image generation model, and taking the intermediate frame image in the image combination as target output;
and verifying the trained initial first intermediate frame image generation model through the verification data set, and if the verification is passed, obtaining the first intermediate frame image generation model.
In this embodiment, during training of the first intermediate frame image generation model, an initial intermediate frame image is constructed in the training data set but not in the verification data set. This facilitates real-time intermediate flow estimation, lets the network learn the intermediate-frame optical flow directly end to end, and reduces the extra overhead caused by bidirectional flow estimation.
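For illustration, a minimal sketch of how triplets of consecutive frames could be collected from a source video with OpenCV; the stride and the in-memory layout are assumptions, not details specified here.

```python
import cv2

def extract_triplets(video_path: str, stride: int = 1):
    """Yield (previous, intermediate, next) frame triplets from a video.

    The middle frame of each triplet serves as the ground-truth target;
    the outer two frames are the model inputs.
    """
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    for i in range(0, len(frames) - 2, stride):
        yield frames[i], frames[i + 1], frames[i + 2]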
In a further possible implementation manner of the first aspect, the initial intermediate frame image is an image having a pixel value equal to a value of the time interval information and having a size equal to a size of a preceding frame image and a following frame image in the corresponding image combination.
This embodiment mainly defines the initial intermediate frame image, which is an image carrying the time interval information. For different time interval information t, the initial intermediate frame image is fed into the convolutional neural network as an additional channel, so that the model can learn intermediate-frame optical flow estimation directly end to end and directly output the intermediate frame image, improving the final accuracy.
In a further possible implementation manner of the first aspect, the calculation formula of the time interval information is as follows:
t = (n1 - n0) / (n2 - n0)
wherein t is time interval information, n0 is the frame number of the previous frame image in the corresponding video, n1 is the frame number of the middle frame image in the corresponding video, and n2 is the frame number of the following frame image in the corresponding video.
For three consecutive pictures in a video, their frame numbers are n0, n1 and n2, respectively. It should be noted that during verification and testing of the model, no additional time information is input; the frames to be interpolated are input directly.
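As an illustration, a minimal sketch (using hypothetical helper names) of how the time interval t could be computed from the frame numbers and turned into the constant-valued initial intermediate frame image described above:

```python
import numpy as np

def initial_intermediate_frame(prev_img: np.ndarray, n0: int, n1: int, n2: int) -> np.ndarray:
    """Build the constant 'time channel' fed to the network as an extra input.

    Every pixel takes the value t = (n1 - n0) / (n2 - n0), and the image has
    the same spatial size as the previous/next frames (here: prev_img).
    """
    t = (n1 - n0) / (n2 - n0)
    h, w = prev_img.shape[:2]
    return np.full((h, w, 1), t, dtype=np.float32)

# For three consecutive frames (n1 exactly in the middle), t = 0.5.
```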
In a further possible implementation manner of the first aspect, the image after the size transformation is a pyramid feature image with different scales obtained by the ith level feature extraction network according to a pyramid feature extraction method.
An image pyramid is a series of images derived from the same original image, arranged in a pyramid shape with progressively lower resolution. It is obtained by repeated downsampling, which stops only when a termination condition is reached. The bottom of the pyramid is a high-resolution representation of the image to be processed, while the top is a low-resolution approximation. Stacking the images layer by layer resembles a pyramid: the higher the level, the smaller the image and the lower the resolution. Combining the image pyramid with recursive optical flow estimation facilitates multi-scale optical flow training; the model can be trained or fine-tuned on lower-resolution data yet obtain very good generalisation on high-resolution pictures, which effectively improves its robustness to optical flows of different scales.
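A minimal sketch of such a pyramid, assuming each level halves the resolution via OpenCV's pyrDown; the number of levels is an illustrative choice.

```python
import cv2

def build_pyramid(image, levels: int = 4):
    """Return [original, 1/2, 1/4, 1/8, ...] resolution copies of `image`."""
    pyramid = [image]
    for _ in range(levels - 1):
        # cv2.pyrDown blurs the image and halves each spatial dimension.
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    return pyramid
```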
In a further possible implementation manner of the first aspect, after inputting the first video frame and the second video frame into a pre-trained first intermediate frame image generation model for feature fusion, the method further includes:
Constructing an initial convolutional neural network of a second intermediate frame image generation model;
constructing a second training data set and a second verification data set;
training the initial convolutional neural network through the second training data set to obtain a trained convolutional neural network;
verifying the trained convolutional neural network through the second verification data set, and if the verification is passed, obtaining a second intermediate frame image generation model;
inputting the first video frame and the second video frame into the second intermediate frame image generation model to obtain a second intermediate frame image;
fusing the second intermediate frame image and the first intermediate frame image to obtain a final intermediate frame image.
In order to achieve better frame inserting effect, the second intermediate frame image generation model is constructed, wherein the second intermediate frame image generation model can be realized based on intermediate flow estimation or bidirectional optical flow estimation.
Furthermore, the intermediate frame images obtained by the first intermediate frame image generation model and the second intermediate frame image generation model are fused. The fusion can be performed at different levels, for example only optical flow fusion or only local area fusion, so that a poor-quality intermediate frame image is avoided and the frame interpolation effect is further improved.
In a further possible implementation manner of the first aspect, the first intermediate frame image generation model includes a linear combination network, an N-level feature extraction network, and an N-level decoding network, N being an integer greater than or equal to 2; wherein:
the N-level feature extraction network consists of two layers of convolutions, wherein the convolution kernel of the first layer of convolutions is 3*3, the step length is 2, the padding value is 1, the convolution kernel of the second layer of convolutions is 3*3, the step length is 1, and the padding value is 1;
the N-level decoding network comprises a multi-layer decoder, wherein the decoder consists of a convolution layer, a residual block and a deconvolution layer, the convolution kernel of the convolution layer is 3*3, the step length is 1, and the padding is 1; the residual block consists of 5 layers of convolutions, the convolution kernel of each layer of convolutions is 3*3, the step length is 1, and the padding is 1; the deconvolution layer has a convolution kernel size of 4*4, a step size of 2, and a padding of 1.
The key point of this embodiment is that the first intermediate frame image generation model uses, as far as possible, 3×3 convolution operators that are friendly to all hardware platforms. When the model is applied, the optical flow is first estimated on the small images and then refined on the large images in sequence, which improves the effect of optical flow estimation and makes the method provided by the embodiment of the application more suitable for practical use.
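As an illustration of the layer configuration just described, the following PyTorch-style sketch builds one encoder level and one decoder level with those kernel sizes, strides and paddings; the channel widths, the activations inside the decoder and the exact residual wiring are assumptions for illustration rather than details taken from this text.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Two-convolution feature-extraction block: 3x3 stride-2, then 3x3 stride-1."""
    def __init__(self, c_in: int, c_mid: int, c_out: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_mid, kernel_size=3, stride=2, padding=1),
            nn.PReLU(c_mid),
            nn.Conv2d(c_mid, c_out, kernel_size=3, stride=1, padding=1),
            nn.PReLU(c_out),
        )

    def forward(self, x):
        return self.body(x)

class Decoder(nn.Module):
    """Conv layer + 5-convolution residual block + 4x4 stride-2 deconvolution."""
    def __init__(self, c_in: int, c_mid: int, c_out: int):
        super().__init__()
        self.head = nn.Conv2d(c_in, c_mid, kernel_size=3, stride=1, padding=1)
        self.res = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(c_mid, c_mid, kernel_size=3, stride=1, padding=1),
                          nn.PReLU(c_mid))
            for _ in range(5)
        ])
        self.up = nn.ConvTranspose2d(c_mid, c_out, kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        h = self.head(x)
        h = h + self.res(h)          # residual connection around the 5-conv block
        return self.up(h)            # deconvolution doubles the spatial size
```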
In a further possible implementation manner of the first aspect, constructing the second training data set and the second verification data set comprises:
acquiring a sample video, wherein the sample video is a video retaining an original frame rate;
cutting the sample video to obtain a plurality of groups of image combinations containing continuous three-frame images, wherein the continuous three-frame images in the image combinations comprise a front frame image, an intermediate frame image and a rear frame image;
taking a front frame image and a rear frame image of a plurality of groups of image combinations in the training data set as input of an initial convolutional neural network of the second intermediate frame image generation model, and taking the intermediate frame image in the image combinations as target output;
optimizing the initial convolutional neural network by constructing a target loss function, and performing iterative training on the initial convolutional neural network with the training data set to obtain a trained convolutional neural network.
This embodiment mainly explains the training process of the second intermediate frame image generation model. The method applied in this embodiment constructs a feature pyramid from the original input picture of each of the two adjacent frames, with the feature size gradually decreasing as the number of pyramid layers increases, and then inputs the feature data of layer 0 of the feature pyramids into the second intermediate frame image generation model for processing to obtain an optical flow estimate. Optionally, the second intermediate frame image generation model is a convolutional neural network (CNN) composed of a warping layer and an optical flow estimation layer.
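A minimal training-loop sketch for such a model follows, assuming a PyTorch data loader yielding (previous, intermediate, next) triplets, an Adam optimiser and an L1 reconstruction loss; the model interface model(prev_frame, next_frame) is a hypothetical one chosen for illustration.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs: int = 10, lr: float = 1e-4, device: str = "cuda"):
    """Iteratively fit the network to predict the middle frame of each triplet."""
    model = model.to(device)
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()                 # reconstruction loss on the middle frame
    for epoch in range(epochs):
        for prev_frame, mid_frame, next_frame in loader:
            prev_frame, mid_frame, next_frame = (
                t.to(device) for t in (prev_frame, mid_frame, next_frame))
            pred_mid = model(prev_frame, next_frame)
            loss = criterion(pred_mid, mid_frame)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return model
```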
In a second aspect, an embodiment of the present application provides a video frame inserting apparatus, where the apparatus includes at least an acquisition unit and a first input unit. The video frame inserting device is configured to implement the method described in any implementation manner of the first aspect, where the acquiring unit and the first input unit are described as follows:
an acquisition unit configured to acquire a first video frame and a second video frame in a target video, the first video frame being a forward frame of the second video frame;
the first input unit is configured to input the first video frame and the second video frame to a pre-trained first intermediate frame image generation model for feature fusion, so as to obtain a first intermediate frame image between the first video frame and the second video frame, where the first intermediate frame image generation model includes a linear combination network, an N-level feature extraction network, and an N-level decoding network, N is an integer greater than or equal to 2, where:
the i-th level feature extraction network is used for encoding an input image to obtain an image with a transformed size, and performing warp operation on the image with the transformed size and the image containing optical flow information output by the i+1st level decoding network, wherein the image with the transformed size is used as the image input into the i+1st level feature extraction network, and the image input into the 1 st level feature extraction network in the N-level feature extraction network is the first video frame and the second video frame;
The method comprises the steps that an i-th level decoding network is used for carrying out inverse transformation on a first image containing optical flow information and a second image not containing optical flow information, which are input, respectively obtaining a third image containing optical flow information after the first image is subjected to inverse transformation and a fourth image not containing optical flow information after the second image is subjected to inverse transformation, wherein the first image is an image obtained by carrying out warping operation on the optical flow information output by the i-th level feature extraction network and a fifth image containing optical flow information output by the i+1-th level decoding network, the second image is a sixth image not containing optical flow information output by the i+1-th level decoding network, the N-th level decoding network in the N-level decoding network is used for carrying out inverse transformation on the image with the transformed size output by the N-th level feature extraction network, and obtaining a seventh image and an eighth image after inverse transformation, wherein the seventh image is used as the image containing optical flow information output by the N-th level decoding network, and the eighth image is used as the image not containing optical flow information output by the N-th level decoding network;
the linear combination network is used for linearly combining the image containing optical flow information output by the 1 st stage decoding network in the N stage decoding network and the image not containing optical flow information to obtain the first intermediate frame image.
In computer vision, optical flow is used to characterize pixel-level motion in an image, either caused by camera movement or by movement of the object itself. Optical flow (also called optical flow field) refers to a set of pixel displacements between two adjacent frames of pictures, that is, a set of displacement vectors generated in the process of moving each pixel in a previous picture to a corresponding pixel position in a subsequent picture. Optical flow estimation is a classical problem in computer vision, and is a key step of many video understanding algorithms, such as video frame insertion, moving object detection, video content understanding, and the like, and often relies on accurate optical flow information.
Optical flow estimation can be divided into sparse optical flow and dense optical flow, depending on whether the flow is estimated only at selected sparse image points. Dense optical flow describes the motion of every pixel of the image to the next frame. Unless stated otherwise, the optical flow discussed here refers to dense optical flow, and the technical scheme of the embodiments of this application targets dense optical flow.
In practical application, the optical flow estimation method is applied to an optical flow estimation model, but the optical flow estimation model generally adopts a bidirectional flow estimation method, namely, optical flows of front and rear two frames of images are respectively reversed and then linearly combined to obtain the optical flows of middle frames of the front and rear two frames of images.
However, the optical flow estimation model applied in this optical flow estimation method has certain problems. First, its parameter scale is large, so the training cost of the model is high and its operation efficiency is low. Second, the robustness of the optical flow estimation model is limited by the picture scale of the target data set used for training; when a downstream task requires optical flow at a scale larger than the picture scale of the training data set, the corresponding optical flow cannot be estimated with this model. The robustness of the method to optical flows of different scales is therefore limited by the picture scale of the training data set, and good generalisation cannot be obtained in practical applications. Third, the method can only obtain the unidirectional optical flow between adjacent frames in one pass, so the bidirectional optical flow requires two passes; estimating the bidirectional optical flow is therefore inefficient and cannot meet real-time requirements.
Unlike the bidirectional optical flow estimation described above, the method provided in this embodiment of the present application fuses the features of the two input frames through the first intermediate frame image generation model to obtain the first intermediate frame image, where the features include optical flow features. The embodiment of the present application uses intermediate flow estimation, that is, the optical flow is estimated directly at the intermediate moment for which the intermediate frame image is desired, so only one warp operation is needed.
Secondly, the first intermediate frame image generation model comprises a linear combination network, an N-level feature extraction network and an N-level decoding network, where N is an integer greater than or equal to 2. When the model is applied, feature images of different sizes are first generated from the input video frames. After a feature image of one size is decoded or warped, the resulting image containing optical flow information is warped together with the feature image of the next size, so that the decoded features at that size carry the warped optical flow information. Each optical flow estimation can therefore share parameters, which greatly compresses the number of model parameters and also facilitates multi-scale optical flow training: the model can be trained or fine-tuned on lower-resolution data yet obtain good generalisation on high-resolution pictures, effectively improving the robustness of the model to optical flows of different scales.
In a further possible implementation manner of the second aspect, the apparatus further includes:
a first construction unit configured to construct a first training data set and a first verification data set of the first intermediate frame image generation model, where the first verification data set includes a plurality of sets of image combinations including consecutive three frame images in the same video, the consecutive three frame images in the image combinations including a front frame image, an intermediate frame image, and a rear frame image, and the first training data set includes a plurality of sets of image combinations including the front frame image, the intermediate frame image, and the rear frame image in the same video, and an initial intermediate frame image;
A second input unit, configured to take a previous frame image, a subsequent frame image, and the initial intermediate frame image of the combination of the multiple groups of images in the first training data set as input of an initial first intermediate frame image generation model, and take an intermediate frame image in the combination of the images as a target output;
and the verification unit is used for verifying the trained initial first intermediate frame image generation model through the verification data set, and if the verification is passed, the first intermediate frame image generation model is obtained.
In this embodiment, during training of the first intermediate frame image generation model, an initial intermediate frame image is constructed in the training data set but not in the verification data set. This facilitates real-time intermediate flow estimation, lets the network learn the intermediate-frame optical flow directly end to end, and reduces the extra overhead caused by bidirectional flow estimation.
In a further possible implementation manner of the second aspect, the apparatus further includes:
the second construction unit is used for constructing an initial convolutional neural network of a second intermediate frame image generation model;
a third construction unit for constructing a second training data set and a second verification data set;
The training unit is used for training the initial convolutional neural network through the second training data set to obtain a trained convolutional neural network;
the verification unit is used for verifying the trained convolutional neural network through the second verification data set, and if the verification is passed, the second intermediate frame image generation model is obtained;
the third input unit is used for inputting the first video frame and the second video frame into the second intermediate frame image generation model to obtain a second intermediate frame image;
and the fusion unit is used for fusing the second intermediate frame image and the first intermediate frame image to obtain a final intermediate frame image.
In order to achieve better frame inserting effect, the second intermediate frame image generation model is constructed, wherein the second intermediate frame image generation model can be realized based on intermediate flow estimation or bidirectional optical flow estimation.
Furthermore, the intermediate frame images obtained by the first intermediate frame image generation model and the second intermediate frame image generation model are fused. The fusion can be performed at different levels, for example only optical flow fusion or only local area fusion, so that a poor-quality intermediate frame image is avoided and the frame interpolation effect is further improved.
In a further possible implementation manner of the second aspect, the third building unit is specifically configured to:
acquiring a sample video, wherein the sample video is a video retaining an original frame rate;
cutting the sample video to obtain a plurality of groups of image combinations containing continuous three-frame images, wherein the continuous three-frame images in the image combinations comprise a front frame image, an intermediate frame image and a rear frame image;
taking a front frame image and a rear frame image of a plurality of groups of image combinations in the training data set as input of an initial convolutional neural network of the second intermediate frame image generation model, and taking the intermediate frame image in the image combinations as target output;
optimizing the initial convolutional neural network by constructing a target loss function, and performing iterative training on the initial convolutional neural network with the training data set to obtain a trained convolutional neural network.
This embodiment mainly explains the training process of the second intermediate frame image generation model. The method applied in this embodiment constructs a feature pyramid from the original input picture of each of the two adjacent frames, with the feature size gradually decreasing as the number of pyramid layers increases, and then inputs the feature data of layer 0 of the feature pyramids into the second intermediate frame image generation model for processing to obtain an optical flow estimate. The second intermediate frame image generation model is a convolutional neural network (CNN) composed of a warping layer and an optical flow estimation layer.
In a third aspect, embodiments of the present application provide a video frame inserting device, where the device includes a processor, a memory, and a communication interface; the memory stores a computer program; the communication interface is configured to transmit and/or receive data; and when the computer program is executed by the processor, the video frame inserting device can perform the method described in the foregoing first aspect or any of the possible implementations of the first aspect.
The processor included in the video frame inserting apparatus described in the third aspect may be a processor dedicated to performing the methods (referred to as a dedicated processor for convenience of distinction), or may be a processor that performs the methods by calling a computer program, such as a general-purpose processor. In the alternative, the at least one processor may also include both special purpose and general purpose processors.
Alternatively, the above-mentioned computer program may be stored in a memory. For example, the memory may be a non-transitory memory, such as a read-only memory (ROM); it may be integrated on the same device as the processor or disposed separately on different devices. The embodiments of the present application do not limit the type of the memory or the manner in which the memory and the processor are disposed.
In one possible implementation, the at least one memory is located outside the video plug-in device.
In yet another possible embodiment, the at least one memory is located within the video framing device.
In yet another possible embodiment, a portion of the at least one memory is located within the video framing device and another portion of the at least one memory is located outside the video framing device.
In this application, the processor and the memory may also be integrated in one device, i.e. the processor and the memory may also be integrated together.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having a computer program stored therein, which when executed on at least one processor, implements the method described in the foregoing first aspect or any of the alternatives of the first aspect.
In a fifth aspect, the present application provides a computer program product comprising a computer program for implementing the method of the first aspect or any of the alternatives of the first aspect, when said program is run on at least one processor.
Alternatively, the computer program product may be a software installation package, which may be downloaded and executed on a computing device in case the aforementioned method is required.
The technical solutions provided in the third to fifth aspects of the present application may refer to the beneficial effects of the technical solutions in the first aspect and the second aspect, and are not described herein again.
Drawings
The drawings that are used in the description of the embodiments will be briefly described below.
Fig. 1 is a schematic diagram of a network architecture to which a video frame inserting method provided in an embodiment of the present application is applicable;
fig. 2 is a schematic flow chart of a video frame inserting method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a video frame inserting method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an encoder network according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a decoder according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a video frame inserting device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a video frame inserting device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims of this application and in the drawings, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
The following describes a system architecture applied to the embodiment of the present application. It should be noted that, the system architecture and the service scenario described in the present application are for more clearly describing the technical solution of the present application, and do not constitute a limitation on the technical solution provided in the present application, and those skilled in the art can know that, with the evolution of the system architecture and the appearance of the new service scenario, the technical solution provided in the present application is also applicable to similar technical problems.
Fig. 1 is a schematic diagram of a network architecture to which a video frame inserting method according to an embodiment of the present application is applicable. Referring to fig. 1, the network architecture includes: the server 11 and the terminal device 12, and the terminal device 12 and the server 11 establish a network connection, which may be a wired communication connection, a wireless communication connection, or an optical fiber cable, or the like.
The server 11 has a huge computing power, storage power, etc., and is capable of providing services to the terminal device 12. The server 11 may be hardware or software. When the server 11 is hardware, the server 11 is a single server or a distributed server cluster composed of a plurality of servers. When the server 11 is software, it may be a plurality of software modules or a single software module, etc., and the embodiment of the present application is not limited.
The server 11 is provided with a first intermediate frame image generation model and a second intermediate frame image generation model obtained based on deep learning, and can segment an input video to obtain video groups containing an adaptive number of frames, where a video group comprises a first video frame and a second video frame, the first video frame being a forward frame of the second video frame; the server 11 may generate a first intermediate frame image and a second intermediate frame image for each group, fuse the first intermediate frame image and the second intermediate frame image to obtain a target intermediate frame image, and insert frames into the group according to the target intermediate frame image.
The terminal device 12 may be, for example, a cell phone, tablet computer, electronic book reader, laptop, desktop computer, etc., and embodiments of the present application are not limited.
In the video processing process, a user selects a video to be input to the server 11 on the terminal device 12, sends indication information to the server 11, the indication information is used for indicating a source video, and after receiving the indication information, the server 11 acquires the source video and performs frame inserting processing on the source video to obtain a target video.
In the embodiment of the application, besides converting low-frame-rate video into high-frame-rate video, frame interpolation can also be used to generate slow motion, assist video compression, generate training data, and the like. The training data is used to train a deblurring model: after the deblurring model is trained, inputting a blurred motion video yields a clear motion video; alternatively, inputting a blurred image into the deblurring model yields a clear image. The video frame inserting method provided by the embodiment of the application runs at real-time speed on high-resolution video, so that a player can play video at a higher frame rate, provide video editing services for users, and so on.
In the architecture shown in fig. 1, the server 11 performs video frame interpolation on the input video. However, the embodiment of the present application is not limited thereto; in other possible implementations, the first intermediate frame image generation model and the second intermediate frame image generation model may be deployed on the terminal device 12. In that case, the terminal device 12 does not need to indicate the source video to the server 11, but locally performs frame interpolation processing on a source video of a certain frame rate according to the indication information input by the user, to obtain a target video with a higher frame rate.
At present, there are mainly two related modes for generating the intermediate frame image. In the first mode, flow-based video frame interpolation generally estimates the bidirectional optical flow between two adjacent frames of the video, obtains an approximate intermediate flow from the bidirectional optical flow by linear combination, and finally determines the intermediate frame image from the approximate intermediate flow. In the second mode, a real-time intermediate flow estimation algorithm directly estimates the intermediate flow from coarse to fine through an intermediate flow model (IFNet); the input frames are then warped according to the estimated intermediate flow, and the warped input frames and the intermediate flow are fused and refined using a convolutional neural network (CNN) or the like, thereby obtaining an intermediate frame image.
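A common formulation of the linear combination used in the first mode, taken from the frame interpolation literature rather than from this text (so the exact coefficients are an assumption), approximates the flows from the intermediate moment t to the two input frames from the bidirectional flows:

$$\hat{F}_{t\to 0} = -(1-t)\,t\,F_{0\to 1} + t^{2}\,F_{1\to 0},\qquad \hat{F}_{t\to 1} = (1-t)^{2}\,F_{0\to 1} - t\,(1-t)\,F_{1\to 0}$$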
However, in the first mode the approximate intermediate flow is obtained by linearly combining the bidirectional optical flows, and in practical applications occlusion and complex nonlinear motion cannot be modelled directly from the video by such a linear combination, so artifacts or distortion easily appear at motion boundaries of the intermediate frame image. Referring to fig. 2, fig. 2 is a flowchart of a video frame inserting method according to an embodiment of the present application; the video frame inserting method may be implemented based on the system architecture shown in fig. 1 or based on other architectures, and the method includes, but is not limited to, the following steps:
Step S201: and acquiring a first video frame and a second video frame in the target video.
The first video frame is a forward frame of the second video frame;
the target video may be a video, a dynamic image, or other resource having N frames of images, where N is an integer greater than 1, and the frame insertion method provided in the embodiments of the present application will be described mainly by taking the video resource as an example.
In this embodiment, the received target video may be a relatively low-frame-rate video, for example a low-frame-rate video with a frame rate of 30 frames per second (fps). If this low-frame-rate video is to be converted into a 60 fps high-frame-rate video, an intermediate frame may be inserted between every two adjacent frames of the low-frame-rate video, i.e., the frame interpolation operation may be performed.
In the low-frame-rate video, the first frame image is a first video frame, and the second frame image is a second video frame, so that a new third image is inserted between the first video frame and the second video frame.
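As an illustration of the overall insertion step, the sketch below interleaves a predicted intermediate frame between every adjacent pair of frames, assuming a hypothetical model(frame0, frame1) call that returns the intermediate frame:

```python
def double_frame_rate(frames, model):
    """Interleave a predicted intermediate frame between every adjacent pair.

    frames: list of decoded video frames in display order (e.g. 30 fps input).
    model:  callable returning the intermediate frame for (frame0, frame1).
    Returns a new frame list with roughly twice as many frames (e.g. 60 fps).
    """
    output = []
    for frame0, frame1 in zip(frames, frames[1:]):
        output.append(frame0)
        output.append(model(frame0, frame1))   # inserted intermediate frame
    output.append(frames[-1])                  # keep the final original frame
    return output
```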
Step S202, inputting the first video frame and the second video frame into a pre-trained first intermediate frame image generation model to perform feature fusion, so as to obtain a first intermediate frame image between the first video frame and the second video frame.
In this embodiment, the first intermediate frame image generation model includes a linear combination network, an N-level feature extraction network, and an N-level decoding network, where N is an integer greater than or equal to 2, where:
the i-th level feature extraction network is used for encoding an input image to obtain an image with a transformed size, and performing warp operation on the image with the transformed size and the image containing optical flow information output by the i+1st level decoding network, wherein the image with the transformed size is used as the image input into the i+1st level feature extraction network, and the image input into the 1 st level feature extraction network in the N-level feature extraction network is the first video frame and the second video frame;
the method comprises the steps that an i-th level decoding network is used for carrying out inverse transformation on a first image containing optical flow information and a second image not containing optical flow information, which are input, respectively obtaining a third image containing optical flow information after the first image is subjected to inverse transformation and a fourth image not containing optical flow information after the second image is subjected to inverse transformation, wherein the first image is an image obtained by carrying out warping operation on the optical flow information output by the i-th level feature extraction network and a fifth image containing optical flow information output by the i+1-th level decoding network, the second image is a sixth image not containing optical flow information output by the i+1-th level decoding network, the N-th level decoding network in the N-level decoding network is used for carrying out inverse transformation on the image with the transformed size output by the N-th level feature extraction network, and obtaining a seventh image and an eighth image after inverse transformation, wherein the seventh image is used as the image containing optical flow information output by the N-th level decoding network, and the eighth image is used as the image not containing optical flow information output by the N-th level decoding network;
The linear combination network is used for linearly combining the image containing optical flow information output by the 1 st stage decoding network in the N stage decoding network and the image not containing optical flow information to obtain the first intermediate frame image.
In an optional implementation manner, the operation of the first intermediate frame image generation model is illustrated in fig. 3. Fig. 3 is a schematic diagram of a video frame interpolation method provided in the embodiment of the present application; in fig. 3, N=3, that is, the first intermediate frame image generation model includes a linear combination network, a 3-level feature extraction network and a 3-level decoding network, specifically as follows:
as shown in fig. 3, pyramid feature extraction is first performed using a feature extraction network;
for inputting the first video frame i 0 And said second video frame i 1 Pyramid feature extraction, namely encoding (encoding) process is carried out to obtain pyramid feature images i with different scales after the transformation of the size 1 0 And i 1 1 ,i 2 0 And i 2 1 ,i 3 0 And i 3 1 The pyramid characteristic images with different scales are a first video frame and a second video frame with different resolutions, i 1 0 And i 1 1 The first video frame and the second video frame which are respectively output by the first-stage feature extraction network and have the same size are the second images without optical flow information, i 2 0 And i 2 1 The first video frame and the second video frame which are respectively output by the second-stage feature extraction network and have the same size are the second images without optical flow information, i 3 0 And i 3 1 The first video frame and the second video frame with the same size output by the third-level feature extraction network are the second image without optical flow information, but it should be noted that the image without optical flow information indicates an image without explicitly indicating or extracting optical flow information, and also indicates an image without warp operation, and no optical flow exists in the image.
The sizes of the pyramid feature images at different resolutions are illustrated by the four-level pyramid shown in FIG. 3. Optionally, the resolution of the first level is the original resolution, e.g. i0 and i1; the resolution of the second level is 1/2, e.g. i0^1 and i1^1; the resolution of the third level is 1/4, e.g. i0^2 and i1^2; and the resolution of the fourth level is 1/8, e.g. i0^3 and i1^3.
It should be noted that the above exemplary resolution is only an exemplary illustration of the embodiments of the present specification, and is not intended to limit the present specification in any way.
Further, the encoder of the feature extraction network is composed of two convolution layers; its schematic structure is shown in fig. 4, and fig. 4 is a schematic structural diagram of an encoder network provided in an embodiment of the present application, where the encoder network is the feature extraction network;
The convolution kernel size of the first convolution layer is 3×3, with a stride of 2 and a padding value of 1; the convolution kernel size of the second convolution layer is 3×3, with a stride of 1 and a padding value of 1; a PReLU activation is applied after each convolution. Conv(Cin, Cmid, 3, 2, 1), PReLU(Cmid), Conv(Cmid, Cout, 3, 1, 1) and PReLU(Cout) shown in fig. 4 are code function representations of the feature extraction network in this embodiment.
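For illustration, a minimal PyTorch sketch of such a two-layer convolution encoder block is given below; the class name EncoderBlock, the channel widths and the use of torch.nn are assumptions made for readability, not the applicant's actual implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pyramid level of the feature extraction (encoder) network:
    Conv(Cin, Cmid, 3, 2, 1) -> PReLU -> Conv(Cmid, Cout, 3, 1, 1) -> PReLU."""
    def __init__(self, c_in: int, c_mid: int, c_out: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, kernel_size=3, stride=2, padding=1),  # halves H and W
            nn.PReLU(c_mid),
            nn.Conv2d(c_mid, c_out, kernel_size=3, stride=1, padding=1),  # keeps H and W
            nn.PReLU(c_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# Hypothetical usage: build a 3-level pyramid from an RGB frame pair stacked on channels.
frames = torch.randn(1, 6, 1088, 1920)       # i0 and i1 concatenated, padded height
level1 = EncoderBlock(6, 32, 32)(frames)     # 1/2 resolution
level2 = EncoderBlock(32, 64, 64)(level1)    # 1/4 resolution
level3 = EncoderBlock(64, 128, 128)(level2)  # 1/8 resolution
```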
Optionally, when the first intermediate frame image is generated in this embodiment, convolution operations need to be performed on the first video frame and the second video frame of the frame to be interpolated. Convolution can make the image progressively smaller and lose some boundary information, so that corner and boundary information of the image plays a reduced role. To avoid this problem, in one or more embodiments, a padding operation may be performed on the first video frame and the second video frame of the frame to be interpolated. The padding operation fills "0" values into the image before the convolution operation, which can be understood as adding a border of cells around the image blocks of the first video frame and the second video frame of the frame to be interpolated; this ensures that the size of the image samples of the frame to be interpolated does not change after convolution and that the edge data of the first video frame and the second video frame of the frame to be interpolated can be utilized, thereby better retaining the edge features of the entire image. For example, assuming that the resolution of the first video frame and the second video frame of a certain frame to be interpolated is 1920×1080, the height may be padded to 1088 so that the resolution of the first video frame and the second video frame of the frame to be interpolated becomes 1920×1088.
It should be noted that the foregoing exemplary padded height of 1088 is only one exemplary illustration of the embodiments of the present disclosure, and is not meant to limit the present disclosure in any way.
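As a worked illustration of this padding step, the sketch below zero-pads a frame so that its height and width become multiples of 32, which reproduces the 1920×1080 → 1920×1088 example above; the choice of 32 as the alignment multiple and the use of torch.nn.functional.pad are assumptions, not values fixed by the application.

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(frame: torch.Tensor, multiple: int = 32) -> torch.Tensor:
    """Zero-pad an (N, C, H, W) frame so H and W are divisible by `multiple`."""
    _, _, h, w = frame.shape
    pad_h = (multiple - h % multiple) % multiple
    pad_w = (multiple - w % multiple) % multiple
    # F.pad takes (left, right, top, bottom) for the last two dimensions.
    return F.pad(frame, (0, pad_w, 0, pad_h), mode="constant", value=0)

frame = torch.randn(1, 3, 1080, 1920)
print(pad_to_multiple(frame).shape)  # torch.Size([1, 3, 1088, 1920])
```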
Secondly, decoding using a decoding (decoder) network;
the feature extraction network first inputs the pyramid feature images i0^3 and i1^3 to the third-level decoding network; the third-level decoding network decodes the received pyramid feature images i0^3 and i1^3 to obtain an inverse-transformed-size image, where the inverse-transformed-size image is the eighth image; the third-level decoding network inputs the inverse-transformed-size image to the third-level feature extraction network and to the second-level decoding network;
the third-level feature extraction network performs a warp operation on the input i0^2, i1^2 and the inverse-transformed-size image to obtain an image containing optical flow information, and then sends the image containing optical flow information to the second-level decoding network;
the second-level decoding network decodes the input image containing optical flow information and the input image not containing optical flow information to obtain an inverse-transformed-size image containing optical flow information and an inverse-transformed-size image not containing optical flow information, where the inverse-transformed-size image containing optical flow information may be the fifth image and the inverse-transformed-size image not containing optical flow information may be the sixth image; the second-level decoding network inputs the inverse-transformed-size image containing optical flow information to the second-level feature extraction network, and inputs the inverse-transformed-size image not containing optical flow information to the first-level decoding network;
The second-level feature extraction network performs a warp operation on the obtained inverse-transformed-size image containing optical flow information and the images i0^1 and i1^1 output by the first-level feature extraction network, obtaining an image with further refined optical flow information, and sends it to the first-level decoding network;
the first-level decoding network decodes the obtained image with the further refined optical flow information and the inverse-transformed-size image not containing optical flow information to obtain an again inverse-transformed-size image containing optical flow information and an again inverse-transformed-size image not containing optical flow information, where the again inverse-transformed-size image containing optical flow information may be the first image or the third image, and the again inverse-transformed-size image not containing optical flow information may be the second image or the fourth image; the first-level decoding network inputs the again inverse-transformed-size image containing optical flow information to the first-level feature extraction network, and inputs the again inverse-transformed-size image not containing optical flow information to the linear combination network;
the first-level feature extraction network performs a warp operation on the again inverse-transformed-size image containing optical flow information, the first video frame i0 and the second video frame i1 to obtain a final image with the richest optical flow details, and inputs this final image containing optical flow information into the linear combination network.
Optionally, the inverse-transformed-size image and the again inverse-transformed-size image not containing optical flow information may each include two images, namely images obtained based on the first video frame and the second video frame, whereas the image containing optical flow information may include only one image, namely an image containing the intermediate optical flow obtained from the first video frame and the second video frame.
It should be noted that, as can be seen from fig. 3, after each decoding is finished, the output image containing optical flow information and the features of the previous level are subjected to a warp operation to obtain the intermediate optical flow of that level; this intermediate optical flow is used as the input of the next-level decoder to provide anchor point information for the next-level decoding network, thereby promoting the intermediate flow estimation of the next level and making the obtained intermediate optical flow clearer and sharper; at the same time, the linearly combined optical flow exhibits fewer overlapping and blurring artifacts.
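The warp operation referred to throughout this example is, in essence, a backward warping of an image (or feature map) by an optical flow field. A minimal sketch using torch.nn.functional.grid_sample is given below, assuming the flow is expressed in pixels as an (N, 2, H, W) tensor; this is a generic implementation of backward warping under those assumptions, not necessarily the exact operator used in the application.

```python
import torch
import torch.nn.functional as F

def warp(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `image` (N, C, H, W) by `flow` (N, 2, H, W) given in pixels."""
    n, _, h, w = image.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(image.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                               # (N, 2, H, W)
    # Normalise to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)           # (N, H, W, 2)
    return F.grid_sample(image, grid_norm, align_corners=True)
```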
Further, in this embodiment, a decoder is disposed in each of the three decoding networks; a schematic structure of the decoder is shown in fig. 5, and fig. 5 is a schematic structural diagram of a decoder provided in an embodiment of the present application;
The decoder consists of a convolution layer, a residual block and a deconvolution layer. First, a convolution layer with a kernel size of 3×3, a stride of 1 and a padding of 1 is used for feature extraction, followed by a PReLU activation function. Features are then extracted by a residual block consisting of 5 convolution layers, each with a kernel size of 3×3, a stride of 1 and a padding of 1; a PReLU activation function is applied after each of the first 4 convolutions, while after the 5th convolution the result is fused with the residual input and then activated. Finally, a deconvolution operation with a kernel size of 4×4, a stride of 2 and a padding of 1 is used, so that the output size is the same as that of the previous level.
Note that Conv(Cin, Cmid, 3, 1, 1), PReLU(Cmid), Deconv(Cin, Cout, 4, 2, 1), PReLU(Cout), Conv(Cin, Cmid, 3, 1, 1), Conv(Cin, C1, 3, 1, 1), etc. shown in fig. 5 are all code function representations of the decoding network in this embodiment; specifically, ResBlock is a residual block, Cin×H×W indicates the input image and its size, Cout×H×W indicates an output image with the same size as the input, and Cout×H/2×W/2 indicates that the size of the output image is 1/2 of the input.
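A minimal PyTorch sketch of a decoder with this layout (a 3×3 stride-1 convolution, a 5-layer residual block, then a 4×4 stride-2 deconvolution that doubles the spatial size) is shown below; the class names, channel widths and the exact placement of the residual fusion are assumptions that only approximate the description above.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Five 3x3 stride-1 convolutions; PReLU after the first four,
    residual addition and PReLU after the fifth."""
    def __init__(self, channels: int):
        super().__init__()
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(channels, channels, 3, 1, 1), nn.PReLU(channels)]
        self.body = nn.Sequential(*layers)
        self.last = nn.Conv2d(channels, channels, 3, 1, 1)
        self.act = nn.PReLU(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.last(self.body(x)) + x)

class DecoderBlock(nn.Module):
    """Conv(3,1,1) + PReLU -> ResBlock -> Deconv(4,2,1) + PReLU (doubles H and W)."""
    def __init__(self, c_in: int, c_mid: int, c_out: int):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(c_in, c_mid, 3, 1, 1), nn.PReLU(c_mid))
        self.res = ResBlock(c_mid)
        self.up = nn.Sequential(nn.ConvTranspose2d(c_mid, c_out, 4, 2, 1), nn.PReLU(c_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.res(self.head(x)))
```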
Finally, the linear combination network linearly combines (merges) the input image containing optical flow information and the input image not containing optical flow information to obtain the first intermediate frame image; that is, after the decoding of the last level is finished, the warped intermediate optical flow and the image output by the decoder are linearly combined to obtain the final first intermediate frame image.
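The final linear combination (merge) can be thought of as blending the two warped input frames with a weight map and adding a decoded residual; the sketch below shows one common form of such a merge, where the single-channel mask and the residual term are illustrative assumptions, since the application only states that the flow-containing and flow-free outputs are linearly combined.

```python
import torch

def merge(warped_i0: torch.Tensor, warped_i1: torch.Tensor,
          mask_logits: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    """Linearly combine two warped frames with a blending mask, then add a residual."""
    mask = torch.sigmoid(mask_logits)              # (N, 1, H, W) in [0, 1]
    blended = mask * warped_i0 + (1.0 - mask) * warped_i1
    return torch.clamp(blended + residual, 0.0, 1.0)
```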
In an optional implementation manner, the training method of the first intermediate frame image generation model is as follows:
a first step of constructing a first training data set and a first verification data set of the first intermediate frame image generation model, wherein the first verification data set comprises a plurality of groups of image combinations comprising continuous three-frame images in the same video, the continuous three-frame images in the image combinations comprise a front frame image, an intermediate frame image and a rear frame image, and the first training data set comprises a plurality of groups of image combinations comprising the front frame image, the intermediate frame image and the rear frame image in the same video, and an initial intermediate frame image;
in an alternative embodiment, the initial intermediate frame image is an image having the same size as the previous frame image and the subsequent frame image in the corresponding image combination, and the pixel value is equal to the value of the time interval information.
The application process of the initial intermediate frame image in this embodiment is described by taking the first video frame i0 and the second video frame i1 as the input frames of the first intermediate frame image generation model as an example. For the first video frame i0 and the second video frame i1, a picture frame i_t at an intermediate time is estimated, where the intermediate time t is 0.5. The network input is the first video frame i0, the second video frame i1, and time information t with a pixel value of 0.5 and the same size as i0 and i1; t is an initial intermediate frame image with a time attribute.
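A sketch of how such a constant time channel can be built and concatenated with the two input frames is shown below; the channel ordering and the helper name build_model_input are assumptions, while the value t = 0.5 follows the example above.

```python
import torch

def build_model_input(i0: torch.Tensor, i1: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Concatenate i0, i1 (each N x 3 x H x W) with a constant time plane of value t."""
    n, _, h, w = i0.shape
    t_plane = torch.full((n, 1, h, w), t, dtype=i0.dtype, device=i0.device)
    return torch.cat([i0, i1, t_plane], dim=1)   # N x 7 x H x W

i0 = torch.rand(1, 3, 1088, 1920)
i1 = torch.rand(1, 3, 1088, 1920)
x = build_model_input(i0, i1, 0.5)
print(x.shape)  # torch.Size([1, 7, 1088, 1920])
```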
The calculation formula of the time interval information is as follows:
t = (n1 − n0) / (n2 − n0)
wherein t is time interval information, n0 is the frame number of the previous frame image in the corresponding video, n1 is the frame number of the middle frame image in the corresponding video, and n2 is the frame number of the following frame image in the corresponding video.
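As a worked example of this formula (as reconstructed from the variable definitions above), if a training triplet consists of frames 10, 11 and 12 of a video, then t = (11 − 10)/(12 − 10) = 0.5; the small helper below simply evaluates it.

```python
def time_interval(n0: int, n1: int, n2: int) -> float:
    """Normalised time of the intermediate frame between the front and rear frames."""
    return (n1 - n0) / (n2 - n0)

print(time_interval(10, 11, 12))  # 0.5
```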
A second step of taking a front frame image, a rear frame image and the initial intermediate frame image of the combination of a plurality of groups of images in the first training data set as input of an initial first intermediate frame image generation model and taking the intermediate frame image in the combination of images as target output;
and thirdly, verifying the trained initial first intermediate frame image generation model through the verification data set, and if the verification is passed, obtaining the first intermediate frame image generation model.
In the training process, the loss function needs to be minimized and the parameters back-propagated; the specific calculation process is as follows:
L_total = λ1·L_1 + λ2·L_p
L_1 = ||I_t − I_gt||_1
L_p = ||φ(I_t) − φ(I_gt)||_2
where L_total is the loss function of the whole network, L_1 is the L1 loss between the network output I_t and the ground truth I_gt, L_p is the perceptual loss, φ(·) denotes a VGG-19 network pre-trained on the ImageNet dataset, and λ1, λ2 are preset constants. During training, L_total is minimized, the gradients are updated, and the parameters are back-propagated until the network converges.
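A sketch of this training objective in PyTorch is given below, using torchvision's pre-trained VGG-19 features for the perceptual term; the particular feature layer (the first 36 modules here), the default weights λ1, λ2 and the mean-squared form of the L2 term are assumptions, since the application only states that λ1 and λ2 are preset constants.

```python
import torch
import torch.nn as nn
from torchvision import models

class InterpolationLoss(nn.Module):
    """L_total = lambda1 * ||I_t - I_gt||_1 + lambda2 * ||phi(I_t) - phi(I_gt)||_2."""
    def __init__(self, lambda1: float = 1.0, lambda2: float = 0.1):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:36].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)   # perceptual network is frozen
        self.phi = vgg
        self.lambda1, self.lambda2 = lambda1, lambda2

    def forward(self, i_t: torch.Tensor, i_gt: torch.Tensor) -> torch.Tensor:
        l1 = torch.mean(torch.abs(i_t - i_gt))                 # reconstruction term
        lp = torch.mean((self.phi(i_t) - self.phi(i_gt)) ** 2)  # perceptual term
        return self.lambda1 * l1 + self.lambda2 * lp
```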
It should be noted that, during testing, no additional time information is input; the first video frame i0 and the second video frame i1 of the frame to be interpolated are input directly, so that real-time intermediate flow estimation can be achieved, the network can learn the intermediate-frame optical flow directly end to end, and the additional overhead caused by bidirectional flow estimation is avoided.
In order to obtain a higher frame rate or a higher-definition intermediate frame image, an alternative embodiment proceeds specifically as follows:
constructing an initial convolutional neural network of a second intermediate frame image generation model; constructing a second training data set and a second verification data set; training the initial convolutional neural network through the second training data set to obtain a trained convolutional neural network; verifying the trained convolutional neural network through the second verification data set, and if the verification is passed, obtaining a second intermediate frame image generation model; inputting the first video frame and the second video frame into the second intermediate frame image generation model to obtain a second intermediate frame image; and fusing the second intermediate frame image and the first intermediate frame image to obtain a final intermediate frame image.
In this embodiment, the convolutional neural network is a deep neural network with a convolution structure, which can reduce the memory occupied by the deep network. It has three key operations, namely local receptive fields, weight sharing and pooling, which effectively reduce the number of network parameters and alleviate the over-fitting problem of the model.
Overall architecture of the convolutional neural network: the convolutional neural network is a multi-layer supervised learning neural network, and the convolution layers and pooling (sub-sampling) layers of the hidden layers are the core modules that realize its feature extraction function. The network model uses gradient descent to minimize the loss function, adjusts the weight parameters in the network layer by layer through back-propagation, and improves the accuracy of the network through repeated iterative training. The lower hidden layers of the convolutional neural network consist of alternating convolution layers and max pooling layers, while the higher layers are fully connected layers and a logistic regression classifier corresponding to the hidden layers of a traditional multilayer perceptron. The input of the first fully connected layer is the feature image obtained by feature extraction through the convolution layers and sub-sampling layers. The final output layer is a classifier, which may use logistic regression, softmax regression, or even a support vector machine to classify the input image.
The convolutional neural network structure includes convolution layers, sampling (pooling) layers and fully connected layers. Each layer has a plurality of feature maps, each feature map has a plurality of neurons, and each feature map extracts one feature of the input through a convolution filter.
After the input image is convolved with the filters, local features are extracted; once a local feature has been extracted, its positional relationship with other features is also determined. The input of each neuron is connected to the local receptive field of the previous layer. Each feature extraction layer is followed by a computation layer that performs local averaging and secondary extraction, also called a feature mapping layer; each computation layer of the network consists of a plurality of feature mapping planes, and the weights of all neurons on a plane are equal.
The mapping from the input layer to the hidden layer is usually called a feature mapping; that is, the feature extraction layer is obtained through convolution, after which the feature mapping layer is obtained through pooling.
Compared with a general neural network, the convolutional neural network has the following advantages in image understanding:
1) The network structure can be well adapted to the structure of the image;
2) Simultaneously extracting and classifying the features, so that the feature extraction is beneficial to the feature classification;
3) The weight sharing can reduce training parameters of the network, so that the neural network structure becomes simple and the adaptability is stronger.
Optionally, the second intermediate frame image generation model is a convolutional neural network (CNN) composed of a warping layer and an optical flow estimation layer.
The key point of the second intermediate frame image generation model in this embodiment is that a coarse-to-fine refinement framework is used to make full use of the correlation between multiple consecutive frames, so as to improve the quality of the generated second intermediate frame; finally, the first intermediate frame image and the second intermediate frame image are fused to obtain the target intermediate frame image, where the fusion is mainly the fusion or superposition of optical flow information.
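Since the application does not fix how the two intermediate frames are fused beyond "fusion or superposition", the sketch below shows a simple weighted superposition as one possible reading; the weight alpha is a hypothetical parameter.

```python
import torch

def fuse_intermediate_frames(first: torch.Tensor, second: torch.Tensor,
                             alpha: float = 0.5) -> torch.Tensor:
    """Weighted superposition of the two candidate intermediate frames."""
    return torch.clamp(alpha * first + (1.0 - alpha) * second, 0.0, 1.0)
```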
In summary, in the embodiment of the present application, a first video frame and a second video frame of a frame to be interpolated are input into a first intermediate frame image generation model to obtain an intermediate frame image between the first video frame and the second video frame. In the training process of the first intermediate frame image generation model, real-time intermediate flow estimation is adopted: for different target moments t, the first video frame and the second video frame are used as network inputs, so that the network learns the intermediate-frame optical flow directly end to end, reducing the overhead caused by bidirectional flow estimation. In the model prediction process, pyramid feature images of different sizes are extracted and features of different scales are decoded, gradually refining from the coarse optical flow at the smallest resolution to the fine optical flow at the largest resolution, which improves the optical flow rendering of the intermediate frame image; finally, the optical flow estimated from each level of feature images is used as the input of the next-level decoding network, providing anchor point information for the next-level decoding network, promoting the estimation of the intermediate optical flow at each level, and improving the final frame interpolation effect.
The foregoing details the method of embodiments of the present application, and the apparatus of embodiments of the present application is provided below.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a video frame inserting apparatus 60 according to an embodiment of the present application, where the video frame inserting apparatus 60 may be a server or a device in the server, and the video frame inserting apparatus 60 may include an obtaining unit 601 and a first input unit 602, where detailed descriptions of the respective units are as follows.
An obtaining unit 601, configured to obtain a first video frame and a second video frame in a target video, where the first video frame is a forward frame of the second video frame;
the first input unit 602 is configured to input the first video frame and the second video frame to a pre-trained first intermediate frame image generation model for feature fusion, to obtain a first intermediate frame image between the first video frame and the second video frame, where the first intermediate frame image generation model includes a linear combination network, an N-level feature extraction network, and an N-level decoding network, N is an integer greater than or equal to 2, where:
the i-th level feature extraction network is used for encoding an input image to obtain an image with a transformed size, and performing warp operation on the image with the transformed size and the image containing optical flow information output by the i+1st level decoding network, wherein the image with the transformed size is used as the image input into the i+1st level feature extraction network, and the image input into the 1 st level feature extraction network in the N-level feature extraction network is the first video frame and the second video frame;
The i-th level decoding network is used for performing inverse transformation on an input first image containing optical flow information and an input second image not containing optical flow information, to obtain a third image containing optical flow information after the inverse transformation of the first image and a fourth image not containing optical flow information after the inverse transformation of the second image; the first image is an image obtained by performing a warp operation on the optical flow information output by the i-th level feature extraction network and a fifth image containing optical flow information output by the (i+1)-th level decoding network, and the second image is a sixth image not containing optical flow information output by the (i+1)-th level decoding network; the N-th level decoding network in the N-level decoding network is used for performing inverse transformation on the size-transformed image output by the N-th level feature extraction network to obtain a seventh image and an eighth image after the inverse transformation, where the seventh image serves as the image containing optical flow information output by the N-th level decoding network and the eighth image serves as the image not containing optical flow information output by the N-th level decoding network;
the linear combination network is used for linearly combining the image containing optical flow information output by the 1 st stage decoding network in the N stage decoding network and the image not containing optical flow information to obtain the first intermediate frame image.
In one possible embodiment, the apparatus 60 further comprises:
a first construction unit configured to construct a first training data set and a first verification data set of the first intermediate frame image generation model, where the first verification data set includes a plurality of sets of image combinations including consecutive three frame images in the same video, the consecutive three frame images in the image combinations including a front frame image, an intermediate frame image, and a rear frame image, and the first training data set includes a plurality of sets of image combinations including the front frame image, the intermediate frame image, and the rear frame image in the same video, and an initial intermediate frame image;
a second input unit, configured to take a previous frame image, a subsequent frame image, and the initial intermediate frame image of the combination of the multiple groups of images in the first training data set as input of an initial first intermediate frame image generation model, and take an intermediate frame image in the combination of the images as a target output;
and the verification unit is used for verifying the trained initial first intermediate frame image generation model through the verification data set, and if the verification is passed, the first intermediate frame image generation model is obtained.
In one possible embodiment, the apparatus 60 further comprises:
The second construction unit is used for constructing an initial convolutional neural network of a second intermediate frame image generation model;
a third construction unit for constructing a second training data set and a second verification data set;
the training unit is used for training the initial convolutional neural network through the second training data set to obtain a trained convolutional neural network;
the verification unit is used for verifying the trained convolutional neural network through the second verification data set, and if the verification is passed, the second intermediate frame image generation model is obtained;
the third input unit is used for inputting the first video frame and the second video frame into the second intermediate frame image generation model to obtain a second intermediate frame image;
and the fusion unit is used for fusing the second intermediate frame image and the first intermediate frame image to obtain a final intermediate frame image.
In one possible embodiment, the third construction unit is specifically configured to:
acquiring a sample video, wherein the sample video is a video retaining an original frame rate;
cutting the sample video to obtain a plurality of groups of image combinations containing continuous three-frame images, wherein the continuous three-frame images in the image combinations comprise a front frame image, an intermediate frame image and a rear frame image;
Taking a front frame image and a rear frame image of a plurality of groups of image combinations in the training data set as input of an initial convolutional neural network of the second intermediate frame image generation model, and taking the intermediate frame image in the image combinations as target output;
optimizing the initial convolutional neural network by constructing a target loss function, and performing iterative training on the initial convolutional neural network through the training data set to obtain a trained convolutional neural network.
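As an illustration of this data construction step, the sketch below cuts a decoded frame sequence into overlapping groups of three consecutive frames (front, intermediate, rear); decoding the sample video with OpenCV's cv2.VideoCapture is an assumption of convenience, not something specified in the application.

```python
import cv2  # assumed dependency for decoding the sample video

def video_to_triplets(path: str):
    """Yield (front, intermediate, rear) frame triplets from a video kept at its original frame rate."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    # Consecutive, overlapping triplets: (f0, f1, f2), (f1, f2, f3), ...
    return [(frames[i], frames[i + 1], frames[i + 2]) for i in range(len(frames) - 2)]
```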
Referring to fig. 7, fig. 7 is a schematic structural diagram of a video frame inserting apparatus 70 according to an embodiment of the present application, where the video frame inserting apparatus 70 may be the server or a device in the server, and the video frame inserting apparatus 70 includes: a processor 701, a communication interface 702 and a memory 703. The processor 701, the communication interface 702, and the memory 703 may be connected by a bus or other means, which is exemplified in the embodiment of the present application.
The processor 701 is the computing core and control core of the video frame inserting apparatus 70, and may parse various instructions in the video frame inserting apparatus 70 and various data of the video frame inserting apparatus 70, for example: the processor 701 may be a central processing unit (Central Processing Unit, CPU) that can transfer various types of interaction data between the internal structures of the video frame inserting apparatus 70, and so on. The communication interface 702 may optionally include a standard wired interface or a wireless interface (e.g., Wi-Fi, mobile communication interface, etc.), and may be controlled by the processor 701 to receive and transmit data; the communication interface 702 may also be used for transmission or interaction of signaling or instructions within the video frame inserting apparatus 70. The memory 703 (Memory) is a storage device in the video frame inserting apparatus 70 for storing programs and data. It will be appreciated that the memory 703 here may include both a built-in memory of the video frame inserting apparatus 70 and an extended memory supported by the video frame inserting apparatus 70. The memory 703 provides a storage space that stores the operating system of the video frame inserting apparatus 70, and also stores the program code or instructions required by the processor to perform corresponding operations and, optionally, related data generated by the processor after performing the corresponding operations.
In the present embodiment, the processor 701 executes executable program code in the memory 703 for performing the following operations:
acquiring a first video frame and a second video frame in a target video, wherein the first video frame is a forward frame of the second video frame;
inputting the first video frame and the second video frame into a pre-trained first intermediate frame image generation model to perform feature fusion to obtain a first intermediate frame image between the first video frame and the second video frame, wherein the first intermediate frame image generation model comprises a linear combination network, an N-level feature extraction network and an N-level decoding network, and N is an integer greater than or equal to 2, wherein:
the i-th level feature extraction network is used for encoding an input image to obtain an image with a transformed size, and performing warp operation on the image with the transformed size and the image containing optical flow information output by the i+1st level decoding network, wherein the image with the transformed size is used as the image input into the i+1st level feature extraction network, and the image input into the 1 st level feature extraction network in the N-level feature extraction network is the first video frame and the second video frame;
The i-th level decoding network is used for performing inverse transformation on an input first image containing optical flow information and an input second image not containing optical flow information, to obtain a third image containing optical flow information after the inverse transformation of the first image and a fourth image not containing optical flow information after the inverse transformation of the second image; the first image is an image obtained by performing a warp operation on the optical flow information output by the i-th level feature extraction network and a fifth image containing optical flow information output by the (i+1)-th level decoding network, and the second image is a sixth image not containing optical flow information output by the (i+1)-th level decoding network; the N-th level decoding network in the N-level decoding network is used for performing inverse transformation on the size-transformed image output by the N-th level feature extraction network to obtain a seventh image and an eighth image after the inverse transformation, where the seventh image serves as the image containing optical flow information output by the N-th level decoding network and the eighth image serves as the image not containing optical flow information output by the N-th level decoding network;
the linear combination network is used for linearly combining the image containing optical flow information output by the 1 st stage decoding network in the N stage decoding network and the image not containing optical flow information to obtain the first intermediate frame image.
In an alternative, the processor 701 is further configured to:
constructing a first training data set and a first verification data set of the first intermediate frame image generation model, wherein the first verification data set comprises a plurality of groups of image combinations comprising continuous three-frame images in the same video, the continuous three-frame images in the image combinations comprise a front frame image, an intermediate frame image and a rear frame image, and the first training data set comprises a plurality of groups of image combinations comprising the front frame image, the intermediate frame image and the rear frame image in the same video and an initial intermediate frame image;
taking a front frame image, a rear frame image and the initial intermediate frame image of the multi-group image combination in the first training data set as input of an initial first intermediate frame image generation model, and taking the intermediate frame image in the image combination as target output;
and verifying the trained initial first intermediate frame image generation model through the verification data set, and if the verification is passed, obtaining the first intermediate frame image generation model.
In an alternative, the processor 701 is further configured to:
constructing an initial convolutional neural network of a second intermediate frame image generation model;
Constructing a second training data set and a second verification data set;
training the initial convolutional neural network through the second training data set to obtain a trained convolutional neural network;
verifying the trained convolutional neural network through the second verification data set, and if the verification is passed, obtaining a second intermediate frame image generation model;
inputting the first video frame and the second video frame into the second intermediate frame image generation model to obtain a second intermediate frame image;
and fusing the second intermediate frame image and the first intermediate frame image to obtain a final intermediate frame image.
In one alternative, the processor 701 is specifically configured to, in constructing the second training data set and the second validation data set:
acquiring a sample video, wherein the sample video is a video retaining an original frame rate;
cutting the sample video to obtain a plurality of groups of image combinations containing continuous three-frame images, wherein the continuous three-frame images in the image combinations comprise a front frame image, an intermediate frame image and a rear frame image;
taking a front frame image and a rear frame image of a plurality of groups of image combinations in the training data set as input of an initial convolutional neural network of the second intermediate frame image generation model, and taking the intermediate frame image in the image combinations as target output;
Optimizing the initial convolutional neural network by constructing a target loss function, and performing iterative training on the initial convolutional neural network through the training data set to obtain a trained convolutional neural network.
Embodiments of the present application provide a computer readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to implement operations performed by a server in the embodiments described above.
Embodiments of the present application also provide a computer program product that, when run on a processor, implements the operations performed by the server in the embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above-described embodiment methods may be accomplished by a program that instructs related hardware, and the program may be stored in a computer-readable storage medium, and the program may include the above-described embodiment methods when executed. And the aforementioned storage medium includes: various media capable of storing program code, such as ROM, RAM, magnetic or optical disks.

Claims (10)

1. A method of video framing, the method comprising:
acquiring a first video frame and a second video frame in a target video, wherein the first video frame is a forward frame of the second video frame;
inputting the first video frame and the second video frame into a pre-trained first intermediate frame image generation model to perform feature fusion to obtain a first intermediate frame image between the first video frame and the second video frame, wherein the first intermediate frame image generation model comprises a linear combination network, an N-level feature extraction network and an N-level decoding network, and N is an integer greater than or equal to 2, wherein:
the i-th level feature extraction network is used for encoding an input image to obtain an image with a transformed size, and performing warp operation on the image with the transformed size and the image containing optical flow information output by the i+1st level decoding network, wherein the image with the transformed size is used as the image input into the i+1st level feature extraction network, and the image input into the 1 st level feature extraction network in the N-level feature extraction network is the first video frame and the second video frame;
the method comprises the steps that an i-th level decoding network is used for carrying out inverse transformation on a first image containing optical flow information and a second image not containing optical flow information, which are input, respectively obtaining a third image containing optical flow information after the first image is subjected to inverse transformation and a fourth image not containing optical flow information after the second image is subjected to inverse transformation, wherein the first image is an image obtained by carrying out warping operation on the optical flow information output by the i-th level feature extraction network and a fifth image containing optical flow information output by the i+1-th level decoding network, the second image is a sixth image not containing optical flow information output by the i+1-th level decoding network, the N-th level decoding network in the N-level decoding network is used for carrying out inverse transformation on the image with the transformed size output by the N-th level feature extraction network, and obtaining a seventh image and an eighth image after inverse transformation, wherein the seventh image is used as the image containing optical flow information output by the N-th level decoding network, and the eighth image is used as the image not containing optical flow information output by the N-th level decoding network;
The linear combination network is used for linearly combining the image containing optical flow information output by the 1 st stage decoding network in the N stage decoding network and the image not containing optical flow information to obtain the first intermediate frame image.
2. The method of claim 1, further comprising, prior to acquiring the first video frame and the second video frame in the target video:
constructing a first training data set and a first verification data set of the first intermediate frame image generation model, wherein the first verification data set comprises a plurality of groups of image combinations comprising continuous three-frame images in the same video, the continuous three-frame images in the image combinations comprise a front frame image, an intermediate frame image and a rear frame image, and the first training data set comprises a plurality of groups of image combinations comprising the front frame image, the intermediate frame image and the rear frame image in the same video and an initial intermediate frame image;
taking a front frame image, a rear frame image and the initial intermediate frame image of the multi-group image combination in the first training data set as input of an initial first intermediate frame image generation model, and taking the intermediate frame image in the image combination as target output;
And verifying the trained initial first intermediate frame image generation model through the verification data set, and if the verification is passed, obtaining the first intermediate frame image generation model.
3. The method of claim 2, wherein the initial intermediate frame image is an image of the same size as the preceding frame image and the following frame image in the corresponding image combination, and the pixel value is equal to the value of the time interval information.
4. A method according to claim 3, wherein the time interval information is calculated as follows:
t = (n1 − n0) / (n2 − n0)
wherein t is time interval information, n0 is the frame number of the previous frame image in the corresponding video, n1 is the frame number of the middle frame image in the corresponding video, and n2 is the frame number of the following frame image in the corresponding video.
5. The method according to claim 1, wherein the image after the size transformation is a pyramid feature image with different scales obtained by the i-th level feature extraction network according to a pyramid feature extraction method.
6. The method of claim 1, wherein after inputting the first video frame and the second video frame into a pre-trained first inter-frame image generation model for feature fusion, obtaining a first inter-frame image between the first video frame and the second video frame, the method further comprises:
Constructing an initial convolutional neural network of a second intermediate frame image generation model;
constructing a second training data set and a second verification data set;
training the initial convolutional neural network through the second training data set to obtain a trained convolutional neural network;
verifying the trained convolutional neural network through the second verification data set, and if the verification is passed, obtaining a second intermediate frame image generation model;
inputting the first video frame and the second video frame into the second intermediate frame image generation model to obtain a second intermediate frame image;
and fusing the second intermediate frame image and the first intermediate frame image to obtain a final intermediate frame image.
7. The method of any of claims 1-6, wherein the first intermediate frame image generation model comprises a linear combination network, an N-level feature extraction network, and an N-level decoding network, N being an integer greater than or equal to 2; wherein:
the N-level feature extraction network consists of two layers of convolutions, wherein the convolution kernel of the first layer of convolutions is 3*3, the step length is 2, the padding value is 1, the convolution kernel of the second layer of convolutions is 3*3, the step length is 1, and the padding value is 1;
The N-level decoding network comprises a multi-layer decoder, wherein the decoder consists of a convolution layer, a residual block and a deconvolution layer, the convolution kernel of the convolution layer is 3*3, the step length is 1, and the padding is 1; the residual block consists of 5 layers of convolutions, the convolution kernel of each layer of convolutions is 3*3, the step length is 1, and the padding is 1; the deconvolution layer has a convolution kernel size of 4*4, a step size of 2, and a padding of 1.
8. A video framing apparatus, the apparatus comprising:
an acquisition unit configured to acquire a first video frame and a second video frame in a target video, the first video frame being a forward frame of the second video frame;
the first input unit is configured to input the first video frame and the second video frame to a pre-trained first intermediate frame image generation model for feature fusion, so as to obtain a first intermediate frame image between the first video frame and the second video frame, where the first intermediate frame image generation model includes a linear combination network, an N-level feature extraction network, and an N-level decoding network, N is an integer greater than or equal to 2, where:
the i-th level feature extraction network is used for encoding an input image to obtain an image with a transformed size, and performing warp operation on the image with the transformed size and the image containing optical flow information output by the i+1st level decoding network, wherein the image with the transformed size is used as the image input into the i+1st level feature extraction network, and the image input into the 1 st level feature extraction network in the N-level feature extraction network is the first video frame and the second video frame;
The method comprises the steps that an i-th level decoding network is used for carrying out inverse transformation on a first image containing optical flow information and a second image not containing optical flow information, which are input, respectively obtaining a third image containing optical flow information after the first image is subjected to inverse transformation and a fourth image not containing optical flow information after the second image is subjected to inverse transformation, wherein the first image is an image obtained by carrying out warping operation on the optical flow information output by the i-th level feature extraction network and a fifth image containing optical flow information output by the i+1-th level decoding network, the second image is a sixth image not containing optical flow information output by the i+1-th level decoding network, the N-th level decoding network in the N-level decoding network is used for carrying out inverse transformation on the image with the transformed size output by the N-th level feature extraction network, and obtaining a seventh image and an eighth image after inverse transformation, wherein the seventh image is used as the image containing optical flow information output by the N-th level decoding network, and the eighth image is used as the image not containing optical flow information output by the N-th level decoding network;
the linear combination network is used for linearly combining the image containing optical flow information output by the 1 st stage decoding network in the N stage decoding network and the image not containing optical flow information to obtain the first intermediate frame image.
9. A video-in-frame device, characterized in that it comprises at least one processor, a communication interface for transmitting and/or receiving data, and a memory for storing a computer program, said at least one processor being adapted to invoke the computer program stored in the at least one memory for implementing the method according to any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when run on a processor, implements the method according to any of claims 1-7.
CN202310196193.1A 2023-02-24 2023-02-24 Video frame inserting method, device and storage medium Pending CN116193161A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310196193.1A CN116193161A (en) 2023-02-24 2023-02-24 Video frame inserting method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310196193.1A CN116193161A (en) 2023-02-24 2023-02-24 Video frame inserting method, device and storage medium

Publications (1)

Publication Number Publication Date
CN116193161A true CN116193161A (en) 2023-05-30

Family

ID=86442135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310196193.1A Pending CN116193161A (en) 2023-02-24 2023-02-24 Video frame inserting method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116193161A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Country or region after: China
Address after: 518000 18th floor, building A4, Kexing Science Park, Nanshan District, Shenzhen City, Guangdong Province
Applicant after: Zhaolian Consumer Finance Co.,Ltd.
Address before: 518000 18th floor, building A4, Kexing Science Park, Nanshan District, Shenzhen City, Guangdong Province
Applicant before: MERCHANTS UNION CONSUMER FINANCE Co.,Ltd.
Country or region before: China