CN114882416A - Video frame synthesis method, device, equipment and storage medium

Video frame synthesis method, device, equipment and storage medium

Info

Publication number
CN114882416A
Authority
CN
China
Prior art keywords
video frame
semantic features
convolution
fused
motion
Prior art date
Legal status
Pending
Application number
CN202210547819.4A
Other languages
Chinese (zh)
Inventor
程辉
刘松鹏
阮哲
王立学
Current Assignee
China Mobile Communications Group Co Ltd
MIGU Video Technology Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Video Technology Co Ltd
MIGU Culture Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Video Technology Co Ltd, MIGU Culture Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202210547819.4A priority Critical patent/CN114882416A/en
Publication of CN114882416A publication Critical patent/CN114882416A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/24 - Aligning, centring, orientation detection or correction of the image
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a video frame synthesis method, apparatus, device and storage medium. A video frame sequence is input into a preset hybrid spatio-temporal convolutional network to obtain semantic features of the sequence at different spatio-temporal scales; feature fusion is performed on these semantic features to obtain fused semantic features; and a video frame synthesis result is determined according to the fused semantic features. Compared with prior approaches that synthesize video frames by densely estimating the motion between given video frames, this approach yields synthesized video frames of higher quality.

Description

Video frame synthesis method, device, equipment and storage medium
Technical Field
The present invention relates to the field of video frame synthesis technologies, and in particular, to a video frame synthesis method, apparatus, device, and storage medium.
Background
The basic idea of video frame synthesis is to interpolate and extrapolate frames on the basis of existing video frames by estimating the motion relationships between given video frames. Interpolation refers to synthesizing one or more frames between two given frames or video segments, so that a visually smooth and reasonable transition is obtained between them; extrapolation refers to synthesizing a video frame or segment at the beginning or end of a given video segment from the existing content, so that the synthesized segment can visually serve as content preceding or following the existing video.
Traditional video frame synthesis methods mainly rely on densely estimating the motion between given video frames, such as dense optical flow, then deforming and transforming the input frames according to the estimated motion relationships, and finally performing interpolation or extrapolation. The performance of such optical-flow-guided methods depends heavily on the quality of the optical flow estimation, and the edges of moving objects in the synthesized frames often exhibit obvious artifacts, which seriously degrade the visual quality of the synthesized frames.
Disclosure of Invention
The invention mainly aims to provide a video frame synthesis method, apparatus, device and storage medium, so as to solve the technical problem of low visual quality of synthesized video frames in the prior art.
In order to achieve the above object, the present invention provides a video frame synthesis method, including the following steps:
inputting a video frame sequence into a preset mixed space-time convolutional network to obtain semantic features of the video frame sequence under different space-time scales;
performing feature fusion on the semantic features to obtain fused semantic features;
and determining a video frame synthesis result according to the fused semantic features.
Optionally, the preset hybrid space-time convolutional network comprises: a preset 3D convolution layer and a preset time domain pooling layer;
the step of inputting the video frame sequence into a preset mixed space-time convolutional network to obtain semantic features of the video frame sequence under different space-time scales comprises:
inputting a video frame sequence to the preset 3D convolution layer for convolution operation to obtain a video frame sequence after 3D convolution;
performing time domain pooling operation on the video frame sequence after the 3D convolution through the preset time domain pooling layer to obtain a video frame sequence with a time dimension being a preset dimension;
and performing convolution operation on the video frame sequence with the preset dimensionality to obtain semantic features of the video frame sequence at different spatio-temporal scales.
Optionally, the preset hybrid space-time convolutional network further includes: a preset 2D convolution layer;
performing convolution operation on the video frame sequence with the preset dimensionality to obtain semantic features of the video frame sequence under different spatio-temporal scales, wherein the method comprises the following steps:
and performing convolution operation on the video frame sequence with the preset dimensionality through the preset 2D convolution layer to obtain semantic features of the video frame sequence under different space-time scales.
Optionally, the step of determining a video frame synthesis result according to the fused semantic features includes:
performing adaptive convolution on the fused semantic features to obtain an adaptive convolution result;
performing motion blur correction on the fused semantic features to obtain a motion blur correction result;
and superposing the self-adaptive convolution result and the motion blur correction result to obtain a video frame synthesis result.
Optionally, the step of performing motion blur correction on the fused semantic features to obtain a motion blur correction result includes:
determining the motion degree of each pixel in the video frame sequence based on the fused semantic features;
determining pixels belonging to motion blur according to the motion degree, and determining a motion mask of the pixels belonging to motion blur;
determining correction bias for correcting motion blur according to the fused semantic features;
and determining the motion blur correction result according to the motion mask and the correction bias.
Optionally, the step of performing adaptive convolution on the fused semantic features to obtain an adaptive convolution result includes:
determining a vertical direction motion convolution kernel and a horizontal direction motion convolution kernel corresponding to the video frame sequence according to the fused semantic features;
and simulating a 2D convolution kernel according to the vertical direction motion convolution kernel and the horizontal direction motion convolution kernel, and determining an adaptive convolution result through the 2D convolution kernel.
Optionally, the step of performing feature fusion on the semantic features to obtain fused semantic features includes:
scaling the semantic features to a target spatial resolution through a preset multi-scale space-time feature fusion model;
merging the channel dimensions of each zoomed semantic feature to obtain semantic features to be fused;
and fusing the semantic features to be fused to obtain fused semantic features.
Further, to achieve the above object, the present invention also provides a video frame composition apparatus, comprising:
the semantic feature determination module is used for inputting the video frame sequence into a preset mixed space-time convolution network so as to obtain semantic features of the video frame sequence under different space-time scales;
the fusion module is used for carrying out feature fusion on the semantic features to obtain fused semantic features;
and the synthesis module is used for determining a video frame synthesis result according to the fused semantic features.
Further, to achieve the above object, the present invention also proposes a video frame composition apparatus, comprising: a memory, a processor and a video frame composition program stored on the memory and executable on the processor, the video frame composition program configured to implement the steps of the video frame composition method as described above.
Furthermore, to achieve the above object, the present invention also proposes a storage medium having stored thereon a video frame composition program which, when executed by a processor, implements the steps of the video frame composition method as described above.
The method comprises the steps of inputting a video frame sequence into a preset hybrid spatio-temporal convolutional network to obtain semantic features of the video frame sequence at different spatio-temporal scales; performing feature fusion on the semantic features to obtain fused semantic features; and determining a video frame synthesis result according to the fused semantic features. Compared with the existing approach of synthesizing video frames by densely estimating the motion between given video frames, the invention does not rely on optical flow to guide the synthesis of video frames, but determines the video frame synthesis result by fusing semantic features of the video frames at different spatio-temporal scales, which can avoid obvious artifacts in the synthesized video and improve the quality of the synthesized video frames.
Drawings
FIG. 1 is a schematic structural diagram of a video frame synthesizing device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a video frame synthesizing method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a video frame synthesizing method according to a second embodiment of the present invention;
FIG. 4 is a flowchart illustrating a video frame synthesizing method according to a third embodiment of the present invention;
FIG. 5 is a block diagram of a video frame synthesizing apparatus according to a first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a video frame synthesizing device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the video frame composition apparatus may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (WI-FI) interface). The memory 1005 may be a high-speed Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the architecture shown in fig. 1 does not constitute a limitation of video frame compositing equipment, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a network communication module, a user interface module, and a video frame composition program.
In the video frame composition apparatus shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 and the memory 1005 in the video frame composition apparatus of the present invention call, through the processor 1001, the video frame composition program stored in the memory 1005 and execute the video frame synthesis method provided by the embodiments of the present invention.
Based on the video frame synthesis device, an embodiment of the present invention provides a video frame synthesis method, and referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of the video frame synthesis method according to the present invention.
In this embodiment, the video frame synthesis method includes the following steps:
step S10: inputting a video frame sequence into a preset mixed space-time convolution network so as to obtain semantic features of the video frame sequence under different space-time scales.
It should be noted that the execution subject of this embodiment may be a computing device with data processing, network communication and program running functions, such as a mobile phone, a tablet computer or a personal computer, another electronic device with the same or similar functions, or a Frame Synthesis Network (FSN) architecture. This embodiment and the following embodiments are described by taking the FSN as an example.
It should be noted that the video frame sequence may be a specified sequence or video segment composed of a plurality of video frames to be used for synthesis. The preset hybrid spatio-temporal convolutional network may be a 3D+2D hybrid spatio-temporal convolutional network (HSTCNN) with a time domain pooling structure. In this embodiment, the preset hybrid spatio-temporal convolutional network includes a preset 3D convolution layer, a preset time domain pooling layer and a preset 2D convolution layer. The semantic features may include high-level features, middle-level features and shallow features: the high-level features may include features related to semantic correctness, the middle-level features may include features related to object coherence, and the shallow features may include features related to edge sharpness.
Step S20: and performing feature fusion on the semantic features to obtain fused semantic features.
It should be noted that the feature fusion of the semantic features may be performed by scaling the three levels of features to a uniform spatial resolution through a Multi-scale Spatio-Temporal Aggregation (MSTA) structure.
Further, in order to improve the visual quality of the synthesized frame, the step S20 may include: scaling the semantic features to a target spatial resolution through a preset multi-scale space-time feature fusion model; merging the channel dimensions of each zoomed semantic feature to obtain semantic features to be fused; and fusing the semantic features to be fused to obtain fused semantic features.
It should be noted that the target spatial resolution may be the spatial resolution of the shallow features. Merging the channel dimensions of the scaled semantic features may mean that, after the spatial resolutions of the high-level and middle-level features are unified to that of the shallow features, the resolution-adjusted features are concatenated along the channel dimension. The semantic features to be fused are the channel-merged semantic features, and fusing them may be performed with a group of convolutions to obtain the fused semantic features.
Step S30: and determining a video frame synthesis result according to the fused semantic features.
In specific implementation, after the fused semantic features are obtained, adaptive convolution operation can be performed on the fused semantic features through a convolution neural network, so that a video frame synthesis result is obtained.
Further, considering that parts of the image frame obtained by adaptive convolution may be blurred, and in order to improve the quality of the synthesized video frame, this embodiment further performs motion blur correction on the fused semantic features and then fuses the adaptively convolved frame with the motion-blur-corrected frame to obtain the video frame synthesis result. Specifically, adaptive convolution and motion blur correction may be applied to the fused semantic features respectively, and the adaptive convolution result and the motion blur correction result are superimposed to form the video frame synthesis result.
The adaptive convolution and motion blur correction of the fused semantic features may be performed by a motion-blur-aware adaptive convolution model (MA-AdaConv).
In this embodiment, the MA-AdaConv includes an Adaptive Convolution Stream (ACS) and a Motion-blur Correction Stream (MCS). The ACS is configured to estimate the convolution kernels of the adaptive convolution, so that each pixel of the composite frame is generated by an adaptive convolution operation on the input frames; the MCS detects and marks the locations of drastic motion in the video and generates correction offsets for the pixels at those locations. Finally, the correction offsets output by the MCS are superposed, by addition, onto the corresponding synthesized video frame obtained through the ACS, so as to obtain a high-quality video frame synthesis result.
It should be noted that the adaptive convolution of the fused semantic features may be performed by a predictor formed from a plurality of convolutional layers and upsampling layers included in the adaptive convolution module, which performs dense convolution kernel prediction on a block-by-block basis; the video frame pixels are then synthesized by the adaptive convolution module, yielding the adaptive convolution result. The adaptive convolution module may be an Adaptive Convolution Stream (ACS). The motion blur correction of the fused semantic features may address, through a preset motion blur correction model, the problem that regions of the synthesized frame corresponding to severe motion are blurred. The preset motion blur correction model may be a Motion-blur Correction Stream (MCS).
It will be appreciated that, given high-quality video features, the ACS already provides good synthesis results for most of the video frame; however, for pixels undergoing severe motion, synthesis with the ACS alone may still produce blurred results, because such pixels are difficult to obtain by adaptively convolving the nearby pixels in the input frames that can be covered by the adaptive convolution kernel at the current position. Although perceptual loss and gradient loss have been proposed in the academic literature to help optimize the adaptive convolution and thus alleviate this blurring problem, the overall effect is poor. To solve the problem that regions of the adaptively convolved synthesized frame corresponding to severe motion are blurred, this embodiment innovatively designs a motion blur correction model, i.e., the blurring problem is solved by the preset motion blur correction model.
In a specific implementation, this embodiment proposes a novel Frame Synthesis Network (FSN) architecture. In summary, the FSN architecture includes three core components: a 3D+2D hybrid spatio-temporal convolutional network (HSTCNN) with a time domain pooling structure, a multi-scale spatio-temporal feature fusion structure (MSTA), and a motion-blur-resistant adaptive convolution structure (MA-AdaConv). For an input sequence of T video frames of spatial size H × W, where H and W denote the height and width of the video frames respectively, the FSN outputs a composite frame, which may be interpolated or extrapolated. The input video frame sequence is first sent to the HSTCNN, and the forward propagation of the network is completed without high computational cost. The MSTA structure then obtains from the HSTCNN three groups of low-level, middle-level and high-level video feature descriptions extracted at different spatio-temporal scales, namely the above-mentioned shallow, middle-level and high-level features, and through feature fusion guarantees that the video features used by the subsequent MA-AdaConv have both accurate spatial localization capability and rich semantic expression capability. The MA-AdaConv first obtains the convolution kernels for adaptive convolution through the adaptive convolution stream (ACS) according to the received video features; the motion blur correction stream (MCS) then judges the motion intensity of different spatial regions in the video according to the input features, marks the regions with intense motion using a motion mask, and estimates from the video features a correction bias map for correcting the motion blur present in the adaptive convolution output; finally, the model outputs the synthesized frame after motion blur correction. The structural design, parameter settings and operation mechanism of each module are described in detail in the following embodiments. In the training phase, the whole FSN can be optimized end to end by minimizing the pixel-level difference between the synthesized frame output by the model and the real frame used as the label. In inference, the FSN can be directly used to realize end-to-end video frame synthesis.
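For readers who prefer code to prose, the overall data flow of the FSN just described can be summarized in a short sketch. The following PyTorch-style outline is purely illustrative (the patent itself reports a Caffe implementation); the function name, the three stand-in callables and all tensor shapes are assumptions introduced here, not parts of the patent.

```python
import torch

def synthesize_frame(frames: torch.Tensor, hstcnn, msta, ma_adaconv) -> torch.Tensor:
    """Illustrative FSN forward pass.

    frames:     (B, 3, T, H, W) tensor, an input clip of T video frames.
    hstcnn:     callable returning shallow / middle / high-level features.
    msta:       callable fusing the three feature levels into one map.
    ma_adaconv: callable returning the ACS frame, the motion mask and the correction bias.
    """
    # 1. Hybrid spatio-temporal CNN: semantic features at several spatio-temporal scales.
    shallow, mid, high = hstcnn(frames)

    # 2. Multi-scale spatio-temporal aggregation (MSTA): fuse the three levels.
    fused = msta(shallow, mid, high)

    # 3. Motion-blur-aware adaptive convolution (MA-AdaConv):
    #    ACS output plus the masked correction bias from the MCS.
    pred_acs, motion_mask, correction_bias = ma_adaconv(fused, frames)
    return pred_acs + motion_mask * correction_bias
```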
In this embodiment, a video frame sequence is input into a preset hybrid spatio-temporal convolutional network to obtain semantic features of the video frame sequence at different spatio-temporal scales; feature fusion is performed on the semantic features to obtain fused semantic features; and a video frame synthesis result is determined according to the fused semantic features. Compared with the existing approach of synthesizing video frames by densely estimating the motion between given video frames, this embodiment does not rely on optical flow to guide the synthesis of video frames, but determines the synthesis result by fusing the semantic features of the video frames at different spatio-temporal scales, which can avoid obvious artifacts in the synthesized video and improve the quality of the synthesized video frames.
Referring to fig. 3, fig. 3 is a flowchart illustrating a video frame synthesizing method according to a second embodiment of the present invention.
Based on the first embodiment described above, in the present embodiment, the step S10 includes:
step S101: and inputting the video frame sequence into the preset 3D convolution layer for convolution operation to obtain the video frame sequence after 3D convolution.
It should be noted that the preset 3D convolution layer includes at least one 3D convolutional layer together with the preset time domain pooling layer. When there are two 3D convolutional layers, the first 3D convolutional layer adopts a convolution with a stride of 2 and downsamples the feature map input to this stage, that is, it downsamples the input video frame sequence; the stride of the second 3D convolutional layer may be set to 1 to keep the resolution of the feature map unchanged. Thus, the output of each convolution stage has a different spatial resolution. A 3D convolution operation is applied in both the first and the second convolution stage to obtain the 3D-convolved video frame sequence.
Step S102: and performing time domain pooling operation on the video frame sequence after the 3D convolution through the preset time domain pooling layer to obtain the video frame sequence with the time dimension being the preset dimension.
It should be noted that, in order to synthesize a single video frame from the 3D-convolved video frame sequence, the preset dimension may be a time dimension of 1. Performing the time domain pooling operation on the 3D-convolved video frame sequence through the preset time domain pooling layer may mean reducing the time dimension of the input video frame sequence to 1 through the pooling operation of the preset time domain pooling layer.
Step S103: and performing convolution operation on the video frame sequence with the preset dimensionality to obtain semantic features of the video frame sequence under different space-time scales.
It should be noted that performing a convolution operation on the video frame sequence with the preset time dimension to obtain semantic features of the video frame sequence at different spatio-temporal scales may be done through a preset 2D convolutional layer or a 3D convolutional layer. In order to save convolution overhead, the preset 2D convolution layer may be used to perform the convolution operation on the video frame sequence with the preset time dimension, so as to obtain the semantic features of the video frame sequence at different spatio-temporal scales.
It should be noted that the preset 2D convolution layer may consist of convolutional layers with a stride of 1, so as to keep the resolution of the feature map unchanged.
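As a concrete illustration of steps S101 to S103, the sketch below assembles a minimal 3D+2D hybrid backbone in PyTorch. The channel widths, kernel sizes and number of layers are assumptions made for illustration, and the stride of the first 3D convolution is applied only spatially here since the text does not specify whether the temporal axis is also strided; only the ordering (strided 3D convolution, 3D convolution with stride 1, temporal pooling down to a single time step, 2D convolutions with stride 1) follows the description above.

```python
import torch
import torch.nn as nn

class TinyHSTCNN(nn.Module):
    """Minimal, illustrative 3D+2D hybrid spatio-temporal backbone."""

    def __init__(self, in_ch: int = 3, ch: int = 32):
        super().__init__()
        # First 3D convolution: stride 2 (spatially) downsamples the feature map.
        self.conv3d_a = nn.Conv3d(in_ch, ch, kernel_size=3, stride=(1, 2, 2), padding=1)
        # Second 3D convolution: stride 1 keeps the resolution unchanged.
        self.conv3d_b = nn.Conv3d(ch, ch, kernel_size=3, stride=1, padding=1)
        # 2D convolutions with stride 1 applied after the time axis is pooled away.
        self.conv2d = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 3, T, H, W)
        x = torch.relu(self.conv3d_a(x))
        x = torch.relu(self.conv3d_b(x))
        # Time domain pooling layer: collapse the time dimension to the preset dimension 1.
        x = x.mean(dim=2)                                 # (B, ch, H/2, W/2)
        return self.conv2d(x)
```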
It should be understood that the four basic convolution operation units available in a convolutional neural network include the straight-through 2D convolution, the residual 2D convolution, the straight-through 3D convolution and the residual 3D convolution.
Straight-through or residual 2D convolution block: the straight-through 2D convolution module realizes the feature learning target x_(t+1) = F_Conv2D(x_t) through directly cascaded 2D convolution layers, where F_Conv2D represents a non-linear function based on 2D convolution. In order to alleviate the gradient vanishing problem faced by the straight-through 2D convolution, the residual 2D convolution, designed based on the residual learning idea, can be expressed as:
x_(t+1) = x_t + F_Conv2D(x_t)
where F_Conv2D is used to learn the residual between x_(t+1) and x_t.
Straight-through or residual 3D convolution block: a natural way to encode spatio-temporal information is to directly upgrade the 2D convolutions in the two 2D convolution blocks above to 3D convolutions, thereby simultaneously modeling the spatial information within each frame of the input video sequence and the correlations between frames. The corresponding operations are expressed as:
x_(t+1) = F_Conv3D(x_t)
x_(t+1) = x_t + F_Conv3D(x_t)
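The four block types can be written down in a few lines. The sketch below is an illustrative PyTorch rendering of the formulas above; the layer widths and the ReLU activations are assumptions, not configurations taken from the patent.

```python
import torch.nn as nn

class Plain2D(nn.Module):              # x_(t+1) = F_Conv2D(x_t)
    def __init__(self, ch: int):
        super().__init__()
        self.f = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.f(x)

class Res2D(Plain2D):                  # x_(t+1) = x_t + F_Conv2D(x_t)
    def forward(self, x):
        return x + self.f(x)           # F_Conv2D learns the residual between x_(t+1) and x_t

class Plain3D(nn.Module):              # x_(t+1) = F_Conv3D(x_t)
    def __init__(self, ch: int):
        super().__init__()
        self.f = nn.Sequential(nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.f(x)

class Res3D(Plain3D):                  # x_(t+1) = x_t + F_Conv3D(x_t)
    def forward(self, x):
        return x + self.f(x)
```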
in a specific implementation, the principle of using only the above 4 different types of convolution calculation units is followed in this embodiment, and in the case of using only 2D blocks or 2D and 3D blocks, this embodiment designs a CNN structure, i.e., HSTCNN, for encoding input video spatio-temporal information, which includes two possible design schemes. One is the "Early Fusion" (Early Fusion) scheme.
It should be understood that, in the prior art, the technical feature adopted for synthesizing video frames through a 2D network is to connect the input T video frames in series in the channel dimension, and extend the number of channels input by the network from 3 to 3 × T, so that it is only necessary to modify the filter of the first convolutional layer in the convolutional block to a version supporting 3 × T channels, and information in the video frame sequence can be mined directly through the cascade 2D-CNN. However, it can be found in practice that this scheme of direct "early fusion" at the pixel level has difficulty in capturing large motion information and complex temporal relationships. Therefore, in contrast, the present embodiment innovatively proposes a 2D +3D fusion design solution, which is a method for implementing time dimension information fusion by combining 3D and 2D computing units, and can regard T input video frames as a three-dimensional cube with size T × H × W, learn low-level visual motion information in time and space dimensions using multiple 3D convolution modules, and then further learn high-level feature expressions by using a time-average pooling operation and multiple 2D convolution modules to help the synthesis of video frames
In a specific implementation, six different CNN structures are considered for the HSTCNN in the FSN. In Plain2D-15, the temporal information is "early fused" and followed by cascaded 2D convolutions; Res2D-15 replaces the Plain2D blocks with Res2D blocks in stages 2 to 4; Res(3D+2D)-15 adopts the 3D+2D fusion scheme and uses Plain3D and Res3D blocks in the first and second stages, respectively. Plain2D-29, Res2D-29 and Res(3D+2D)-29 are obtained by adding extra computation units in stages 2, 3 and 4. The specific structural configurations of these networks are summarized in Table 1 below.
TABLE 1 Specific structural configuration of the networks
Referring to Table 1, "3 × 3, 32" denotes a 2D convolution whose kernel height and width are both 3 and whose number of convolution kernels (channels) is 32; "3 × 3 × 3, 32" denotes a 3D convolution whose kernel length, width and height are all 3 with 32 convolution kernels; "[3 × 3, 128]" characterizes a convolution unit, and "2/4" indicates that the convolution unit is cascaded 2 or 4 times. With the help of the HSTCNN, this embodiment can obtain a good spatio-temporal feature expression of the input video frame sequence with controlled computational overhead. However, unlike traditional tasks such as video classification and action recognition, which only require the video features to be rich in semantic information, the adaptive-convolution-based video frame synthesis task requires not only that the video features obtained by the network contain sufficient semantic information to represent the high-level semantic content of the video, but also that the features have sufficiently accurate spatial localization capability, so that high-quality convolution kernel parameters for the adaptive convolution can be further estimated from them.
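Because Table 1 is only available as an image, the six variants can at best be paraphrased. The dictionary below records only what the surrounding text states (block types per stage and the "-15" versus "-29" depth relation); the exact per-stage unit counts and channel widths from Table 1 are not reproduced, and the keys and layout are assumptions for illustration.

```python
# Illustrative summary of the six HSTCNN variants compared in the text.
HSTCNN_VARIANTS = {
    "Plain2D-15":    {"stage1": "early fusion + Plain2D", "stages2-4": "Plain2D"},
    "Res2D-15":      {"stage1": "early fusion + Plain2D", "stages2-4": "Res2D"},
    # Stages 3-4 of the 3D+2D variant are assumed here to use 2D blocks.
    "Res(3D+2D)-15": {"stage1": "Plain3D", "stage2": "Res3D", "stages3-4": "Res2D (assumed)"},
    # The -29 variants add extra computation units in stages 2, 3 and 4.
    "Plain2D-29":    {"base": "Plain2D-15",    "extra_units_in": ("stage2", "stage3", "stage4")},
    "Res2D-29":      {"base": "Res2D-15",      "extra_units_in": ("stage2", "stage3", "stage4")},
    "Res(3D+2D)-29": {"base": "Res(3D+2D)-15", "extra_units_in": ("stage2", "stage3", "stage4")},
}
```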
Generally, the richness of semantic information and the spatial localization accuracy of features in a convolutional neural network are contradictory: a shallow network, limited by the size of its receptive field, can only attend to small- and medium-scale local areas of the input image or video, such as edges, textures and colors, and can hardly reflect the semantic content of the video image at a global scale, but for the same reason its spatial localization accuracy is high; a deep network obtains a large receptive field through several pooling operations in the network and can therefore abstract semantic information over a large range, but its spatial localization accuracy is insufficient because of its large coverage. The most direct way to resolve this contradiction is to use networks of different depths and different scales in parallel to obtain features with different attribute tendencies from the input images/videos simultaneously, but this brings a huge overhead in computing resources.
To address the above defects, this embodiment adopts the multi-scale spatio-temporal feature fusion structure MSTA, which follows an idea similar to connecting networks of different depths in parallel: by multiplexing a relatively shallow network structure, networks of three different depths are fused into one network (i.e., the HSTCNN above), so that video features at different scales are obtained without consuming additional computing resources. The features output by the shallow network contain a large amount of noise-like local edge information; the middle-level network begins to focus on large-scale structural information, such as the contours of human bodies, roads, umbrellas or buildings; the deep network mainly reflects the regions closely related to the semantic content of the video (e.g., people running), and the local edge information almost disappears. In order to make full use of video features at different scales to obtain a good adaptive convolution kernel for the input video, and to obtain a video frame synthesis result that is reasonable (high-level features, semantic correctness), coherent (middle-level features, object coherence) and clear (shallow features, edge sharpness), the MSTA first scales the features of the three levels to a uniform spatial resolution (namely the spatial resolution of the shallow features), and after channel-dimension concatenation uses a group of extra convolutions to fuse the three features for use by the subsequent motion-blur-resistant adaptive convolution model.
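A minimal PyTorch-style sketch of the MSTA fusion just described is given below: the mid- and high-level features are rescaled to the shallow feature's resolution, concatenated along the channel dimension and fused by a small group of extra convolutions. Channel counts, the bilinear interpolation mode and the two-convolution fusion head are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMSTA(nn.Module):
    """Illustrative multi-scale spatio-temporal feature fusion (MSTA) module."""

    def __init__(self, ch_shallow: int, ch_mid: int, ch_high: int, ch_out: int):
        super().__init__()
        # A small group of extra convolutions that fuses the concatenated features.
        self.fuse = nn.Sequential(
            nn.Conv2d(ch_shallow + ch_mid + ch_high, ch_out, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch_out, ch_out, 3, padding=1),
        )

    def forward(self, shallow, mid, high):
        # Scale the mid- and high-level features to the shallow feature's spatial resolution.
        size = shallow.shape[-2:]
        mid = F.interpolate(mid, size=size, mode="bilinear", align_corners=False)
        high = F.interpolate(high, size=size, mode="bilinear", align_corners=False)
        # Merge along the channel dimension, then fuse with convolutions.
        return self.fuse(torch.cat([shallow, mid, high], dim=1))
```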
In this embodiment, a video frame sequence is input into the preset 3D convolution layer for convolution to obtain a 3D-convolved video frame sequence; a time domain pooling operation is performed on the 3D-convolved video frame sequence through the preset time domain pooling layer to obtain a video frame sequence whose time dimension is the preset dimension; and a convolution operation is performed on the video frame sequence with the preset dimension through the preset 2D convolution layer to obtain semantic features of the video frame sequence at different spatio-temporal scales. In this way, a high-quality video frame synthesis result can be obtained.
Referring to fig. 4, fig. 4 is a flowchart illustrating a video frame synthesizing method according to a third embodiment of the present invention.
Based on the foregoing embodiments, in this embodiment, the step S30 includes:
step 301: and performing self-adaptive convolution on the fused semantic features to obtain a self-adaptive convolution result.
It should be noted that performing adaptive convolution on the fused semantic features to obtain the adaptive convolution result may include: determining a vertical-direction motion convolution kernel and a horizontal-direction motion convolution kernel corresponding to the video frame sequence according to the fused semantic features; simulating a 2D convolution kernel from the vertical-direction and horizontal-direction motion convolution kernels; and determining the adaptive convolution result through the 2D convolution kernel.
Specifically, the vertical-direction and horizontal-direction motion convolution kernels corresponding to the video frame sequence may be established as preset 1D convolution kernels according to the fused semantic features; the 2D convolution kernel is then simulated in the adaptive convolution model from these two kernels, and the adaptive convolution result is determined through the simulated 2D convolution kernel. The adaptive convolution result may be the video frame synthesized by the preset adaptive convolution model.
In a specific implementation, the existing adaptive convolution modules used in classical optical-flow-based or motion-map-based video frame interpolation and extrapolation work often adopt a two-step scheme of motion estimation followed by pixel synthesis. Such schemes rely heavily on the result of motion estimation, which is not reliable. However, even the more advanced adaptive convolution route has certain drawbacks. The basic adaptive convolution method treats pixel synthesis directly as a local convolution operation on the input video frames, referred to herein as the single-step scheme. At each pixel location (x, y), the single-step approach estimates a 2D convolution kernel K_i(x, y) ∈ R^(s×s) for each input video frame and convolves this kernel with the 2D image block P_i(x, y) ∈ R^(s×s) of frame I_i centered at pixel (x, y). The output pixel value is the accumulation of the results of performing this convolution on each input frame. In this process, the convolution kernel K_i(x, y) captures both motion information and pixel resampling information, which are then used for synthesis. This elegant mechanism seamlessly embeds the task of video frame synthesis into a convolutional neural network and models the visual content changes caused by motion well. One limitation of this scheme lies in the 2D convolution kernels K_i(x, y): on the one hand, the quality requirements on the composite frame drive the convolution kernel to be large enough to contain sufficient information, while, limited by computational resources, parameter size and memory space, it is difficult for the method to estimate the corresponding convolution kernels for all pixels simultaneously; such a method can only estimate the 2D convolution kernels pixel by pixel.
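The single-step scheme can be stated compactly in code: for every output pixel, a predicted s × s kernel is applied to the s × s patch around the same location in each input frame, and the per-frame results are accumulated. The sketch below is an illustrative PyTorch rendering (tensor layouts, names and the assumption of an odd kernel size s are introduced here), not the implementation of any of the cited methods.

```python
import torch
import torch.nn.functional as F

def adaptive_conv_2d(frames: torch.Tensor, kernels: torch.Tensor) -> torch.Tensor:
    """Single-step adaptive convolution (illustrative).

    frames:  (B, N, C, H, W)   N input video frames
    kernels: (B, N, s*s, H, W) one s x s kernel K_i(x, y) per frame and per pixel (s odd)
    returns: (B, C, H, W)      synthesized frame
    """
    B, N, C, H, W = frames.shape
    s = int(kernels.shape[2] ** 0.5)
    out = frames.new_zeros(B, C, H, W)
    for i in range(N):
        # Extract the s x s patch P_i(x, y) around every pixel of frame i.
        patches = F.unfold(frames[:, i], kernel_size=s, padding=s // 2)   # (B, C*s*s, H*W)
        patches = patches.view(B, C, s * s, H, W)
        # Weight each patch by its per-pixel kernel and accumulate over frames.
        out = out + (patches * kernels[:, i].unsqueeze(1)).sum(dim=2)
    return out
```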
To solve the above problem, inspired by separable convolution, this embodiment employs a pair of 1D convolution kernels, one for modeling the vertical motion, K_(i,v)(x, y) ∈ R^(s×1), and one for modeling the horizontal motion, K_(i,h)(x, y) ∈ R^(1×s), thereby achieving the goal of simulating a 2D convolution kernel in the adaptive convolution stream. In particular, the adaptive convolution model includes a plurality of convolutional layers and upsampling layers that form a predictor for performing block-by-block dense convolution kernel prediction. Finally, the pixel (x, y) in the video frame Î synthesized by the adaptive convolution module is calculated by the following formula:

Î(x, y) = Σ_i [K_(i,v)(x, y) ⊗ K_(i,h)(x, y)] * P_i(x, y)

where * denotes the convolution operation, ⊗ denotes the outer product, Î denotes the video frame synthesized by the adaptive convolution module, P_i(x, y) denotes the 2D image block centered at pixel (x, y) in input frame I_i, K_i(x, y) denotes the simulated convolution kernel of size s × s, K_(i,v)(x, y) denotes the convolution kernel in the vertical direction, and K_(i,h)(x, y) denotes the convolution kernel in the horizontal direction.
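Assuming the reconstruction of the formula above is faithful, the synthesis step of the separable adaptive convolution stream can be sketched as follows in PyTorch: the outer product of the two predicted 1D kernels emulates the full s × s kernel, which is then applied to the patch around each pixel. Tensor layouts and names are assumptions made for illustration; the kernel-prediction network itself is omitted.

```python
import torch
import torch.nn.functional as F

def separable_adaptive_conv(frames: torch.Tensor, k_v: torch.Tensor,
                            k_h: torch.Tensor) -> torch.Tensor:
    """Illustrative separable adaptive convolution (ACS synthesis step).

    frames: (B, N, C, H, W)  N input frames
    k_v:    (B, N, s, H, W)  per-pixel vertical 1D kernels   K_(i,v)(x, y)
    k_h:    (B, N, s, H, W)  per-pixel horizontal 1D kernels K_(i,h)(x, y)   (s odd)
    returns (B, C, H, W)     frame synthesized by the adaptive convolution stream
    """
    B, N, C, H, W = frames.shape
    s = k_v.shape[2]
    out = frames.new_zeros(B, C, H, W)
    for i in range(N):
        patches = F.unfold(frames[:, i], kernel_size=s, padding=s // 2)    # (B, C*s*s, H*W)
        patches = patches.view(B, C, s, s, H, W)
        # Outer product of the two 1D kernels emulates the full s x s 2D kernel.
        k2d = torch.einsum("bvhw,buhw->bvuhw", k_v[:, i], k_h[:, i])       # (B, s, s, H, W)
        # Convolve every patch with its simulated 2D kernel and accumulate over frames.
        out = out + torch.einsum("bcvuhw,bvuhw->bchw", patches, k2d)
    return out
```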
Step S302: and performing motion blur correction on the fused semantic features to obtain a motion blur correction result.
It should be noted that the motion blur correction result may be obtained by performing motion blur correction on the fused semantic features through a preset motion blur correction model. The preset motion blur correction model may be a Motion-blur Correction Stream (MCS). Performing motion blur correction on the fused semantic features through the preset model may amount to applying a certain correction to the pixels with strong motion indicated by the fused semantic features.
Further, in order to obtain a high-quality composite video frame, the step S302 may include: determining the motion degree of each pixel in the video frame sequence based on the fused semantic features; determining pixels belonging to motion blur according to the motion degrees, and determining a motion mask of the pixels belonging to the motion blur; determining correction bias for correcting motion blur according to the fused semantic features; determining the motion blur correction result according to the motion mask and the correction bias.
It should be noted that the above steps, namely determining the motion degree of each pixel in the video frame sequence based on the fused semantic features, determining the pixels belonging to motion blur according to the motion degree and the motion mask of those pixels, determining the correction bias for correcting motion blur according to the fused semantic features, and determining the motion blur correction result from the motion mask and the correction bias, may all be carried out by the preset motion blur correction model.
Specifically, determining the motion degree of each pixel in the video frame sequence based on the fused semantic features may be performed through the preset motion blur correction model, which detects the motion regions in the video frames and obtains the motion degree of each pixel in those regions. Determining the motion mask of the pixels belonging to motion blur may consist of first determining an initial motion mask of the pixels with intense motion according to the motion degree of each pixel in the motion regions, and then normalizing the initial motion mask to obtain the motion mask; the normalization may map the initial motion mask into the (0, 1) interval through a sigmoid function. Determining the correction bias for correcting motion blur according to the fused semantic features may be done by regressing, from the video features, a correction bias for the motion-blurred pixels.
Step S303: and superposing the self-adaptive convolution result and the motion blur correction result to obtain a video frame synthesis result.
It should be noted that superposing the adaptive convolution result and the motion blur correction result to obtain the video frame synthesis result may be performed according to the following formula:

I_out = Î + M_I ⊙ R_I

where ⊙ denotes element-by-element multiplication, I_out denotes the composite (synthesized) video frame, Î denotes the video frame synthesized by the adaptive convolution module, M_I denotes the motion mask, and R_I denotes the correction bias.
It should be understood that, given high-quality video features, the ACS described above can already provide good results for most of the synthesized video frame; however, for some pixels undergoing severe motion, synthesis with the ACS alone may still produce blurred results, because such pixels are difficult to obtain by adaptively convolving the pixels near the corresponding position in the input frames that can be covered by the adaptive convolution kernel at the current position.
In an embodiment, the result Î output by the ACS can already approximate the target composite video frame well in most areas; some correction is needed only for those pixels with more intense motion where the approximation is poor. The embodiment chooses to estimate a residual value (the difference between the target frame and the ACS output) to indirectly determine the values of the pixels to be corrected, which helps to simplify the learning objective of the neural network and helps the solver optimize the model. The MCS has the same prediction structure as the ACS. Its output consists of two parts: 1. a motion mask M_I marking the motion-intense pixels, obtained by detecting the motion-intense regions in the video and normalized into the (0, 1) interval through a sigmoid function; 2. a correction bias R_I for correcting the motion-blurred pixels, obtained by regression from the video features. Although it would be possible to generate correction offsets covering the entire spatial extent of the video frame without using a mask, doing so may introduce additional noise into regions that are already nearly perfect and thus affect them adversely, so this approach uses the masking scheme described above. Finally, the video frame synthesized by the FSN is calculated by the following formula:

I_out = Î + M_I ⊙ R_I
In a specific implementation, the training and optimization of the FSN may proceed as follows. To measure the difference between the synthesized video frame I_out and the ground-truth frame I_gt, a simple L1-norm loss is applied pixel by pixel on each color channel, denoted L_1(I_out, I_gt). In the FSN, the ground truth I_gt additionally constrains the output Î of the adaptive convolution stream, which implicitly constitutes a constraint for residual learning. The final loss function L_FSN can thus be expressed as:

L_FSN = L_1(I_out, I_gt) + λ · L_1(Î, I_gt)
where λ is a hyper-parameter used to balance the two loss terms and can be set to 1 in this embodiment. The code is implemented with the Caffe deep learning framework. During training, the network parameters are optimized with the Adam optimizer, with β1 = 0.9, β2 = 0.999, the learning rate initialized to 0.0001 and the batch size set to 128; the whole training process requires 50,000 optimization iterations.
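Assuming the loss reconstructed above is correct, the loss and optimizer setup quoted in this paragraph could be sketched as follows. The snippet is PyTorch-style for readability, whereas the patent reports a Caffe implementation, so everything here is illustrative.

```python
import torch
import torch.nn.functional as F

def fsn_loss(pred_final: torch.Tensor, pred_acs: torch.Tensor,
             target: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """L_FSN = L1(final output, ground truth) + lambda * L1(ACS output, ground truth)."""
    return F.l1_loss(pred_final, target) + lam * F.l1_loss(pred_acs, target)

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Settings quoted from the text: Adam with beta1 = 0.9, beta2 = 0.999 and
    # learning rate 1e-4; the batch size is 128 and training runs about 50,000 iterations.
    return torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
```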
Experiments are carried out on two video data sets, UCF-101 and Kinetics-HD, with MAE, RMSE, PSNR and SSIM adopted as performance evaluation indexes, which demonstrates the feasibility and performance advantages of the proposed video frame synthesis. First, the video frame synthesis performance of different spatio-temporal convolutional neural network structures is compared experimentally. Table 2 below summarizes the performance (PSNR) of the video frame interpolation and extrapolation synthesis tasks using different network architectures on the Kinetics-HD data set.
TABLE 2 Performance of different network architectures on the video frame interpolation and extrapolation synthesis tasks
As can be seen from Table 2, in which Architecture characterizes the network structure, #param the number of parameters of the network, Inter the interpolation evaluation score and Extra the extrapolation evaluation score, the Res(3D+2D)-29 structure obtains both a higher interpolation score and a higher extrapolation score than the other structures. Therefore, compared with the "early fusion" scheme using Plain2D or Res2D computation blocks, the CNN structure with 3D+2D fusion achieves better frame synthesis performance, which indicates that learning the spatio-temporal feature expression of the video by combining 3D and 2D computation blocks has a great advantage; meanwhile, the 3D+2D fusion scheme only slightly increases the number of parameters. With its deeper architecture, Res(3D+2D)-29 performs better than Res(3D+2D)-15. Unless otherwise noted, Res(3D+2D)-29 is used as the CNN architecture of the FSN in the following evaluations. Next, the FSN proposed in this embodiment is compared with other baseline methods on the two data sets. The Average method directly takes the average of the input frames as the interpolation result; TVL1 refers to a video frame synthesis method based on optical flow estimation; AdaConv, SepConv and DVF are all reference methods based on convolutional neural networks.
Table 3 below summarizes the performance of each method on the video frame interpolation synthesis task; the comparison shows that the FSN of the present application has obvious performance advantages on both evaluation data sets.
TABLE 3 Performance of the methods on the video frame interpolation synthesis task
Table 4 below summarizes the performance comparison on the video frame extrapolation task, which likewise verifies the advantages of the FSN method.
TABLE 4 Performance comparison on the video frame extrapolation task
In this embodiment, a 2D convolution kernel is simulated in a preset adaptive convolution model according to the fused semantic features, and an adaptive convolution result is determined through the 2D convolution kernel; motion blur correction is performed on the fused semantic features through a preset motion blur correction model to obtain a motion blur correction result; and the adaptive convolution result and the motion blur correction result are superimposed to obtain the video frame synthesis result. In this way, the spatio-temporal information in the video can be better modeled to help obtain a better video frame synthesis result, and the designed network structure is more efficient with more controllable resource overhead.
Referring to fig. 5, fig. 5 is a block diagram illustrating a video frame synthesizing apparatus according to a first embodiment of the present invention.
As shown in fig. 5, the video frame synthesizing apparatus according to the embodiment of the present invention includes:
a semantic feature determining module 10, configured to input a video frame sequence to a preset hybrid spatio-temporal convolutional network, so as to obtain semantic features of the video frame sequence at different spatio-temporal scales;
a fusion module 20, configured to perform feature fusion on the semantic features to obtain fused semantic features;
and the synthesis module 30 is configured to determine a video frame synthesis result according to the fused semantic features.
In this embodiment, a video frame sequence is input to a preset hybrid spatio-temporal convolutional network to obtain semantic features of the video frame sequence at different spatio-temporal scales; feature fusion is performed on the semantic features to obtain fused semantic features; and a video frame synthesis result is determined according to the fused semantic features. Compared with the existing approach of densely estimating the motion between given video frames to synthesize video frames, this embodiment does not rely on optical flow to guide video frame synthesis, but determines the video frame synthesis result by fusing the semantic features of the video frames at different spatio-temporal scales, which avoids obvious artifacts in the synthesized video and improves the quality of the synthesized video frames.
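As a hedged illustration, the sketch below shows one way the three modules of Fig. 5 could be wired into a single forward pass. All class names, argument names and tensor shapes are assumptions for illustration rather than the actual implementation of the apparatus.

```python
import torch
import torch.nn as nn

class VideoFrameSynthesizer(nn.Module):
    """Wires the three modules of the apparatus into one forward pass."""

    def __init__(self, feature_net: nn.Module, fusion_net: nn.Module, synthesis_net: nn.Module):
        super().__init__()
        self.feature_net = feature_net      # semantic feature determining module 10
        self.fusion_net = fusion_net        # fusion module 20
        self.synthesis_net = synthesis_net  # synthesis module 30

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, channels, time, height, width)
        features = self.feature_net(frames)       # semantic features at different spatio-temporal scales
        fused = self.fusion_net(features)         # fused semantic features
        return self.synthesis_net(fused, frames)  # video frame synthesis result
```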
It should be noted that the above-described work flows are only illustrative, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them according to actual needs to implement the purpose of the solution of the embodiment, and the present invention is not limited herein.
In addition, the technical details that are not described in detail in this embodiment may refer to the video frame synthesis method provided in any embodiment of the present invention, and are not described herein again.
A second embodiment of the video frame composition apparatus of the present invention is proposed based on the first embodiment described above.
In this embodiment, the semantic feature determining module 10 is further configured to input a video frame sequence to the preset 3D convolution layer for convolution operation, so as to obtain a 3D convolved video frame sequence;
performing time domain pooling operation on the video frame sequence after the 3D convolution through the preset time domain pooling layer to obtain a video frame sequence with a time dimension being a preset dimension;
and performing convolution operation on the video frame sequence with the preset dimensionality to obtain semantic features of the video frame sequence at different spatio-temporal scales.
Further, the semantic feature determining module 10 is further configured to perform convolution operation on the video frame sequence with the preset dimensionality through the preset 2D convolution layer, so as to obtain the semantic features of the video frame sequence at different spatio-temporal scales.
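A minimal sketch of such a hybrid spatio-temporal feature extractor follows, assuming a single 3D convolution layer, a temporal pooling layer that collapses the time dimension to a preset size of 1, and strided 2D convolutions that yield features at three spatial scales. Channel counts, kernel sizes and the number of scales are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridSpatioTemporalNet(nn.Module):
    """3D convolution -> temporal pooling -> 2D convolutions at several spatial scales."""

    def __init__(self, in_channels: int = 3, base_channels: int = 64):
        super().__init__()
        # 3D convolution over (time, height, width)
        self.conv3d = nn.Conv3d(in_channels, base_channels,
                                kernel_size=(3, 3, 3), padding=(1, 1, 1))
        # temporal pooling: collapse the time dimension to a preset size (here 1)
        self.temporal_pool = nn.AdaptiveAvgPool3d((1, None, None))
        # 2D convolutions at progressively smaller spatial resolutions
        self.conv2d_1 = nn.Conv2d(base_channels, base_channels, 3, stride=1, padding=1)
        self.conv2d_2 = nn.Conv2d(base_channels, base_channels * 2, 3, stride=2, padding=1)
        self.conv2d_3 = nn.Conv2d(base_channels * 2, base_channels * 4, 3, stride=2, padding=1)

    def forward(self, frames: torch.Tensor):
        # frames: (batch, channels, time, height, width)
        x = torch.relu(self.conv3d(frames))
        x = self.temporal_pool(x).squeeze(2)  # -> (batch, channels, height, width)
        f1 = torch.relu(self.conv2d_1(x))     # full resolution
        f2 = torch.relu(self.conv2d_2(f1))    # 1/2 resolution
        f3 = torch.relu(self.conv2d_3(f2))    # 1/4 resolution
        return [f1, f2, f3]                   # semantic features at different scales
```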
Further, the synthesis module 30 is further configured to perform adaptive convolution on the fused semantic features to obtain an adaptive convolution result;
performing motion blur correction on the fused semantic features to obtain a motion blur correction result;
and superposing the adaptive convolution result and the motion blur correction result to obtain a video frame synthesis result.
Further, the synthesizing module 30 is further configured to determine a degree of motion of each pixel in the video frame sequence based on the fused semantic features;
determining pixels belonging to motion blur according to the motion degree, and determining a motion mask of the pixels belonging to motion blur;
determining correction bias for correcting motion blur according to the fused semantic features;
and determining the motion blur correction result according to the motion mask and the correction bias.
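A hedged sketch of such a motion blur correction branch is given below: one convolution head predicts a per-pixel degree of motion that is turned into a soft motion mask, another predicts the correction bias, and their product gives the correction result. The layer choices and the use of a sigmoid for the mask are assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class MotionBlurCorrection(nn.Module):
    """Predicts a motion mask and a correction bias from the fused semantic features."""

    def __init__(self, in_channels: int, out_channels: int = 3):
        super().__init__()
        self.motion_head = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)
        self.bias_head = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        motion_degree = self.motion_head(fused_features)  # per-pixel degree of motion
        motion_mask = torch.sigmoid(motion_degree)        # soft mask of motion-blurred pixels
        correction_bias = self.bias_head(fused_features)  # correction bias for blurred regions
        return motion_mask * correction_bias              # motion blur correction result
```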
Further, the synthesis module 30 is further configured to determine a vertical direction motion convolution kernel and a horizontal direction motion convolution kernel corresponding to the video frame sequence according to the fused semantic features; and simulating a 2D convolution kernel according to the vertical direction motion convolution kernel and the horizontal direction motion convolution kernel, and determining an adaptive convolution result through the 2D convolution kernel.
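The sketch below shows one way such a separable adaptive convolution could be realized: the fused features predict a vertical and a horizontal 1D kernel for every pixel, the outer product of the two simulates a per-pixel 2D kernel, and that kernel is applied to the local neighbourhood of the input frame. The kernel size, the softmax normalization and the layer shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableAdaptiveConv(nn.Module):
    """Per-pixel 2D kernel simulated from a vertical and a horizontal 1D kernel."""

    def __init__(self, in_channels: int, kernel_size: int = 5):
        super().__init__()
        self.k = kernel_size
        self.vertical = nn.Conv2d(in_channels, kernel_size, 3, padding=1)    # per-pixel vertical kernel
        self.horizontal = nn.Conv2d(in_channels, kernel_size, 3, padding=1)  # per-pixel horizontal kernel

    def forward(self, fused_features: torch.Tensor, frame: torch.Tensor) -> torch.Tensor:
        # frame: (batch, 3, H, W); fused_features: (batch, C, H, W)
        b, c, h, w = frame.shape
        kv = F.softmax(self.vertical(fused_features), dim=1)    # (b, k, H, W)
        kh = F.softmax(self.horizontal(fused_features), dim=1)  # (b, k, H, W)
        # gather the k x k neighbourhood of every pixel
        patches = F.unfold(frame, self.k, padding=self.k // 2)  # (b, c*k*k, H*W)
        patches = patches.view(b, c, self.k, self.k, h, w)
        # outer product of the two 1D kernels simulates a 2D kernel per pixel
        kernel2d = kv.unsqueeze(2) * kh.unsqueeze(1)             # (b, k, k, H, W)
        out = (patches * kernel2d.unsqueeze(1)).sum(dim=(2, 3))  # (b, c, H, W)
        return out
```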
Further, the fusion module 20 is further configured to scale the semantic features to a target spatial resolution through a preset multi-scale spatio-temporal feature fusion model; merge the channel dimensions of the scaled semantic features to obtain the semantic features to be fused; and fuse the semantic features to be fused to obtain the fused semantic features.
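A minimal sketch of such a multi-scale fusion step follows: each scale's features are resized to the target spatial resolution, concatenated along the channel dimension, and fused by a convolution. The channel counts and the bilinear resizing are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Resize features from every scale to a common resolution and fuse them."""

    def __init__(self, in_channels_list, fused_channels: int = 128):
        super().__init__()
        self.fuse = nn.Conv2d(sum(in_channels_list), fused_channels, kernel_size=3, padding=1)

    def forward(self, features, target_size) -> torch.Tensor:
        # features: list of (batch, C_i, H_i, W_i) tensors at different scales
        resized = [F.interpolate(f, size=target_size, mode='bilinear',
                                 align_corners=False) for f in features]
        stacked = torch.cat(resized, dim=1)  # merge along the channel dimension
        return self.fuse(stacked)            # fused semantic features
```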
Other embodiments or specific implementations of the video frame synthesis apparatus of the present invention may refer to the above embodiments of the methods, and are not described herein again.
Furthermore, an embodiment of the present invention further provides a storage medium, on which a video frame composition program is stored, and the video frame composition program, when executed by a processor, implements the steps of the video frame composition method as described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., a ROM/RAM, a magnetic disk, or an optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the present specification and drawings, or used directly or indirectly in other related fields, are included in the scope of the present invention.

Claims (10)

1. A video frame composition method, comprising the steps of:
inputting a video frame sequence into a preset mixed space-time convolutional network to obtain semantic features of the video frame sequence under different space-time scales;
performing feature fusion on the semantic features to obtain fused semantic features;
and determining a video frame synthesis result according to the fused semantic features.
2. The video frame synthesis method of claim 1, wherein the pre-defined hybrid spatio-temporal convolutional network comprises: presetting a 3D convolution layer and a time domain pooling layer;
the step of inputting the video frame sequence into a preset mixed spatio-temporal convolutional network to obtain semantic features of the video frame sequence at different spatio-temporal scales comprises:
inputting a video frame sequence to the preset 3D convolution layer for convolution operation to obtain a video frame sequence after 3D convolution;
performing time domain pooling operation on the video frame sequence after the 3D convolution through the preset time domain pooling layer to obtain a video frame sequence with a time dimension being a preset dimension;
and performing convolution operation on the video frame sequence with the preset dimensionality to obtain semantic features of the video frame sequence under different space-time scales.
3. The video frame synthesis method of claim 2, wherein the predetermined hybrid spatio-temporal convolutional network further comprises: presetting a 2D convolution layer;
performing convolution operation on the video frame sequence with the preset dimensionality to obtain semantic features of the video frame sequence under different space-time scales, wherein the method comprises the following steps of:
and performing convolution operation on the video frame sequence with the preset dimensionality through the preset 2D convolution layer to obtain semantic features of the video frame sequence under different space-time scales.
4. The video frame synthesis method according to claim 1, wherein the step of determining the video frame synthesis result according to the fused semantic features comprises:
performing adaptive convolution on the fused semantic features to obtain an adaptive convolution result;
performing motion blur correction on the fused semantic features to obtain a motion blur correction result;
and superposing the adaptive convolution result and the motion blur correction result to obtain a video frame synthesis result.
5. The video frame synthesis method according to claim 4, wherein the step of performing motion blur correction on the fused semantic features to obtain a motion blur correction result comprises:
determining the motion degree of each pixel in the video frame sequence based on the fused semantic features;
determining pixels belonging to motion blur according to the motion degree, and determining a motion mask of the pixels belonging to motion blur;
determining correction bias for correcting motion blur according to the fused semantic features;
and determining the motion blur correction result according to the motion mask and the correction bias.
6. The method for synthesizing video frames according to claim 4, wherein the step of performing adaptive convolution on the fused semantic features to obtain an adaptive convolution result comprises:
determining a vertical direction motion convolution kernel and a horizontal direction motion convolution kernel corresponding to the video frame sequence according to the fused semantic features;
and simulating a 2D convolution kernel according to the vertical direction motion convolution kernel and the horizontal direction motion convolution kernel, and determining an adaptive convolution result through the 2D convolution kernel.
7. The video frame synthesis method according to any one of claims 1 to 6, wherein the step of performing feature fusion on the semantic features to obtain fused semantic features comprises:
scaling the semantic features to a target spatial resolution through a preset multi-scale space-time feature fusion model;
merging the channel dimensions of each scaled semantic feature to obtain semantic features to be fused;
and fusing the semantic features to be fused to obtain fused semantic features.
8. A video frame synthesizing apparatus, characterized in that the video frame synthesizing apparatus comprises:
the semantic feature determination module is used for inputting the video frame sequence into a preset mixed space-time convolution network so as to obtain semantic features of the video frame sequence under different space-time scales;
the fusion module is used for carrying out feature fusion on the semantic features to obtain fused semantic features;
and the synthesis module is used for determining a video frame synthesis result according to the fused semantic features.
9. A video frame composition apparatus, characterized in that the apparatus comprises: a memory, a processor and a video frame composition program stored on the memory and executable on the processor, the video frame composition program being configured to implement the steps of the video frame composition method of any of claims 1 to 7.
10. A storage medium having stored thereon a video frame composition program which, when executed by a processor, implements the steps of the video frame composition method of any of claims 1 to 7.
CN202210547819.4A 2022-05-12 2022-05-12 Video frame synthesis method, device, equipment and storage medium Pending CN114882416A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210547819.4A CN114882416A (en) 2022-05-12 2022-05-12 Video frame synthesis method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210547819.4A CN114882416A (en) 2022-05-12 2022-05-12 Video frame synthesis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114882416A true CN114882416A (en) 2022-08-09

Family

ID=82677291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210547819.4A Pending CN114882416A (en) 2022-05-12 2022-05-12 Video frame synthesis method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114882416A (en)

Similar Documents

Publication Publication Date Title
US10970600B2 (en) Method and apparatus for training neural network model used for image processing, and storage medium
Bao et al. Memc-net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement
CN110324664B (en) Video frame supplementing method based on neural network and training method of model thereof
US20200349680A1 (en) Image processing method and device, storage medium and electronic device
CN110782490B (en) Video depth map estimation method and device with space-time consistency
Li et al. Video super-resolution using an adaptive superpixel-guided auto-regressive model
Dai et al. Sparse representation-based multiple frame video super-resolution
Li et al. Video super-resolution using non-simultaneous fully recurrent convolutional network
CN110751649A (en) Video quality evaluation method and device, electronic equipment and storage medium
CN109903315B (en) Method, apparatus, device and readable storage medium for optical flow prediction
CN114339030B (en) Network live video image stabilizing method based on self-adaptive separable convolution
CN113542651A (en) Model training method, video frame interpolation method and corresponding device
JP5669523B2 (en) Frame interpolation apparatus and method, program, and recording medium
EP3298575B1 (en) Super resolution using fidelity transfer
CN113077505A (en) Optimization method of monocular depth estimation network based on contrast learning
Shimano et al. Video temporal super-resolution based on self-similarity
CN114842400A (en) Video frame generation method and system based on residual block and feature pyramid
CN112200817A (en) Sky region segmentation and special effect processing method, device and equipment based on image
CN110443754B (en) Method for improving resolution of digital image
US20230060988A1 (en) Image processing device and method
CN114758282B (en) Video prediction method based on time sequence correction convolution
CN114882416A (en) Video frame synthesis method, device, equipment and storage medium
Yeh et al. VDNet: video deinterlacing network based on coarse adaptive module and deformable recurrent residual network
CN115410133A (en) Video dense prediction method and device
CN115311145A (en) Image processing method and device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination