CN112215140A - 3-dimensional signal processing method based on space-time countermeasure - Google Patents

3-dimensional signal processing method based on space-time countermeasure

Info

Publication number
CN112215140A
CN112215140A (application CN202011083124.2A)
Authority
CN
China
Prior art keywords
space
time
frames
resolution
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011083124.2A
Other languages
Chinese (zh)
Inventor
侯兴松
李瑞敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Tianbiyou Technology Co ltd
Original Assignee
Suzhou Tianbiyou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Tianbiyou Technology Co ltd
Priority to CN202011083124.2A
Publication of CN112215140A
Legal status: Pending (current)

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/01 Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N 7/0117 Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level involving conversion of the spatial resolution of the incoming video signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Graphics (AREA)
  • Signal Processing (AREA)
  • Television Systems (AREA)

Abstract

The invention discloses a 3-dimensional signal processing method based on space-time countermeasure, in which a spatio-temporal adversarial network comprises a recurrent generator, an optical flow estimation network and a spatio-temporal discriminator: the recurrent generator recursively produces high-resolution video frames from a low-resolution input; the optical flow estimation network learns the motion compensation between frames; and the spatio-temporal discriminator considers both spatial and temporal aspects and penalizes unrealistic temporal discontinuities in the results without over-smoothing the image content. The invention resolves the marked loss of visual quality under diverse and blurred motion in video super-resolution, while fully exploiting the temporal information in the video to ensure the spatio-temporal consistency of the super-resolved video.

Description

3-dimensional signal processing method based on space-time countermeasure
Technical Field
The invention relates to the field of video super-resolution, in particular to a video super-resolution method based on a space-time countermeasure network.
Background
The spatial resolution of a video depends on the spatial density of the image sensor, motion, system noise and other factors. The temporal resolution of a video depends on the frame rate and exposure time of the camera. When the temporal resolution is low, the video may exhibit motion blur and motion aliasing. In recent years, with the application and development of deep learning in computer vision, CNN-based video object detection and action recognition have made remarkable progress. However, most neural networks for object detection and action recognition are trained on high-resolution video, so directly applying the trained networks to low-resolution video yields unsatisfactory results and significantly degraded performance. In aerial photography and remote-sensing video, targets are often small and difficult to detect, especially at low resolution. One possible solution is to super-resolve the video before detection and recognition.
Early studies on video super-resolution reconstruction often treated it as a simple extension of image super-resolution reconstruction, so the temporal redundancy between adjacent frames was not fully exploited. Previous multi-frame/video super-resolution methods are mainly reconstruction-based and exploit inter-frame coherence. Most are built on a Bayesian framework and use optical flow techniques for sub-pixel motion estimation. These methods can guarantee high fidelity in the presence of small global motion, but they tend to fail when the motion is larger and more complex.
In recent years, research that combines the representational power of deep learning with inter-frame temporal consistency to improve visual quality and fidelity has also achieved certain results. To capture temporal consistency, most existing methods use a sliding frame window and generate one high-resolution frame from several low-resolution frames as input. To process spatio-temporal information simultaneously, existing methods typically employ temporal fusion techniques such as motion compensation, bidirectional recurrent convolutional networks (BRCN) and LSTM. Three-dimensional convolution (C3D) shows excellent performance in video learning, and some researchers have improved BRCN using C3D so that the model can flexibly access different temporal contexts in a natural way, but the network is still shallow. As for the loss function, existing video super-resolution methods still mainly use standard losses such as mean squared error; work on photo-realistic video super-resolution has proposed adversarial losses, but mainly with a purely spatial discriminator.
Disclosure of Invention
The invention aims to provide a 3-dimensional signal processing method based on space-time countermeasure (a video super-resolution method based on a spatio-temporal adversarial network) that resolves the marked loss of visual quality under diverse and blurred motion in video super-resolution, while fully exploiting the temporal information in the video to ensure the spatio-temporal consistency of the super-resolved video.
In order to achieve this purpose, the technical scheme of the invention is to design a 3-dimensional signal processing method based on space-time countermeasure (a video super-resolution method based on a spatio-temporal adversarial network) and to construct a spatio-temporal adversarial network comprising a recurrent generator G, an optical flow estimation network F and a spatio-temporal discriminator D(s,t);
the recurrent generator G recursively generates high-resolution video frames from a low-resolution input; the optical flow estimation network F learns the motion compensation between frames; the spatio-temporal discriminator, the core component of the method, considers both spatial and temporal aspects and penalizes unrealistic temporal discontinuities in the results without over-smoothing the image content;
the recurrent generator is based on a recurrent convolutional network coupled with the optical flow estimation network F; the recurrent generator produces a high-resolution (HR) output g_t from the low-resolution (LR) frame x_t and recursively reuses the previously generated HR output g_{t-1}; the recurrent generator learns only residual information, which is then added to the bicubically interpolated low-resolution input;
the spatio-temporal discriminator receives two sets of inputs, ground-truth frames and generated frames; the two sets have the same structure, each comprising three adjacent HR frames, three correspondingly upsampled LR frames, and three warped HR frames; training through the loss function provides the recurrent generator with gradient information on the realism of spatial detail and of temporal changes; by taking both spatial and temporal inputs into account, the spatio-temporal discriminator D(s,t) automatically balances spatial and temporal aspects, avoiding inconsistent sharpness and overly smooth results.
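As an illustration only, the two discriminator input stacks described above could be assembled as follows (a minimal PyTorch sketch; the channel-wise concatenation, the tensor shapes and the function name are assumptions, since the patent does not fix a concrete layout):

```python
import torch

def make_discriminator_input(hr_triplet, lr_up_triplet, warped_triplet):
    """Stack three HR frames, three upsampled LR frames and three warped
    HR frames along the channel axis: (B, 3 groups * 3 frames * 3 ch, H, W)."""
    return torch.cat(hr_triplet + lr_up_triplet + warped_triplet, dim=1)

# The ground-truth stack and the generated stack share the same structure:
B, H, W = 2, 128, 128
triplet = lambda: [torch.rand(B, 3, H, W) for _ in range(3)]
real_input = make_discriminator_input(triplet(), triplet(), triplet())
fake_input = make_discriminator_input(triplet(), triplet(), triplet())
assert real_input.shape == fake_input.shape == (B, 27, H, W)
```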
Preferably, the 3-dimensional signal processing method based on the spatio-temporal countermeasure comprises the following steps (one recurrent step is sketched in code after the list):
1) the previous frame x_{t-1} and the current frame x_t are input into the optical flow estimation network F to generate motion information v_t;
2) v_t is upscaled by bicubic interpolation;
3) the upscaled v_t is fused with the high-resolution frame g_{t-1} obtained by super-resolving the previous frame, giving the warped result w(g_{t-1}, v_t);
4) w(g_{t-1}, v_t) and x_t are input into the recurrent generator network G, which learns the residual r_t between them;
5) x_t is bicubically interpolated and the residual learned by the recurrent generator network is added to it to generate the high-resolution frame g_t;
6) the spatio-temporal discriminator receives two sets of inputs, real frames and generated frames; the two sets have the same structure, each comprising three adjacent HR frames, three corresponding bicubically interpolated LR frames, and three warped HR frames;
7) the generator is penalized by the discriminator.
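Steps 1)-5) can be sketched as a single recurrent step (a minimal PyTorch sketch, not the patent's implementation; F and G are placeholder networks, and the flow-to-sampling-grid conversion follows the common grid_sample convention, which is an assumption):

```python
import torch
import torch.nn.functional as nnf

def warp(frame, flow):
    """Warp an HR frame with a dense flow field of shape (B, 2, H, W)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device), indexing="ij")
    abs_x = xs.unsqueeze(0) + flow[:, 0]            # absolute x sampling positions
    abs_y = ys.unsqueeze(0) + flow[:, 1]            # absolute y sampling positions
    norm_x = 2.0 * abs_x / (w - 1) - 1.0            # normalise to [-1, 1]
    norm_y = 2.0 * abs_y / (h - 1) - 1.0
    grid = torch.stack((norm_x, norm_y), dim=-1)    # (B, H, W, 2) for grid_sample
    return nnf.grid_sample(frame, grid, align_corners=True)

def sr_step(x_prev, x_t, g_prev, F, G, scale=4):
    v_t = F(x_prev, x_t)                                         # step 1
    v_up = scale * nnf.interpolate(v_t, scale_factor=scale,      # step 2
                                   mode="bicubic", align_corners=False)
    w_g = warp(g_prev, v_up)                                     # step 3
    r_t = G(w_g, x_t)                                            # step 4: residual only
    x_up = nnf.interpolate(x_t, scale_factor=scale,
                           mode="bicubic", align_corners=False)
    return x_up + r_t                                            # step 5
```

Note that the flow values are multiplied by the scale factor when upscaled, so that displacements measured in LR pixel units remain valid in HR pixel units.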
Preferably, a PP loss function is applied for training; the spatio-temporal adversarial PP-Loss successfully eliminates drifting artifacts while retaining appropriate high-frequency details; furthermore, such a loss construction effectively increases the size of the training data set and thus also represents a useful form of data augmentation.
[Formula image in the original: definition of the PP loss]
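The source renders this formula only as an image. For orientation, in the TecoGAN work cited against this patent the PP ("ping-pong") loss is usually written as below; this is an assumption about the intended formula, not a transcription of the image:

```latex
% The clip is super-resolved forward, giving g_1..g_n, and again on the
% time-reversed clip, giving g'_n..g'_1; matching frames from the two
% passes are pulled together, which suppresses drifting artifacts.
\mathcal{L}_{pp} = \sum_{t=1}^{n-1} \left\lVert g_t - g'_t \right\rVert_2
```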
Preferably, a PCD alignment module and a TSA fusion module are added; the PCD module is an alignment module combining a pyramid structure, a cascade structure and deformable convolution, in which frames are aligned at the feature level by progressive refinement with deformable convolution; to effectively fuse different frames under diverse motion and blur conditions, a spatio-temporal attention fusion module is added: in the TSA fusion module, the attention mechanism is applied in both time and space to emphasize the features important for subsequent restoration.
The invention provides a 3-dimensional signal processing method based on space-time countermeasure that resolves the marked loss of visual quality under diverse and blurred motion in video super-resolution, while fully exploiting the temporal information in the video to ensure the spatio-temporal consistency of the super-resolved video.
Compared with the prior art, the invention has the following beneficial technical effects:
the invention provides a space-time discriminator, which is different from the simple mean square error loss adopted in a video super-resolution method based on a deep neural network, and the scheme of the invention provides the space-time countermeasure loss and considers the inconsistency of time and space; the scheme of the invention adopts the GAN network and learns the spatiotemporal information of the video through the antagonism training of the generator and the discriminator.
Further, to address how to align multiple frames under large motion and how to effectively fuse different frames under diverse motion and blur conditions, the PCD alignment module and the TSA fusion module are added to the generator, so that optical flow information is estimated more accurately and fusion produces super-resolved frames with higher spatio-temporal consistency.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is a network framework diagram of the present invention;
FIG. 3 is a PCD module frame diagram.
Detailed Description
The embodiments of the present invention are described below with reference to the accompanying drawings. The following examples serve only to illustrate the technical solutions of the invention more clearly and do not limit its scope of protection.
As shown in fig. 1 to 3, the technical solution of the present invention is as follows:
1. Generator: based on a recurrent convolutional network coupled with the optical flow estimation network F; the generator produces a high-resolution (HR) output g_t from the low-resolution (LR) frame x_t and recursively reuses the previously generated HR output g_{t-1}; the generator learns only residual information, which is then added to the bicubically interpolated low-resolution input.
2. Discriminator: receives two sets of inputs, ground-truth frames and generated frames; the two sets have the same structure, each comprising three adjacent HR frames, three correspondingly upsampled LR frames, and three warped HR frames.
3. The generator network and the adversarial network in the invention adopt the VGG-19 network framework.
4. The optical flow estimation network F is augmented with a PCD module; the PCD alignment module shown in FIG. 3 aligns the features of each frame using deformable convolution (sketched below). In FIG. 3, t is the reference frame and (t+i) an adjacent frame; as indicated by the black dotted lines, the features of the l-th layer of frame (t+i) are obtained by downsampling the features of layer (l-1); the gray dashed lines indicate that the offsets and aligned features of the l-th layer are predicted from the upsampled offsets and aligned features of layer (l+1); the gray-background part of the figure indicates that, after the pyramid structure, a cascaded alignment stage with deformable convolution further refines the aligned features.
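A single pyramid level of this deformable alignment can be sketched as follows (PyTorch with torchvision's DeformConv2d; the offset predictor and channel sizes are illustrative assumptions, and the full PCD module stacks this over a feature pyramid plus a cascaded refinement stage):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformAlign(nn.Module):
    """Align a neighbour frame's features to the reference frame's features."""
    def __init__(self, ch=64, k=3):
        super().__init__()
        # offsets are predicted from the concatenated neighbour/reference features;
        # a k x k deformable kernel needs 2 * k * k offset channels
        self.offset_conv = nn.Conv2d(2 * ch, 2 * k * k, 3, padding=1)
        self.deform_conv = DeformConv2d(ch, ch, k, padding=k // 2)

    def forward(self, neigh_feat, ref_feat):
        offset = self.offset_conv(torch.cat([neigh_feat, ref_feat], dim=1))
        return self.deform_conv(neigh_feat, offset)

aligned = DeformAlign()(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
```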
5. The TSA module computes embeddings from the aligned features of frame t and its adjacent frames through simple convolution filters, takes their dot product, and limits the output to [0, 1] with a sigmoid function, which yields the frame similarity in the embedding space; the purpose of temporal attention is precisely to compute frame similarity in this embedding space, the intuition being that adjacent frames more similar to the reference frame should receive more attention. The temporal attention map is then multiplied pixel-wise with the original aligned features; these features are aggregated by an additional fusion convolutional layer; a spatial attention mask is computed from the fused features, after which the fused features are modulated by the mask (by element-wise multiplication and addition), as sketched below.
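The temporal-then-spatial attention described above could look roughly like this (a sketch consistent with the description; the embedding convolutions, the fusion layer and the final modulation are illustrative assumptions, not the patent's exact configuration):

```python
import torch
import torch.nn as nn

class TSAFusion(nn.Module):
    def __init__(self, ch=64, n_frames=3, center=1):
        super().__init__()
        self.center = center                              # index of reference frame t
        self.emb_ref = nn.Conv2d(ch, ch, 3, padding=1)    # embedding of reference features
        self.emb_nbr = nn.Conv2d(ch, ch, 3, padding=1)    # embedding of each frame
        self.fuse = nn.Conv2d(n_frames * ch, ch, 1)       # fusion over weighted frames
        self.spatial = nn.Conv2d(ch, ch, 3, padding=1)    # spatial attention mask

    def forward(self, aligned):                           # aligned: (B, T, C, H, W)
        b, t, c, h, w = aligned.shape
        ref = self.emb_ref(aligned[:, self.center])
        weighted = []
        for i in range(t):
            emb = self.emb_nbr(aligned[:, i])
            # frame similarity in the embedding space, squashed to [0, 1] by sigmoid
            sim = torch.sigmoid((emb * ref).sum(dim=1, keepdim=True))
            weighted.append(aligned[:, i] * sim)          # pixel-wise temporal attention
        fused = self.fuse(torch.cat(weighted, dim=1))
        mask = torch.sigmoid(self.spatial(fused))
        return fused * mask + fused                       # modulate: multiply and add

out = TSAFusion()(torch.rand(2, 3, 64, 32, 32))           # -> (2, 64, 32, 32)
```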
6. The detailed execution flow of the invention is as follows:
6.1) the previous frame x_{t-1} and the current frame x_t are input into the optical flow network F to generate motion information v_t;
6.2) v_t is upscaled by bicubic interpolation;
6.3) the upscaled v_t is fused with the high-resolution frame g_{t-1} obtained by super-resolving the previous frame, giving the warped result w(g_{t-1}, v_t);
6.4) w(g_{t-1}, v_t) and x_t are input into the generator network G, which learns the residual r_t between them;
6.5) x_t is bicubically interpolated and the residual learned by the generator network is added to it to generate the high-resolution frame g_t;
6.6) the discriminator receives two sets of inputs, real frames and generated frames; the two sets have the same structure, each comprising three adjacent HR frames, three corresponding bicubically interpolated LR frames, and three warped HR frames;
6.7) training uses the PP loss function, which provides the generator with gradient information on the realism of spatial detail and of temporal changes; by simultaneously taking the spatio-temporal input into account, the discriminator D(s,t) automatically balances spatial and temporal aspects, avoiding inconsistent sharpness and overly smooth results;
[Formula image in the original: the PP loss, as given above]
7. The trained network model is tested and evaluated; the evaluation criterion is PSNR (see the helper sketched below).
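For completeness, a small PSNR helper (an illustrative sketch; the patent only names PSNR as the criterion and does not specify the computation):

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two image tensors in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```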
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the technical principle of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.

Claims (4)

1. A 3-dimensional signal processing method based on space-time countermeasure, characterized in that a spatio-temporal adversarial network is constructed, comprising a recurrent generator G, an optical flow estimation network F and a spatio-temporal discriminator D(s,t);
the recurrent generator G recursively generates high-resolution video frames from a low-resolution input; the optical flow estimation network F learns the motion compensation between frames; the spatio-temporal discriminator penalizes unrealistic temporal discontinuities in the results without over-smoothing the image content;
the recurrent generator is based on a recurrent convolutional network coupled with the optical flow estimation network F; the recurrent generator produces a high-resolution (HR) output g_t from the low-resolution (LR) frame x_t and recursively reuses the previously generated HR output g_{t-1}; the recurrent generator learns residual information, which is then added to the bicubically interpolated low-resolution input;
the spatio-temporal discriminator receives two sets of inputs, ground-truth frames and generated frames; the two sets have the same structure, each comprising three adjacent HR frames, three correspondingly upsampled LR frames, and three warped HR frames; training through the loss function provides the recurrent generator with gradient information on the realism of spatial detail and of temporal changes; by taking both spatial and temporal inputs into account, the spatio-temporal discriminator D(s,t) automatically balances spatial and temporal aspects, avoiding inconsistent sharpness and overly smooth results.
2. The spatio-temporal countermeasure-based 3-dimensional signal processing method according to claim 1, comprising the steps of:
1) the previous frame x_{t-1} and the current frame x_t are input into the optical flow estimation network F to generate motion information v_t;
2) v_t is upscaled by bicubic interpolation;
3) the upscaled v_t is fused with the high-resolution frame g_{t-1} obtained by super-resolving the previous frame, giving the warped result w(g_{t-1}, v_t);
4) w(g_{t-1}, v_t) and x_t are input into the recurrent generator network G, which learns the residual r_t between them;
5) x_t is bicubically interpolated and the residual learned by the recurrent generator network is added to it to generate the high-resolution frame g_t;
6) the spatio-temporal discriminator receives two sets of inputs, real frames and generated frames; the two sets have the same structure, each comprising three adjacent HR frames, three corresponding bicubically interpolated LR frames, and three warped HR frames;
7) the generator is penalized by the discriminator.
3. The spatio-temporal countermeasure-based 3-dimensional signal processing method according to claim 1, characterized in that a PP loss function is applied for training; the spatio-temporal adversarial PP-Loss successfully eliminates drifting artifacts while retaining appropriate high-frequency details; furthermore, such a loss construction effectively increases the size of the training data set and thus also represents a useful form of data augmentation.
4. The space-time countermeasure based 3-dimensional signal processing method according to claim 1, characterized in that a PCD alignment module and a TSA fusion module are added; the PCD module is an alignment module combining a pyramid structure, a cascade structure and deformable convolution, in which frames are aligned at the feature level by progressive refinement with deformable convolution; to effectively fuse different frames under diverse motion and blur conditions, a spatio-temporal attention fusion module is added: in the TSA fusion module, the attention mechanism is applied in both time and space to emphasize the features important for subsequent restoration.
CN202011083124.2A 2020-10-12 2020-10-12 3-dimensional signal processing method based on space-time countermeasure Pending CN112215140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011083124.2A CN112215140A (en) 2020-10-12 2020-10-12 3-dimensional signal processing method based on space-time countermeasure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011083124.2A CN112215140A (en) 2020-10-12 2020-10-12 3-dimensional signal processing method based on space-time countermeasure

Publications (1)

Publication Number Publication Date
CN112215140A 2021-01-12

Family

ID=74054420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011083124.2A Pending CN112215140A (en) 2020-10-12 2020-10-12 3-dimensional signal processing method based on space-time countermeasure

Country Status (1)

Country Link
CN (1) CN112215140A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190091806A (en) * 2018-01-29 2019-08-07 한국과학기술원 Video sequences generating system using generative adversarial networks and the method thereof
CN110070511A (en) * 2019-04-30 2019-07-30 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN110533615A (en) * 2019-08-30 2019-12-03 上海大学 A kind of old film large area method for repairing damage based on generation confrontation network
CN111311490A (en) * 2020-01-20 2020-06-19 陕西师范大学 Video super-resolution reconstruction method based on multi-frame fusion optical flow

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MENGYU CHU et al.: "Learning Temporal Coherence via Self-Supervision for GAN-based Video Generation", arXiv, 21 May 2020 (2020-05-21), pages 1-19 *
XINTAO WANG et al.: "EDVR: Video Restoration With Enhanced Deformable Convolutional Networks", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 9 April 2020 (2020-04-09), pages 1954-1963 *
陈聪颖 (Chen Congying): "Research and Application Based on Video Super-Resolution", China Master's Theses Full-text Database, Information Science and Technology, no. 07, 15 July 2020 (2020-07-15), pages 23-26 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077385A (en) * 2021-03-30 2021-07-06 上海大学 Video super-resolution method and system based on countermeasure generation network and edge enhancement
CN113642498A (en) * 2021-08-20 2021-11-12 浙江大学 Video target detection system and method based on multilevel space-time feature fusion
CN113642498B (en) * 2021-08-20 2024-05-03 浙江大学 Video target detection system and method based on multilevel space-time feature fusion

Similar Documents

Publication Publication Date Title
Liu et al. Video super-resolution based on deep learning: a comprehensive survey
CN110969577B (en) Video super-resolution reconstruction method based on deep double attention network
CN111028150B (en) Rapid space-time residual attention video super-resolution reconstruction method
CN111667424B (en) Unsupervised real image denoising method
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN108259994B (en) Method for improving video spatial resolution
CN113837938B (en) Super-resolution method for reconstructing potential image based on dynamic vision sensor
CN106952228A (en) The super resolution ratio reconstruction method of single image based on the non local self-similarity of image
CN111739082A (en) Stereo vision unsupervised depth estimation method based on convolutional neural network
CN112529776B (en) Training method of image processing model, image processing method and device
CN108989731B (en) Method for improving video spatial resolution
CN108280804A (en) A kind of multi-frame image super-resolution reconstruction method
CN111242999B (en) Parallax estimation optimization method based on up-sampling and accurate re-matching
CN110363068A (en) A kind of high-resolution pedestrian image generation method based on multiple dimensioned circulation production confrontation network
CN112215140A (en) 3-dimensional signal processing method based on space-time countermeasure
Yuan et al. Single image dehazing via NIN-DehazeNet
WO2024040973A1 (en) Multi-scale fused dehazing method based on stacked hourglass network
Liang et al. Video super-resolution reconstruction based on deep learning and spatio-temporal feature self-similarity
Zhang et al. Learning stacking regressors for single image super-resolution
Xin et al. Video face super-resolution with motion-adaptive feedback cell
CN114494050A (en) Self-supervision video deblurring and image frame inserting method based on event camera
CN116895037A (en) Frame insertion method and system based on edge information and multi-scale cross fusion network
CN116091337A (en) Image enhancement method and device based on event signal nerve coding mode
Qin et al. Remote sensing image super-resolution using multi-scale convolutional neural network
CN112907456B (en) Deep neural network image denoising method based on global smooth constraint prior model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination