CN113610707A - Video super-resolution method based on time attention and cyclic feedback network - Google Patents

Video super-resolution method based on time attention and cyclic feedback network

Info

Publication number
CN113610707A
CN113610707A
Authority
CN
China
Prior art keywords
super
resolution
video
target frame
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110838280.3A
Other languages
Chinese (zh)
Other versions
CN113610707B (en)
Inventor
张庆武
朱鉴
蔡金峰
陈炳丰
蔡瑞初
郝志峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202110838280.3A priority Critical patent/CN113610707B/en
Publication of CN113610707A publication Critical patent/CN113610707A/en
Application granted granted Critical
Publication of CN113610707B publication Critical patent/CN113610707B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting, based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Television Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video super-resolution method based on temporal attention and a cyclic feedback network. The method applies three observations to the video super-resolution task: adjacent frames at different distances from the target frame contribute different amounts of visual information to the reconstruction result; the human visual system relies on feedback mechanisms; and humans learn new knowledge through cyclic feedback and guidance. A temporal attention module learns an attention map of the video sequence along the time axis, which effectively distinguishes the contributions of adjacent frames at different temporal distances to the final reconstruction. After the video sequence is rearranged, a cyclic feedback module performs repeated feedback super-resolution, finally yielding a super-resolution network model that preferentially learns the information contributing most to super-resolution reconstruction and has a strong capability for learning high-level features, thereby improving the video super-resolution result.

Description

Video super-resolution method based on time attention and cyclic feedback network
Technical Field
The invention relates to the technical field of video processing, and in particular to a video super-resolution method based on temporal attention and a cyclic feedback network.
Background
Video super-resolution generates a high-resolution video from a low-resolution one. As a classic computer vision problem it has been studied for decades; it is of theoretical interest and is urgently needed in practice. In video surveillance, banks, stations, airports and residential areas are equipped with many cameras, and video super-resolution improves video quality and makes it easier to observe details of people and objects. In traffic management, a camera observes a large scene, so details of fast-moving vehicles and pedestrians cannot be captured; video super-resolution reconstruction can reproduce a violation or an accident in greater detail and help identify a license plate or a face within the large scene. In criminal investigation, low-resolution footage obtained at a crime scene (for example from cameras in banks or on streets) can be enhanced with video super-resolution. In sports, many fast-moving objects must be captured (tennis, table tennis, and so on), and super-resolution reconstruction helps observe the details of these dynamic events more clearly. With the development of the related theory and technology, video super-resolution has become one of the hot research topics in computer vision.
Compared with single-image super-resolution, the video super-resolution task adds temporal information. According to how that temporal information is used, deep-learning-based video super-resolution techniques can be broadly divided into (1) methods based on multi-frame concatenation; (2) methods based on 3D convolution; (3) methods based on recurrent structures.
Methods based on multi-frame concatenation can be viewed as extending single-frame super-resolution to multi-frame input. To exploit temporal information well, such a method must align the adjacent frames to the target frame; alignment is performed either by optical flow or by deformable convolution. The EDVR network proposed by Wang et al. [1] (Wang X, Chan K C K, Yu K, et al. EDVR: Video Restoration with Enhanced Deformable Convolutional Networks [C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2019) uses deformable-convolution alignment. The RBPN network [2] (Haris M, Shakhnarovich G, Ukita N. Recurrent Back-Projection Network for Video Super-Resolution [C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019) uses optical-flow alignment and exploits adjacent-frame information by combining the ideas of SISR and MISR; because alignment is performed at the pixel level, it often introduces excess noise, which degrades the accuracy of the final reconstruction. Multi-frame concatenation exploits the complementarity of multi-frame information well, but it merely concatenates and fuses features and does not truly model the motion between frames.
Methods based on 3D convolution exploit the fact that 3D convolutions can learn temporal information to process the timing information in a video; Caballero et al. first observed that 3D convolution can be viewed as a slow inter-frame information fusion process. Huang et al. proposed the BRCN model [3] (Huang Y, Wang W, Wang L. Bidirectional Recurrent Convolutional Networks for Multi-Frame Super-Resolution. MIT Press, 2015), which combines the idea of 3D convolution with an RNN, but their network is still shallow and the information it can learn is limited. FSTRN [4] (Li S, He F, Du B, et al. Fast Spatio-Temporal Residual Network for Video Super-Resolution, 2019), proposed by Li et al., therefore adopts a deep 3D convolutional network with skip connections, in which separable 3D convolutions reduce the computational cost of 3D convolution.
Methods based on recurrent structures fuse the temporal information in a video through RNNs, LSTMs and the like. The first of these methods was a bidirectional RNN, whose network capacity was small and which lacked a subsequent inter-frame alignment step. Guo et al. improved the bidirectional RNN with a motion-compensation module and a convolutional LSTM layer. Recent advances in video super-resolution (VSR) have demonstrated the strength of deep learning, which achieves better reconstruction performance. However, existing deep-learning-based video SR methods essentially fuse the input multi-frame temporal information gradually and produce the final result after a single reconstruction. Existing methods (1) do not fully exploit, in their use of temporal information, the fact that adjacent frames at different distances from the target frame contribute different amounts of visual information to the reconstruction; and (2) do not exploit the feedback mechanisms common in the human visual system, nor the cyclic feedback and guidance that characterize how humans learn new knowledge.
Disclosure of Invention
The invention aims to overcome at least one of the above technical problems by providing a video super-resolution method based on temporal attention and a cyclic feedback network, which constructs a model that preferentially learns the information contributing most to super-resolution reconstruction and has a strong capability for learning high-level features, effectively improving the video super-resolution result.
In order to solve the technical problems, the technical scheme of the invention is as follows:
A video super-resolution method based on temporal attention and a cyclic feedback network comprises the following steps:
S1: constructing a super-resolution network model comprising a temporal attention module and a cyclic feedback module;
S2: acquiring a public video super-resolution training data set from the network and preprocessing it to obtain training low-resolution (LR) video sequences;
S3: determining the target frame to be super-resolved, and upsampling it to obtain a preliminary super-resolution result of the target frame that lacks details;
S4: inputting the LR video sequence and the preliminary super-resolution result into the super-resolution network model, extracting feature maps of the LR video sequence, and aligning the feature maps to the target frame with deformable convolution to obtain an aligned LR feature map sequence;
S5: inputting the aligned LR feature map sequence into the temporal attention module to obtain the LR feature map sequence after attention along the time dimension;
S6: after reordering the LR feature map sequence, inputting it into the cyclic feedback module for cyclic feedback super-resolution, and obtaining the target frame's sequence of cyclic feedback super-resolution results;
S7: setting a loss function over the cyclic feedback super-resolution result sequence, training the super-resolution network model, and obtaining the trained super-resolution network model;
S8: performing super-resolution reconstruction on the video to be super-resolved with the trained super-resolution network model.
Wherein the video super-resolution training dataset is obtained from the existing public high-resolution dataset Vimeo-90k.
In step S2, the preprocessing specifically comprises:
S21: cropping, for every training video, the same number of original video frames at the same position;
S22: downsampling the original video frames to obtain LR video frames;
S23: converting all LR video frames into tensors and normalizing them;
S24: applying random data augmentation to the normalized LR video frames.
In step S22, the original video frames are downsampled with Gaussian-blur-kernel downsampling.
In step S3, bicubic interpolation upsampling is applied to the target frame to be super-resolved, yielding a preliminary super-resolution result of the target frame that lacks details.
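For illustration, the two operations just described, Gaussian-blur-kernel degradation (step S22) and bicubic pre-upsampling of the target frame (step S3), can be sketched as follows; the x4 scale matches the embodiment below, while the kernel size, sigma and function names are assumptions of this example only:

    import torch
    import torch.nn.functional as F
    import torchvision.transforms.functional as TF

    def degrade_to_lr(hr_frames: torch.Tensor, scale: int = 4) -> torch.Tensor:
        """Gaussian-blur an HR clip (T, C, H, W), then subsample it by `scale`."""
        blurred = TF.gaussian_blur(hr_frames, kernel_size=13, sigma=1.6)  # assumed values
        return blurred[..., ::scale, ::scale]  # decimate rows and columns

    def preliminary_sr(target_lr: torch.Tensor, scale: int = 4) -> torch.Tensor:
        """Bicubic up-sampling of the LR target frame (C, H, W): a coarse, detail-free SR."""
        up = F.interpolate(target_lr.unsqueeze(0), scale_factor=scale,
                           mode="bicubic", align_corners=False)
        return up.squeeze(0).clamp(0.0, 1.0)  # frames are normalized to [0, 1]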
The super-resolution network model further comprises a multi-scale feature extraction module. In step S4, the LR video sequence is input into the multi-scale feature extraction module, and feature maps at k sizes are obtained for each video frame, where k is a positive integer.
Specifically, the alignment to the target frame with deformable convolution uses the PCD feature alignment module at the front end of the EDVR model: the extracted feature map at each size is input into the feature alignment module, and deformable-convolution alignment is performed stage by stage from the smallest size upward, yielding a feature map sequence $(F_1, \dots, F_c, \dots, F_n)$ aligned to the target frame, where $n$ denotes the number of frames of the input LR video sequence, $F_n$ denotes the LR feature map of the $n$-th video frame, and $F_c$ denotes the LR feature map of the target frame.
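A minimal single-level sketch of deformable-convolution alignment is given below; the actual PCD module of EDVR is pyramidal and cascaded, so the layer layout here is an illustrative simplification, with offsets predicted from the concatenated neighbor and target features:

    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class DeformAlign(nn.Module):
        """Align a neighbor frame's feature map to the target frame (single scale)."""
        def __init__(self, channels: int = 64, kernel: int = 3):
            super().__init__()
            # Two offset values (dx, dy) per kernel sampling location.
            self.offset_conv = nn.Conv2d(2 * channels, 2 * kernel * kernel,
                                         kernel_size=3, padding=1)
            self.deform = DeformConv2d(channels, channels,
                                       kernel_size=kernel, padding=kernel // 2)

        def forward(self, neighbor: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
            offsets = self.offset_conv(torch.cat([neighbor, target], dim=1))
            return self.deform(neighbor, offsets)  # sample neighbor at learned offsets

In the multi-scale version, offsets estimated at a small size are upsampled and refined at the next larger size, which is what the stage-by-stage alignment above refers to.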
Wherein, in step S5, the temporal attention module consists of a BN layer and a convolutional layer; the specific procedure is as follows:
The aligned LR feature map sequence $(F_1, \dots, F_c, \dots, F_n)$ is input into the temporal attention module, and after the BN layer and the convolutional layer a single-channel feature map $(F_1^{a}, \dots, F_c^{a}, \dots, F_n^{a})$ is obtained for each frame; these maps are then concatenated. Weights are computed along the time dimension with a softmax function, yielding attention weight maps $(M_1, \dots, M_c, \dots, M_n)$ in which the $n$ weights at each spatial position sum to 1. Finally, the weight maps are multiplied with the aligned LR feature maps to obtain the attended LR feature map sequence $(F_1^{at}, \dots, F_c^{at}, \dots, F_n^{at})$, namely:

$F_i^{at} = M_i \odot F_i, \quad i \in \{1, \dots, n\}.$
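The attention computation transcribes almost directly into code; the 3x3 kernel follows the embodiment below, while the channel width of 64 is an assumption of this example:

    import torch
    import torch.nn as nn

    class TemporalAttention(nn.Module):
        """Softmax attention over the time axis of an aligned clip (B, T, C, H, W)."""
        def __init__(self, channels: int = 64):
            super().__init__()
            self.bn = nn.BatchNorm2d(channels)
            self.to_score = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            b, t, c, h, w = feats.shape
            scores = self.to_score(self.bn(feats.reshape(b * t, c, h, w)))  # (B*T, 1, H, W)
            attn = scores.reshape(b, t, 1, h, w).softmax(dim=1)  # weights sum to 1 over T
            return feats * attn  # broadcast over the channel dimension

Because the softmax is taken along the time dimension, the T weights at every spatial position sum to 1, so frames that contribute more to reconstructing the target frame receive proportionally larger weights.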
In step S6, the cyclic feedback super-resolution process specifically comprises:
S61: after the LR feature map sequence is reordered, inputting the first feature map of the reordered sequence into the cyclic feedback module for the first cyclic feedback super-resolution pass to obtain a super-resolution feature map;
S62: reconstructing the super-resolution feature map obtained in this pass to obtain this pass's super-resolution residual information of the target frame, and adding it to the preliminary super-resolution result of the target frame to obtain this pass's super-resolution result of the target frame;
S63: following the order of the LR feature map sequence, inputting the corresponding feature map and the super-resolution feature map output by the previous pass into the cyclic feedback module for the next cyclic feedback super-resolution pass until the cycles are finished, the per-pass super-resolution results of the target frame forming the target frame's cyclic feedback super-resolution result sequence.
Wherein, the step S61 specifically comprises:
The LR feature map sequence $(F_1^{at}, \dots, F_c^{at}, \dots, F_n^{at})$ is reordered from near to far by distance from the target frame, and the feature map of the target frame is reused in the middle and at the end of the sequence to guide the residual information extraction of the cyclic feedback super-resolution module; renumbering the result gives the sequence

$(\tilde{F}_1^{at}, \tilde{F}_2^{at}, \dots, \tilde{F}_{n+2}^{at}).$

The reordered LR feature map sequence $(\tilde{F}_1^{at}, \dots, \tilde{F}_{n+2}^{at})$ is input into the cyclic feedback module, which performs n+2 cyclic feedback super-resolution passes in the order of the feature maps; the input of each pass is the LR video frame feature map corresponding to that pass together with the feature map output at the end of the previous pass, and the output is this pass's super-resolution feature map, namely:

$F_{out}^{n} = f_{FB}(\tilde{F}_n^{at}, F_{out}^{n-1}),$

where $F_{out}^{n}$ denotes the super-resolution feature map of the target frame output by the $n$-th feedback pass, $f_{FB}(\cdot)$ denotes the cyclic feedback super-resolution module, and $F_{out}^{n-1}$ denotes the super-resolution feature map of the target frame output by the $(n-1)$-th pass. In the first pass there is no previous output, so the target frame's own feature map is used, namely:

$F_{out}^{1} = f_{FB}(\tilde{F}_1^{at}, F_c^{at}).$
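A sketch of the reordering and the feedback recurrence, written for the 5-frame embodiment with the middle frame as target; the internals of $f_{FB}$ are not fixed by the text, so the cell below is an assumed minimal residual fusion block:

    import torch
    import torch.nn as nn

    def reorder(feats: list, c: int) -> list:
        """Near-to-far reorder around target index c, reusing the target map in the
        middle and at the end; one possible 5-frame order: [F3, F2, F4, F3, F1, F5, F3]."""
        return [feats[c], feats[c - 1], feats[c + 1], feats[c],
                feats[c - 2], feats[c + 2], feats[c]]

    class FeedbackCell(nn.Module):
        """f_FB: fuse the current pass's feature map with the previous pass's output."""
        def __init__(self, channels: int = 64):
            super().__init__()
            self.fuse = nn.Sequential(
                nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1))

        def forward(self, f_t: torch.Tensor, prev_out: torch.Tensor) -> torch.Tensor:
            return prev_out + self.fuse(torch.cat([f_t, prev_out], dim=1))

    # cell = FeedbackCell()
    # hidden, outs = feats[c], []          # the first pass reuses the target frame's map
    # for f in reorder(feats, c=2):        # 7 feedback passes
    #     hidden = cell(f, hidden)
    #     outs.append(hidden)              # one SR feature map per pass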
In step S62, the super-resolution feature map obtained by each feedback pass is reconstructed to obtain that pass's reconstructed super-resolution residual information of the target frame, which is added to the preliminary super-resolution result of the target frame to obtain that pass's super-resolution result of the target frame; specifically:
The target frame's super-resolution feature map $F_{out}^{n}$ for this pass is input into the super-resolution reconstruction module for reconstruction, which yields the target frame's reconstructed residual information for this pass, namely:

$I_{res}^{n} = f_{RB}(F_{out}^{n}),$

where $I_{res}^{n}$ denotes the super-resolution reconstructed residual information of the target frame at the $n$-th pass and $f_{RB}(\cdot)$ denotes the reconstruction module. The target frame's super-resolution reconstructed residual information is added, pixel-wise at corresponding positions, to the preliminary super-resolution result of the target frame obtained in step S3, yielding this pass's super-resolved video frame of the target frame, namely:

$I_{SR}^{n} = I_{res}^{n} + f_{up}(I_C),$

where $I_{SR}^{n}$ denotes the super-resolved video frame of the target frame at the $n$-th pass, $f_{up}(\cdot)$ denotes the upsampling operation, and $I_C$ denotes the target frame.
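A sketch of this reconstruction step; the text fixes only the input and output roles of $f_{RB}$, so the sub-pixel (PixelShuffle) layout below is an assumption of this example:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ReconstructionModule(nn.Module):
        """f_RB: map a super-resolution feature map to an HR residual image (x4)."""
        def __init__(self, channels: int = 64, scale: int = 4):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels * scale * scale, 3, padding=1),
                nn.PixelShuffle(scale),                # (C*s*s, H, W) -> (C, s*H, s*W)
                nn.Conv2d(channels, 3, 3, padding=1))  # 3-channel residual image

        def forward(self, f_out: torch.Tensor) -> torch.Tensor:
            return self.body(f_out)

    # residual = ReconstructionModule()(hidden)                           # I_res^n
    # sr = residual + F.interpolate(target_lr, scale_factor=4,
    #                               mode="bicubic", align_corners=False)  # I_SR^n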
In step S7, the loss function is an L2-norm loss, specifically:

$L = \sum_{n=1}^{N} W_n \left\| I_{SR}^{n} - I_{HR} \right\|_2^2,$

where $N$ is the number of feedback passes, $W_n$ denotes the weight, within the total loss, of the loss computed on the target frame's super-resolution result $I_{SR}^{n}$ from the $n$-th pass, and $I_{HR}$ denotes the ground truth of the target frame.
The video super-resolution training data set is then used to iteratively train the constructed super-resolution network model, finally yielding the trained super-resolution network model.
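This weighted multi-pass L2 loss transcribes directly; whether the squared error is summed or averaged over pixels is not specified, so mean reduction is assumed here:

    import torch

    def feedback_loss(sr_frames: list, hr_frame: torch.Tensor, weights: list) -> torch.Tensor:
        """Weighted sum of per-pass L2 losses against the target frame's ground truth."""
        loss = sr_frames[0].new_zeros(())
        for w, sr in zip(weights, sr_frames):
            loss = loss + w * torch.mean((sr - hr_frame) ** 2)
        return loss

    # loss = feedback_loss(outs_sr, hr, weights=[1.0] * 7)  # the embodiment sets W_n = 1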
In this scheme, the target frame to be super-resolved in the video sequence is first upsampled with bicubic interpolation, yielding a preliminary super-resolution result of the target frame that lacks details. The LR video sequence, obtained by degrading the video frames of the training data set with a Gaussian blur kernel, is then input into the video super-resolution network model for feature map extraction and feature map alignment, producing the aligned LR feature maps of the video frame sequence. The resulting LR feature map sequence passes through the temporal attention module, which learns an attention map of the video sequence along the time axis and thereby distinguishes the contributions of adjacent frames at different temporal distances to the final reconstruction.
Next, the temporally attended LR feature maps of the video frame sequence are rearranged from near to far by distance from the target frame, and the target frame's feature map is reused in the middle and at the end of the sequence to guide, via cyclic feedback, the feature learning of the distant frames. The rearranged LR feature map sequence then undergoes pass-by-pass cyclic feedback super-resolution, yielding super-resolution feature maps carrying higher-level features. The target frame's super-resolution feature map is reconstructed to obtain the target frame's reconstructed super-resolution residual information, which is added to the target frame's preliminary super-resolution frame to obtain the target frame's super-resolved video frame. The cycle continues until all feature maps have been input into the cyclic feedback module, producing the target frame's sequence of super-resolved frames and completing the super-resolution. Setting a loss function and training the video super-resolution network model yields the trained model, which is then used for super-resolution reconstruction of the video to be super-resolved. The method effectively improves the video super-resolution result, and the detail of the reconstructed video frames improves markedly.
The video super-resolution method based on temporal attention and a cyclic feedback network applies, to the video super-resolution task, the observation that adjacent frames at different distances from the target frame contribute different amounts of visual information to the reconstruction, together with the feedback mechanism of the human visual system and the cyclic feedback and guidance that characterize how humans learn new knowledge. The resulting model preferentially learns the information contributing most to super-resolution reconstruction and has a strong capability for learning high-level features, improving the video super-resolution result.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a video super-resolution method based on time attention and a loop feedback network, which adopts a time attention module to learn an attention diagram of a video sequence on a time axis, and can effectively distinguish the contribution of adjacent frames with different time degrees to the final reconstruction effect; after the video sequences are rearranged, the cyclic feedback module carries out cyclic feedback super-division to finally obtain a super-resolution network model, the video super-resolution effect is obviously improved, and the detail reconstruction effect of the reconstructed video frames is better and obvious.
Drawings
FIG. 1 is a flow chart of the method according to an embodiment of the present invention;
FIG. 2 is an unrolled view of the cyclic feedback module according to an embodiment of the present invention;
FIG. 3 is a data flow diagram of the system according to an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, a video super-resolution method based on temporal attention and a cyclic feedback network comprises the following steps:
S1: constructing a super-resolution network model comprising a temporal attention module and a cyclic feedback module;
S2: acquiring a public video super-resolution training data set from the network and preprocessing it to obtain training low-resolution (LR) video sequences;
In this embodiment, videos from the existing public high-resolution dataset Vimeo-90k are selected as training video data, and the video data are preprocessed.
S3: determining the target frame to be super-resolved, and upsampling it to obtain a preliminary super-resolution result of the target frame that lacks details;
In this embodiment, each training sample contains 5 frames; the middle frame is selected as the target frame to be super-resolved, and bicubic interpolation upsampling is applied to it to obtain the preliminary super-resolved video frame.
S4: inputting the LR video sequence and the preliminary super-resolution result into the super-resolution network model, extracting feature maps of the LR video sequence, and aligning the feature maps to the target frame with deformable convolution to obtain an aligned LR feature map sequence;
In this embodiment, as shown in fig. 2, the video super-resolution network model comprises a multi-scale feature extraction module, a deformable convolution alignment module, a temporal attention module, a cyclic feedback module and a feature super-resolution module. The normalized 5-frame video sequence is input into the multi-scale feature extraction module, which consists of 5 basic residual blocks and obtains a multi-scale feature group by convolutional downsampling; the deformable convolution alignment module is specifically the PCD feature alignment module at the front end of the existing EDVR model.
The normalized 5 LR video frames are input into the multi-scale feature extraction module, and feature maps at 3 sizes, from large to small, are obtained for each frame;
The feature map at each size is input into the feature alignment module, which performs deformable-convolution alignment from the smallest size upward and fuses the feature maps of different sizes, yielding the aligned feature maps of the 5-frame video sequence.
S5: inputting the aligned 5-frame LR feature map sequence into the temporal attention module to obtain the LR feature map sequence after attention along the time dimension, the temporal attention module consisting of a BN layer and a 3x3 convolutional layer;
The aligned feature map sequence $(F_1, F_2, F_3, F_4, F_5)$ first passes through the BN layer and then through a 3x3 convolution to compute the single-channel feature maps $(F_1^{a}, F_2^{a}, F_3^{a}, F_4^{a}, F_5^{a})$. These are concatenated and weighted along the time dimension with a softmax function, giving attention weight maps $(M_1, M_2, M_3, M_4, M_5)$ whose 5 values at each spatial position sum to 1 along the time axis; finally the weight maps are multiplied with the aligned feature map sequence to obtain the attended LR feature map sequence $(F_1^{at}, F_2^{at}, F_3^{at}, F_4^{at}, F_5^{at})$, namely:

$F_i^{at} = M_i \odot F_i, \quad i \in \{1, \dots, 5\}.$
S6: after reordering the LR feature map sequence, inputting it into the cyclic feedback module for cyclic feedback super-resolution, and obtaining the target frame's sequence of cyclic feedback super-resolution results;
S7: setting a loss function over the cyclic feedback super-resolution result sequence, training the super-resolution network model, and obtaining the trained super-resolution network model;
S8: performing super-resolution reconstruction on the video to be super-resolved with the trained super-resolution network model.
More specifically, in step S2, the preprocessing specifically comprises:
S21: cropping, from every training video, original video frames of width 448 and height 256 at the same position;
S22: downsampling the original video frames by a factor of 4 with Gaussian-blur-kernel downsampling, yielding LR video frames of width 112 and height 64;
S23: converting all LR video frames into tensors and normalizing them;
S24: applying random data augmentation to the normalized LR video frames, the augmentation comprising flipping and mirroring.
More specifically, in step S6, the cyclic feedback super-resolution process specifically comprises:
S61: after the LR feature map sequence is reordered, inputting the first feature map of the reordered sequence into the cyclic feedback module for the first cyclic feedback super-resolution pass to obtain a super-resolution feature map;
S62: reconstructing the super-resolution feature map obtained in this pass to obtain this pass's super-resolution residual information of the target frame, and adding it to the preliminary super-resolution result of the target frame to obtain this pass's super-resolution result of the target frame;
S63: following the order of the LR feature map sequence, inputting the corresponding feature map and the super-resolution feature map output by the previous pass into the cyclic feedback module for the next cyclic feedback super-resolution pass until the cycles are finished, the per-pass super-resolution results of the target frame forming the target frame's cyclic feedback super-resolution result sequence.
More specifically, the step S61 specifically comprises:
The LR feature map sequence $(F_1^{at}, F_2^{at}, F_3^{at}, F_4^{at}, F_5^{at})$ is reordered from near to far by distance from the target frame, and the target frame's feature map is reused in the middle and at the end of the sequence to guide the residual information extraction of the cyclic feedback super-resolution module, giving a sequence of the form

$(F_3^{at}, F_2^{at}, F_4^{at}, F_3^{at}, F_1^{at}, F_5^{at}, F_3^{at});$

after renumbering the subscripts this becomes $(\tilde{F}_1^{at}, \tilde{F}_2^{at}, \tilde{F}_3^{at}, \tilde{F}_4^{at}, \tilde{F}_5^{at}, \tilde{F}_6^{at}, \tilde{F}_7^{at})$.
As shown in fig. 3, in this embodiment the 7 LR feature maps $(\tilde{F}_1^{at}, \dots, \tilde{F}_7^{at})$ are input into the feedback module, which performs cyclic feedback super-resolution in the order of the feature maps; the input of each pass is the LR video frame feature map corresponding to that pass together with the target frame's super-resolution feature map output at the end of the previous pass, and the output is this pass's super-resolution feature map.
Pass 1, n = 1:

$F_{out}^{1} = f_{FB}(\tilde{F}_1^{at}, F_{out}^{0}),$

where $F_{out}^{1}$ denotes the super-resolution feature map of the target frame output by the 1st feedback pass, $f_{FB}(\cdot)$ denotes the cyclic feedback super-resolution module, and $F_{out}^{0}$ denotes the super-resolution feature map of the notional 0th pass; in the first pass the target frame's own feature map is used, namely:

$F_{out}^{1} = f_{FB}(\tilde{F}_1^{at}, F_3^{at}).$

The target frame's super-resolution feature map $F_{out}^{1}$ for this pass is input into the super-resolution reconstruction module for reconstruction, which yields the target frame's reconstructed residual information for this pass, namely:

$I_{res}^{1} = f_{RB}(F_{out}^{1}),$

where $I_{res}^{1}$ denotes the super-resolution reconstructed residual information of the target frame at the 1st pass and $f_{RB}(\cdot)$ denotes the reconstruction module.
The target frame's super-resolution reconstructed residual information is added, pixel-wise at corresponding positions, to the preliminary super-resolution result of the target frame obtained in step S3, yielding this pass's super-resolved video frame of the target frame, namely:

$I_{SR}^{1} = I_{res}^{1} + f_{up}(I_C),$

where $I_{SR}^{1}$ denotes the super-resolved video frame of the target frame at the 1st pass, $f_{up}(\cdot)$ denotes the upsampling operation, and $I_C$ denotes the target frame.
The LR feature map sequence $(\tilde{F}_1^{at}, \dots, \tilde{F}_7^{at})$ is then fed into the cyclic feedback module pass by pass until the cycle ends, giving the target frame's sequence of 7 cyclic feedback super-resolution results.
Pass 2, n = 2:

$F_{out}^{2} = f_{FB}(\tilde{F}_2^{at}, F_{out}^{1}),$
$I_{res}^{2} = f_{RB}(F_{out}^{2}),$
$I_{SR}^{2} = I_{res}^{2} + f_{up}(I_C),$

where $F_{out}^{2}$ denotes the super-resolution feature map of the target frame output by the 2nd feedback pass, $I_{res}^{2}$ denotes the super-resolution reconstructed residual information of the target frame at the 2nd pass, and $I_{SR}^{2}$ denotes the super-resolved video frame of the target frame at the 2nd pass; the passes in between proceed in the same way.
Pass 7, n = 7:

$F_{out}^{7} = f_{FB}(\tilde{F}_7^{at}, F_{out}^{6}),$
$I_{res}^{7} = f_{RB}(F_{out}^{7}),$
$I_{SR}^{7} = I_{res}^{7} + f_{up}(I_C),$

where $F_{out}^{7}$ denotes the super-resolution feature map of the target frame output by the 7th feedback pass, $I_{res}^{7}$ denotes the super-resolution reconstructed residual information of the target frame at the 7th pass, and $I_{SR}^{7}$ denotes the super-resolved video frame of the target frame at the 7th pass.
The per-pass super-resolved video frames of the target frame form the target frame's final super-resolved video frame sequence $(I_{SR}^{1}, I_{SR}^{2}, \dots, I_{SR}^{7})$.
More specifically, in step S7, the loss function is an L2-norm loss, specifically:

$L = \sum_{n=1}^{7} W_n \left\| I_{SR}^{n} - I_{HR} \right\|_2^2,$

where $W_n$ denotes the weight, within the total loss, of the loss computed on the target frame's super-resolution result $I_{SR}^{n}$ from the $n$-th pass, with $n$ running up to 7, and $I_{HR}$ denotes the ground truth of the target frame; in this embodiment all $W_n$ are set to 1.
The video super-resolution training data set is then used to iteratively train the constructed super-resolution network model, finally yielding the trained super-resolution network model.
In this embodiment, all of the target frame's super-resolved video frames $(I_{SR}^{1}, \dots, I_{SR}^{7})$ obtained from the 7 feedback passes are used to compute the loss function, and the super-resolved video frame output by the last pass, $I_{SR}^{7}$, is taken as the super-resolution result of the target frame $I_C$.
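Putting the sketches above together, one training step for a 5-frame clip could look as follows; all module objects (extract, align, attend, cell, recon) and the reorder and feedback_loss helpers are the illustrative, hypothetical ones defined earlier in this description, not the patented implementation itself:

    import torch
    import torch.nn.functional as F

    # Assumed shapes: lr_clip (B, 5, 3, 64, 112), hr_target (B, 3, 256, 448), x4 scale.
    # extract: any per-frame feature extractor producing (B, 64, H, W).
    def training_step(extract, align, attend, cell, recon, lr_clip, hr_target, optimizer):
        c = 2                                          # the middle frame is the target
        feats = [extract(lr_clip[:, t]) for t in range(5)]
        feats = [align(f, feats[c]) for f in feats]    # deformable alignment to target
        feats = list(attend(torch.stack(feats, dim=1)).unbind(dim=1))  # temporal attention

        up = F.interpolate(lr_clip[:, c], scale_factor=4, mode="bicubic",
                           align_corners=False)        # preliminary SR of the target
        hidden, sr_outs = feats[c], []
        for f in reorder(feats, c):                    # 7 feedback passes
            hidden = cell(f, hidden)
            sr_outs.append(recon(hidden) + up)         # residual + preliminary SR

        loss = feedback_loss(sr_outs, hr_target, [1.0] * 7)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        return sr_outs[-1], loss                       # the last pass is the final output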
In a specific implementation, performing super-resolution reconstruction on the video to be super-resolved with the method of this embodiment effectively improves the video super-resolution result with relatively few parameters; the detail of the reconstructed video frames is excellent, providing strong support for technical fields such as satellite imagery, video surveillance, medical imaging and military applications.
It should be understood that the above embodiments of the present invention are merely examples for clearly illustrating the invention and do not limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A video super-resolution method based on temporal attention and a cyclic feedback network, characterized by comprising the following steps:
S1: constructing a super-resolution network model comprising a temporal attention module and a cyclic feedback module;
S2: acquiring a public video super-resolution training data set from the network and preprocessing it to obtain training LR video sequences;
S3: determining the target frame to be super-resolved, and upsampling it to obtain a preliminary super-resolution result of the target frame that lacks details;
S4: inputting the LR video sequence and the preliminary super-resolution result into the super-resolution network model, extracting feature maps of the LR video sequence, and aligning the feature maps to the target frame with deformable convolution to obtain an aligned LR feature map sequence;
S5: inputting the aligned LR feature map sequence into the temporal attention module to obtain the LR feature map sequence after attention along the time dimension;
S6: after reordering the LR feature map sequence, inputting it into the cyclic feedback module for cyclic feedback super-resolution, and obtaining the target frame's sequence of cyclic feedback super-resolution results;
S7: setting a loss function over the cyclic feedback super-resolution result sequence, training the super-resolution network model, and obtaining the trained super-resolution network model;
S8: performing super-resolution reconstruction on the video to be super-resolved with the trained super-resolution network model.
2. The video super-resolution method based on temporal attention and a cyclic feedback network of claim 1, wherein in step S2 the preprocessing specifically comprises:
S21: cropping, for every training video, the same number of original video frames at the same position;
S22: downsampling the original video frames to obtain LR video frames;
S23: converting all LR video frames into tensors and normalizing them;
S24: applying random data augmentation to the normalized LR video frames.
3. The video super-resolution method based on temporal attention and a cyclic feedback network of claim 2, wherein in step S22 the original video frames are downsampled with Gaussian-blur-kernel downsampling.
4. The video super-resolution method based on temporal attention and a cyclic feedback network of claim 1, wherein in step S3 bicubic interpolation upsampling is applied to the target frame to be super-resolved, yielding a preliminary super-resolution result of the target frame that lacks details.
5. The video super-resolution method based on temporal attention and a cyclic feedback network of claim 1, wherein the super-resolution network model further comprises a multi-scale feature extraction module; in step S4, the LR video sequence is input into the multi-scale feature extraction module, and feature maps at k sizes are obtained for each video frame, where k is a positive integer;
specifically, the alignment to the target frame with deformable convolution uses the PCD feature alignment module at the front end of the EDVR model: the extracted feature map at each size is input into the feature alignment module, and deformable-convolution alignment is performed stage by stage from the smallest size upward, yielding a feature map sequence $(F_1, \dots, F_c, \dots, F_n)$ aligned to the target frame, where $n$ denotes the number of frames of the input LR video sequence, $F_n$ denotes the LR feature map of the $n$-th video frame, and $F_c$ denotes the LR feature map of the target frame.
6. The video super-resolution method based on temporal attention and a cyclic feedback network of claim 5, wherein in step S5 the temporal attention module consists of a BN layer and a convolutional layer; the specific procedure is as follows:
the aligned LR feature map sequence $(F_1, \dots, F_c, \dots, F_n)$ is input into the temporal attention module, and after the BN layer and the convolutional layer a single-channel feature map $(F_1^{a}, \dots, F_c^{a}, \dots, F_n^{a})$ is obtained for each frame and the maps are concatenated; weights are then computed along the time dimension with a softmax function, yielding attention weight maps $(M_1, \dots, M_c, \dots, M_n)$ in which the $n$ weights at each position sum to 1; finally the weight maps are multiplied with the aligned LR feature maps to obtain the attended LR feature map sequence $(F_1^{at}, \dots, F_c^{at}, \dots, F_n^{at})$, namely:

$F_i^{at} = M_i \odot F_i, \quad i \in \{1, \dots, n\}.$
7. The video super-resolution method based on temporal attention and a cyclic feedback network of claim 6, wherein in step S6 the cyclic feedback super-resolution process specifically comprises:
S61: after the LR feature map sequence is reordered, inputting the first feature map of the reordered sequence into the cyclic feedback module for the first cyclic feedback super-resolution pass to obtain a super-resolution feature map;
S62: reconstructing the super-resolution feature map obtained in this pass to obtain this pass's super-resolution residual information of the target frame, and adding it to the preliminary super-resolution result of the target frame to obtain this pass's super-resolution result of the target frame;
S63: following the order of the LR feature map sequence, inputting the corresponding feature map and the super-resolution feature map output by the previous pass into the cyclic feedback module for the next cyclic feedback super-resolution pass until the cycles are finished, the per-pass super-resolution results of the target frame forming the target frame's cyclic feedback super-resolution result sequence.
8. The video super-resolution method based on temporal attention and a cyclic feedback network of claim 7, wherein the step S61 specifically comprises:
the LR feature map sequence $(F_1^{at}, \dots, F_c^{at}, \dots, F_n^{at})$ is reordered from near to far by distance from the target frame, and the feature map of the target frame is reused in the middle and at the end of the sequence to guide the residual information extraction of the cyclic feedback super-resolution module; renumbering the result gives the sequence

$(\tilde{F}_1^{at}, \tilde{F}_2^{at}, \dots, \tilde{F}_{n+2}^{at}).$

The reordered LR feature map sequence $(\tilde{F}_1^{at}, \dots, \tilde{F}_{n+2}^{at})$ is input into the cyclic feedback module, which performs n+2 cyclic feedback super-resolution passes in the order of the feature maps; the input of each pass is the LR video frame feature map corresponding to that pass together with the feature map output at the end of the previous pass, and the output is this pass's super-resolution feature map, namely:

$F_{out}^{n} = f_{FB}(\tilde{F}_n^{at}, F_{out}^{n-1}),$

where $F_{out}^{n}$ denotes the super-resolution feature map of the target frame output by the $n$-th feedback pass, $f_{FB}(\cdot)$ denotes the cyclic feedback super-resolution module, and $F_{out}^{n-1}$ denotes the super-resolution feature map of the target frame output by the $(n-1)$-th pass; in the first pass the target frame's own feature map is used, namely:

$F_{out}^{1} = f_{FB}(\tilde{F}_1^{at}, F_c^{at}).$
9. The video super-resolution method based on temporal attention and a cyclic feedback network of claim 8, wherein in step S62 the super-resolution feature map obtained by each feedback pass is reconstructed to obtain that pass's reconstructed super-resolution residual information of the target frame, which is added to the preliminary super-resolution result of the target frame to obtain that pass's super-resolution result of the target frame; specifically:
the target frame's super-resolution feature map $F_{out}^{n}$ for this pass is input into the super-resolution reconstruction module for reconstruction, which yields the target frame's reconstructed residual information for this pass, namely:

$I_{res}^{n} = f_{RB}(F_{out}^{n}),$

where $I_{res}^{n}$ denotes the super-resolution reconstructed residual information of the target frame at the $n$-th pass and $f_{RB}(\cdot)$ denotes the reconstruction module; the target frame's super-resolution reconstructed residual information is added, pixel-wise at corresponding positions, to the preliminary super-resolution result of the target frame obtained in step S3, yielding this pass's super-resolved video frame of the target frame, namely:

$I_{SR}^{n} = I_{res}^{n} + f_{up}(I_C),$

where $I_{SR}^{n}$ denotes the super-resolved video frame of the target frame at the $n$-th pass, $f_{up}(\cdot)$ denotes the upsampling operation, and $I_C$ denotes the target frame.
10. The video super-resolution method based on temporal attention and a cyclic feedback network of claim 9, wherein in step S7 the loss function is an L2-norm loss, specifically:

$L = \sum_{n=1}^{N} W_n \left\| I_{SR}^{n} - I_{HR} \right\|_2^2,$

where $N$ is the number of feedback passes, $W_n$ denotes the weight, within the total loss, of the loss computed on the target frame's super-resolution result $I_{SR}^{n}$ from the $n$-th pass, and $I_{HR}$ denotes the ground truth of the target frame;
the video super-resolution training data set is then used to iteratively train the constructed super-resolution network model, finally yielding the trained super-resolution network model.
CN202110838280.3A 2021-07-23 2021-07-23 Video super-resolution method based on time attention and cyclic feedback network Active CN113610707B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110838280.3A CN113610707B (en) 2021-07-23 2021-07-23 Video super-resolution method based on time attention and cyclic feedback network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110838280.3A CN113610707B (en) 2021-07-23 2021-07-23 Video super-resolution method based on time attention and cyclic feedback network

Publications (2)

Publication Number Publication Date
CN113610707A (en) 2021-11-05
CN113610707B (en) 2024-02-09

Family

ID=78305300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110838280.3A Active CN113610707B (en) 2021-07-23 2021-07-23 Video super-resolution method based on time attention and cyclic feedback network

Country Status (1)

Country Link
CN (1) CN113610707B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114598874A (en) * 2022-01-20 2022-06-07 中国科学院自动化研究所 Video quantization coding and decoding method, device, equipment and storage medium
CN114612305A (en) * 2022-03-14 2022-06-10 中国科学技术大学 Event-driven video super-resolution method based on stereogram modeling

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111260560A (en) * 2020-02-18 2020-06-09 中山大学 Multi-frame video super-resolution method fused with attention mechanism
WO2020220926A1 (en) * 2019-04-28 2020-11-05 北京灵汐科技有限公司 Multimedia data identification method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020220926A1 (en) * 2019-04-28 2020-11-05 北京灵汐科技有限公司 Multimedia data identification method and device
CN111260560A (en) * 2020-02-18 2020-06-09 中山大学 Multi-frame video super-resolution method fused with attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XINTAO WANG et al.: "EDVR: Video Restoration with Enhanced Deformable Convolutional Networks", Retrieved from the Internet <URL:arxiv> *
ZHEN LI: "Feedback Network for Image Super-Resolution", Retrieved from the Internet <URL:arxiv> *
TAO Zhuang; LIAO Xiaodong; SHEN Jianghong: "Image super-resolution reconstruction algorithm based on a dual-path feedback network", Computer Systems & Applications, no. 04 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114598874A (en) * 2022-01-20 2022-06-07 中国科学院自动化研究所 Video quantization coding and decoding method, device, equipment and storage medium
CN114598874B (en) * 2022-01-20 2022-12-06 中国科学院自动化研究所 Video quantization coding and decoding method, device, equipment and storage medium
CN114612305A (en) * 2022-03-14 2022-06-10 中国科学技术大学 Event-driven video super-resolution method based on stereogram modeling
CN114612305B (en) * 2022-03-14 2024-04-02 中国科学技术大学 Event-driven video super-resolution method based on stereogram modeling

Also Published As

Publication number Publication date
CN113610707B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN113362223B (en) Image super-resolution reconstruction method based on attention mechanism and two-channel network
CN111179167B (en) Image super-resolution method based on multi-stage attention enhancement network
CN109360171A (en) A kind of real-time deblurring method of video image neural network based
US11727541B2 (en) Video super resolution method
Zhang et al. NTIRE 2023 challenge on image super-resolution (x4): Methods and results
CN112365403B (en) Video super-resolution recovery method based on deep learning and adjacent frames
CN113610707A (en) Video super-resolution method based on time attention and cyclic feedback network
CN113538243B (en) Super-resolution image reconstruction method based on multi-parallax attention module combination
CN112001843A (en) Infrared image super-resolution reconstruction method based on deep learning
CN103971354A (en) Method for reconstructing low-resolution infrared image into high-resolution infrared image
CN114757862B (en) Image enhancement progressive fusion method for infrared light field device
CN113409190B (en) Video super-resolution method based on multi-frame grouping and feedback network
Song et al. Dual perceptual loss for single image super-resolution using esrgan
CN114119694A (en) Improved U-Net based self-supervision monocular depth estimation algorithm
CN113379606A (en) Face super-resolution method based on pre-training generation model
CN112991167A (en) Aerial image super-resolution reconstruction method based on layered feature fusion network
Xing et al. A small object detection solution by using super-resolution recovery
CN116188778A (en) Double-sided semantic segmentation method based on super resolution
Pang et al. Video super-resolution using a hierarchical recurrent multireceptive-field integration network
CN114881849A (en) Depth image super-resolution reconstruction method combining monocular depth estimation
CN113538456B (en) Image soft segmentation and background replacement system based on GAN network
Song et al. Transformer-Based Video Deinterlacing Method
Su et al. Image Denoising Algorithm Based on Multi-Scale Fusion and Adaptive Attention Mechanism
Han et al. Dual discriminators generative adversarial networks for unsupervised infrared super-resolution
Fkih et al. Super-Resolution of UAVs Thermal Images Guided by Visible Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant