CN112381866B - Attention mechanism-based video bit enhancement method - Google Patents
Attention mechanism-based video bit enhancement method
- Publication number
- CN112381866B CN112381866B CN202011166047.7A CN202011166047A CN112381866B CN 112381866 B CN112381866 B CN 112381866B CN 202011166047 A CN202011166047 A CN 202011166047A CN 112381866 B CN112381866 B CN 112381866B
- Authority
- CN
- China
- Prior art keywords
- video
- bit depth
- frames
- multiplied
- attention mechanism
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06T7/55 — Depth or shape recovery from multiple images (G: Physics; G06: Computing, calculating or counting; G06T: Image data processing or generation, in general; G06T7/00: Image analysis; G06T7/50: Depth or shape recovery)
- H04N17/00 — Diagnosis, testing or measuring for television systems or their details (H: Electricity; H04: Electric communication technique; H04N: Pictorial communication, e.g. television)
- G06T2207/10016 — Video; Image sequence (G06T2207/00: Indexing scheme for image analysis or image enhancement; G06T2207/10: Image acquisition modality)
- G06T2207/20172 — Image enhancement details (G06T2207/00: Indexing scheme for image analysis or image enhancement; G06T2207/20: Special algorithmic details)
Abstract
A video bit enhancement method based on an attention mechanism comprises the following steps: establishing a video bit enhancement model based on an attention mechanism; randomly selecting a set number of original high bit depth video sequence groups from an image enhancement database to construct a training data set; training the attention-based video bit enhancement model with the constructed training data set; selecting video sequence groups from the image enhancement database to form a test set and testing the trained attention-based video bit enhancement model; applying a zero-padding algorithm to the video signal to be enhanced to obtain a high bit depth video signal, inputting the high bit depth video signal into the tested attention-based video bit enhancement model in groups of 5 frames, and adding each output result of the model to the intermediate frame of the correspondingly input video sequence group, so that the enhanced intermediate frames are obtained in turn. The invention generates a semantic attention matrix related to the target feature map at the feature level, thereby improving the perceived visual quality.
Description
Technical Field
The invention relates to a video bit enhancement method, and in particular to a video bit enhancement method based on an attention mechanism.
Background
Multimedia resources such as images and videos carry abundant information and allow people to quickly learn about things happening elsewhere. Since the advent of video recording and display devices, efforts have been directed to capturing and displaying higher quality images and videos. In pursuit of a better visual experience, High Dynamic Range (HDR) technology has been proposed, which uses a higher dynamic range and a greater bit depth (typically 10 or 12 bits) to represent each pixel. Images and videos with a high dynamic range can exhibit richer colors, finer color transitions and more realistic texture details. With the development of technology, ultra-high definition displays and HDR displays are becoming popular choices. However, the vast majority of images and videos previously captured with older cameras have a bit depth of only 8 bits, and when they are presented on HDR displays, false contours, color distortions [1] and similar artifacts occur, which degrade the visual experience. Therefore, bit depth enhancement of low bit depth images and videos is of great significance and value for improving the human visual experience.
Early bit depth enhancement methods such as Zero Padding (ZP), Multiplication by an Ideal Gain (MIG) and Bit Replication (BR) [2] operate on independent pixels; although the computation is simple and fast, the false contour effect remains obvious. Later, some difference-based methods were proposed, such as the Contour Region Reconstruction algorithm (CRR) [3], the Content Adaptive image bit-depth enhancement algorithm (CA) [4] and the adaptive de-quantization algorithm using intensity potential (IPAD) [5]. These methods consider the context information around a pixel and can better eliminate the false contour effect, but the reconstructed image content suffers from blurring, loss of detail and similar phenomena. In recent years, neural networks have achieved remarkable success in the field of computer vision and have demonstrated strong learning and adaptation abilities on specific tasks. Deep learning has therefore also been introduced into the field of bit depth enhancement, and the convolutional-neural-network-based image bit depth enhancement algorithm (BE-CNN) [6], the cascaded bit depth enhancement algorithm (BE-CALF: Bit-depth Enhancement by Concatenating All Level Features of DNN) [7] and the learning-based bit-depth expansion method (BitNet) [8] all achieve good performance.
The above bit enhancement methods are all image-oriented. If they are applied directly to a low bit depth video sequence, the redundant information in the preceding and following frames cannot be well utilized, and the generated high bit depth video sequence suffers from inter-frame flicker and similar phenomena.
Disclosure of Invention
The invention aims to solve the technical problem of providing a video bit enhancement method based on an attention mechanism, which can quickly reconstruct a high-bit intermediate frame with better subjective quality and objective quality.
The technical scheme adopted by the invention is as follows: a video bit enhancement method based on an attention mechanism comprises the following steps:
1) Firstly, the bit depth of the video signal to be enhanced is called the low bit depth, the bit depth of the enhanced video signal is called the high bit depth, and the difference between the original high bit depth image and the high bit depth image obtained by applying a zero-padding algorithm to the low bit depth image is called a residual map; a video bit enhancement model based on an attention mechanism is established;
2) Randomly selecting a set number of original video sequence groups with high bit depth from an image enhancement database to construct a training data set;
3) Training a video bit enhancement model based on an attention mechanism by using the constructed training data set;
4) Selecting a video sequence group from an image enhancement database to form a test set, and testing the trained video bit enhancement model based on the attention mechanism;
5) A zero-padding algorithm is applied to the video signal to be enhanced to obtain a high bit depth video signal; the high bit depth video signal obtained by the zero-padding algorithm is input, in groups of 5 frames, into the tested attention-based video bit enhancement model, and the output result of the model is added to the intermediate frame of the correspondingly input video sequence group, so that the enhanced intermediate frames are obtained in turn.
The video bit enhancement method based on the attention mechanism of the invention has the following advantages:
1. The invention takes an encoder-decoder network as the framework of the network, and a global attention alignment module is added before the encoder network. This module computes the correlation among the video sequence frames to generate an attention map, amplifies feature points with high correlation, and performs video alignment implicitly.
2. A target-guided semantic attention module is added between the encoder and the decoder network. This module takes the feature map of the target frame as guidance and generates, at the feature level, a semantic attention matrix related to the target feature map, thereby improving the perceived visual quality.
Drawings
FIG. 1 is a block diagram of a video bit enhancement method based on an attention mechanism according to the present invention;
FIG. 2 is a network overall framework;
FIG. 3 is a global attention alignment module;
FIG. 4 is a semantic attention module for target guidance.
Detailed Description
The following provides a detailed description of a video bit enhancement method based on attention mechanism in accordance with the present invention with reference to the following embodiments and accompanying drawings.
As shown in fig. 1, a video bit enhancement method based on attention mechanism of the present invention includes the following steps:
1) Firstly, the bit depth of the video signal to be enhanced is called the low bit depth, the bit depth of the enhanced video signal is called the high bit depth, and the difference between the original high bit depth image and the high bit depth image obtained by applying a zero-padding algorithm to the low bit depth image is called a residual map; a video bit enhancement model based on an attention mechanism is established;
the video bit enhancement model based on the attention mechanism comprises the following components connected in sequence: global attention alignment module 1, encoder 2, target-guided semantic attention module 3 and decoder 4, wherein,
the input end of the global attention alignment module 1 receives 5 continuous video frames for capturing long-distance dependence between frames and in frames, and outputs the 5 continuous video frames after implicit alignment;
the encoder 2 receives 5 continuous video frames after implicit alignment, respectively and simultaneously extracts spatial features for each frame, and respectively outputs a feature map containing intra-frame spatial information of the corresponding frame;
the target-guided semantic attention module 3 receives the 5 feature maps output by the encoder 2, performs spatiotemporal feature fusion to obtain a feature map containing spatiotemporal feature information, and acquires feature information similar to the feature map of the intermediate frame output by the encoder 2 from the feature map and outputs the feature information to the decoder 4;
the decoder 4 reconstructs the received feature information step by step into a residual map.
Wherein,
as shown in fig. 3, the global attention alignment module 1 includes:
(1.1) The 5 consecutive video frames are cascaded in the channel direction to obtain a signal of dimension TC × H × W, denoted X, where T represents the number of consecutive frames, C represents the number of channels per frame, and H and W represent the height and width of the input video frames;
(1.2) X is sent into three 1 × 1 convolution kernels respectively for linear transformation to obtain the linearly transformed signals Q, K and V, which are then rearranged into two-dimensional matrices of dimension TC × HW, denoted Q², K² and V², where the superscript 2 indicates that the matrix is two-dimensional;
(1.3) the similarity matrix A of Q² and K² is obtained as A = (Q²)ᵀ ⊗ K², where ⊗ denotes matrix multiplication and (·)ᵀ denotes matrix transposition; the weighted summation O = A ⊗ (V²)ᵀ is then computed, whose dimension is HW × TC; O is transposed and rearranged into a matrix of dimension TC × H × W, denoted O³;
(1.4) O³ is passed through a 1 × 1 convolution kernel, rearranged to dimension T × C × H × W, and then connected by a residual connection to the input 5 consecutive video frames, yielding the 5 consecutive video frames after implicit alignment.
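The computation in steps (1.1)-(1.4) can be illustrated with a minimal PyTorch sketch. The class name, the batch dimension, the softmax normalisation of the similarity matrix and the single-channel input are illustrative assumptions rather than the exact implementation of the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalAttentionAlignment(nn.Module):
    """Implicit alignment of T consecutive frames via non-local attention (steps 1.1-1.4)."""

    def __init__(self, num_frames=5, channels=1):
        super().__init__()
        tc = num_frames * channels
        # three 1x1 convolutions giving the linearly transformed signals Q, K, V (step 1.2)
        self.to_q = nn.Conv2d(tc, tc, kernel_size=1)
        self.to_k = nn.Conv2d(tc, tc, kernel_size=1)
        self.to_v = nn.Conv2d(tc, tc, kernel_size=1)
        # 1x1 convolution applied before the residual connection (step 1.4)
        self.proj = nn.Conv2d(tc, tc, kernel_size=1)

    def forward(self, frames):
        # frames: (B, T, C, H, W); cascade along the channel direction (step 1.1)
        b, t, c, h, w = frames.shape
        x = frames.reshape(b, t * c, h, w)
        q = self.to_q(x).reshape(b, t * c, h * w)                  # Q2: (B, TC, HW)
        k = self.to_k(x).reshape(b, t * c, h * w)                  # K2: (B, TC, HW)
        v = self.to_v(x).reshape(b, t * c, h * w)                  # V2: (B, TC, HW)
        # similarity matrix A = (Q2)^T x K2 (step 1.3); softmax is an assumed normalisation
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # (B, HW, HW)
        out = torch.bmm(attn, v.transpose(1, 2))                   # O = A x (V2)^T: (B, HW, TC)
        out = out.transpose(1, 2).reshape(b, t * c, h, w)          # O3: back to (B, TC, H, W)
        # residual connection with the input frames (step 1.4)
        out = self.proj(out) + x
        return out.reshape(b, t, c, h, w)
```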
As shown in fig. 2, the encoder 2 includes 5 convolution branches corresponding to the 5 consecutive video frames; each convolution branch is formed by sequentially connecting 5 convolutional layers in series, and each convolutional layer contains a 3 × 3 convolution kernel and a PReLU activation function.
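A corresponding sketch of one encoder branch is given below; the channel width of 64 and the single-channel (luminance) input are illustrative assumptions.

```python
import torch.nn as nn


class EncoderBranch(nn.Module):
    """One of the 5 parallel branches: five 3x3 convolutional layers, each followed by PReLU."""

    def __init__(self, in_channels=1, width=64):
        super().__init__()
        layers, channels = [], in_channels
        for _ in range(5):
            layers += [nn.Conv2d(channels, width, kernel_size=3, padding=1), nn.PReLU()]
            channels = width
        self.layers = nn.Sequential(*layers)

    def forward(self, frame):
        # frame: (B, C, H, W) -> feature map containing intra-frame spatial information
        return self.layers(frame)
```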
As shown in fig. 4, the target-guided semantic attention module 3 includes:
(3.1) The 5 feature maps output by the encoder 2 are received; the dimension of each feature map is Ch × H × W, where Ch represents the number of channels of each feature map and H and W represent the height and width of the feature map; the 5 feature maps are cascaded in the channel direction to become a feature map of dimension 5Ch × H × W;
(3.2) the spatio-temporal information is then further fused through a 3 × 3 convolution kernel to obtain a new feature map F, whose dimension is Ch × H × W;
(3.3) the new feature map F is rearranged into a two-dimensional matrix, denoted F², of dimension Ch × HW; the intermediate feature map among the 5 feature maps received from the encoder 2, denoted G, is likewise rearranged into a two-dimensional matrix, denoted G², of dimension Ch × HW;
(3.4) the similarity matrix A of G² and F² is obtained as A = (G²)ᵀ ⊗ F², where ⊗ denotes matrix multiplication and (·)ᵀ denotes matrix transposition; the weighted summation O = A ⊗ (F²)ᵀ is then computed, whose dimension is HW × Ch; after transposition it is rearranged into dimension Ch × H × W and denoted F³;
(3.5) F³ and F are connected by a residual connection and then sent into a 3 × 3 convolution kernel to extract features.
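A minimal PyTorch sketch of steps (3.1)-(3.5) follows; the class name, the softmax normalisation of the similarity matrix and the channel width are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TargetGuidedSemanticAttention(nn.Module):
    """Fuses the 5 encoder feature maps and re-weights them with the intermediate-frame features."""

    def __init__(self, channels=64, num_frames=5):
        super().__init__()
        # 3x3 convolution that fuses the cascaded spatio-temporal features (step 3.2)
        self.fuse = nn.Conv2d(num_frames * channels, channels, kernel_size=3, padding=1)
        # 3x3 convolution applied after the residual connection (step 3.5)
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feats):
        # feats: list of 5 feature maps, each (B, Ch, H, W); the middle one guides the attention
        b, ch, h, w = feats[0].shape
        fused = self.fuse(torch.cat(feats, dim=1))                     # F: (B, Ch, H, W), step 3.2
        f = fused.reshape(b, ch, h * w)                                # F2: (B, Ch, HW), step 3.3
        target = feats[len(feats) // 2].reshape(b, ch, h * w)          # G2: intermediate-frame features
        # similarity matrix A = (G2)^T x F2 (step 3.4); softmax normalisation is an assumption
        attn = F.softmax(torch.bmm(target.transpose(1, 2), f), dim=-1)   # (B, HW, HW)
        out = torch.bmm(attn, f.transpose(1, 2))                         # O = A x (F2)^T: (B, HW, Ch)
        out = out.transpose(1, 2).reshape(b, ch, h, w)                   # F3: (B, Ch, H, W)
        # residual connection followed by a 3x3 convolution (step 3.5)
        return self.out_conv(out + fused)
```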
As shown in fig. 2, the decoder 4 is formed by sequentially connecting 5 transposed convolutional layers in series, each transposed convolutional layer containing a transposed convolution kernel and a PReLU activation function, wherein the input of the second transposed convolutional layer is the sum of the output of the first transposed convolutional layer and the outputs of the fourth convolutional layer in each branch of the encoder 2, and the input of the fourth transposed convolutional layer is the sum of the output of the third transposed convolutional layer and the outputs of the second convolutional layer in each branch of the encoder 2.
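The skip-connected decoder can be sketched as follows; the channel width, the stride-1 transposed convolutions, and the assumption that the skip features of the 5 encoder branches have already been summed into single tensors are illustrative choices, not the exact design of the invention.

```python
import torch.nn as nn


class Decoder(nn.Module):
    """Five transposed convolutional layers with PReLU and two encoder skip connections."""

    def __init__(self, width=64, out_channels=1):
        super().__init__()
        self.deconvs = nn.ModuleList([
            nn.Sequential(
                nn.ConvTranspose2d(width, out_channels if i == 4 else width,
                                   kernel_size=3, padding=1),
                nn.PReLU())
            for i in range(5)])

    def forward(self, x, enc_layer4, enc_layer2):
        # enc_layer4 / enc_layer2: summed outputs of the 4th / 2nd convolutional layer of the branches
        x = self.deconvs[0](x)
        x = self.deconvs[1](x + enc_layer4)   # 2nd input = 1st output + encoder layer-4 features
        x = self.deconvs[2](x)
        x = self.deconvs[3](x + enc_layer2)   # 4th input = 3rd output + encoder layer-2 features
        return self.deconvs[4](x)             # reconstructed residual map
```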
2) A set number of original high bit depth video sequence groups are randomly selected from an image enhancement database to construct a training data set: the original video sequence groups are quantized to the low bit depth, each video sequence group containing 5 consecutive video frames; the zero-padding algorithm is applied to the low bit depth video sequence groups to expand them back to the high bit depth; and the intermediate frame of the high bit depth video sequence expanded by the zero-padding algorithm is subtracted from the intermediate frame of the original video sequence group to obtain the real residual map, forming the training data set.
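The construction of one training sample can be sketched as follows, assuming 16-bit originals quantised to 4 bits as in the embodiment described later; the function name and the uint16/int32 dtypes are illustrative assumptions.

```python
import numpy as np


def make_training_sample(group_16bit, low_bits=4, high_bits=16):
    """group_16bit: (5, H, W) uint16 frames; returns (coarse 5-frame input, ground-truth residual)."""
    shift = high_bits - low_bits
    # quantise the original high bit depth frames to the low bit depth
    low = (group_16bit >> shift).astype(np.uint16)
    # zero-padding algorithm: append zeros in the least-significant bits to reach the high bit depth
    coarse = (low << shift).astype(np.uint16)
    # real residual map: original intermediate frame minus the zero-padded intermediate frame
    residual = group_16bit[2].astype(np.int32) - coarse[2].astype(np.int32)
    return coarse, residual
```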
3) The attention-based video bit enhancement model is trained with the constructed training data set: during training, the input of the network is the video sequence obtained by applying the zero-padding algorithm to the low bit depth video sequence groups of the training data set (i.e., expanded to the high bit depth), and the output is a residual map; the Mean Square Error (MSE) between the residual map generated by the network and the real residual map is used as the loss function, and an Adam optimizer is used to optimize the attention-based video bit enhancement model.
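A sketch of the training loop under these definitions is given below; the learning rate, number of epochs, batching and device are illustrative assumptions, while the MSE loss and the Adam optimizer follow the text.

```python
import torch
import torch.nn as nn


def train(model, loader, epochs=100, lr=1e-4, device="cuda"):
    """loader yields (coarse_frames, true_residual): (B, 5, C, H, W) and (B, C, H, W) tensors."""
    model = model.to(device)
    criterion = nn.MSELoss()                                     # mean square error between residual maps
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)      # Adam optimizer
    for epoch in range(epochs):
        for coarse, residual in loader:
            coarse, residual = coarse.to(device), residual.to(device)
            pred = model(coarse)                                 # the network predicts the residual map
            loss = criterion(pred, residual)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```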
4) Video sequence groups are selected from the image enhancement database to form a test set, and the trained attention-based video bit enhancement model is tested: the video sequence groups forming the test set are quantized to the low bit depth and expanded to the high bit depth by the zero-padding algorithm; the high bit depth video sequences are input into the trained attention-based video bit enhancement model to obtain the residual map of the intermediate frame predicted by the model; the residual map is added to the intermediate frame of the high bit depth video sequence expanded by the zero-padding algorithm to obtain the reconstructed high bit depth intermediate frame, and the quality of the reconstructed high bit depth intermediate frame is evaluated. The evaluation adopts two indices: Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM).
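The evaluation can be sketched with scikit-image, which provides reference implementations of both indices; treating the frames as single-channel 16-bit arrays is an assumption for illustration.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def evaluate(reconstructed, original, bit_depth=16):
    """Both inputs are integer arrays of the same shape; returns (PSNR in dB, SSIM in [0, 1])."""
    data_range = 2 ** bit_depth - 1
    psnr = peak_signal_noise_ratio(original, reconstructed, data_range=data_range)
    ssim = structural_similarity(original, reconstructed, data_range=data_range)
    return psnr, ssim
```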
5) The zero-padding algorithm is applied to the video signal to be enhanced to obtain a high bit depth video signal; the high bit depth video signal obtained by the zero-padding algorithm is input, in groups of 5 frames, into the tested attention-based video bit enhancement model, and the output result of the model is added to the intermediate frame of the correspondingly input video sequence group, so that the enhanced intermediate frames are obtained in turn.
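Step 5) can be sketched as below; whether consecutive 5-frame groups overlap is not specified in the text, so a sliding window centred on each enhanced frame is assumed here, and the function name is illustrative.

```python
import torch


def enhance_video(model, coarse_frames, device="cuda"):
    """coarse_frames: (N, C, H, W) tensor of high bit depth frames produced by the zero-padding algorithm.
    Returns the enhanced intermediate frame for every sliding group of 5 frames."""
    model = model.to(device).eval()
    enhanced = []
    with torch.no_grad():
        for t in range(2, coarse_frames.shape[0] - 2):
            group = coarse_frames[t - 2:t + 3].unsqueeze(0).to(device)   # (1, 5, C, H, W)
            residual = model(group)                                       # predicted residual map
            # enhanced intermediate frame = predicted residual + coarse intermediate frame
            enhanced.append(residual.squeeze(0).cpu() + coarse_frames[t])
    return enhanced
```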
Example 1
The embodiment of the invention comprises the following steps:
101: from a 16-bit Sintel database [9] Randomly selecting 1000 original groups of video sequences, quantizing each group of 5 frame video frames to 4 bit depth, applying a zero-padding algorithm to the 4 bit depth video sequence to expand the 4 bit video sequence into a 16 bit depth video sequence, and calling the 16 bit depth video frame expanded by the zero-padding algorithm as a rough high bit depth video frame;
102: In this embodiment, an encoder-decoder is used as the basic network framework. A global attention alignment module is added at the head of the encoder; this module captures long-distance dependence by computing the correlation between the frames of the video sequence and performs implicit Motion Estimation and Motion Compensation (ME & MC). A target-guided semantic attention module is added at the junction of the encoder and the decoder: this module fuses the spatial features extracted by the encoder and then correlates the intermediate frame, taken as the guidance feature, with the fused feature map containing the spatio-temporal semantic features to obtain a semantic attention matrix. The semantic attention matrix is multiplied with the feature map containing the spatio-temporal semantic features to obtain the transformed feature map. This module helps the network focus more on the information related to the target frame at the semantic level, improving the perceived quality.
103: the coarse high bit depth video sequence is input into the network, and a residual map is generated. For original high bit depth video sequence intermediate frame and coarse high bit depth videoAnd performing difference on the inter frames to obtain a real residual error image. Using Mean Square Error loss (MSE) as a loss function for network generated residual maps and true residual maps, using an Adam optimizer [11] And optimizing the video bit enhancement model based on the attention mechanism.
104: in the testing phase, 50 sets of video sequences with 16-bit depth different from the training set are randomly selected from the Sintel dataset, and from the Tears of Steel (TOS) dataset [9] 30 groups of video sequences with 16 bit depths are selected. The test set is quantized to 4 bit depth and then back quantized to a coarse high bit depth video sequence using a zero-padding algorithm. And loading the trained model parameters to a video bit enhancement model based on an attention mechanism, then transmitting the rough high bit depth video sequence to the model to generate a residual map, and adding the residual map and the intermediate frame of the rough high bit depth video to obtain a reconstructed high bit depth map. Peak signal-to-noise Ratio (PSNR) and Structural Similarity (SSIM) are used [12] These two objective evaluation criteria evaluate the test results to verify the effectiveness of the invention.
In summary, through steps 101 to 104 the embodiment of the present invention designs a video bit depth enhancement method based on an attention mechanism. A global attention alignment module is introduced into a classical encoder-decoder network, and a target-guided semantic attention module is added. The global attention module plays the same role as motion estimation and motion compensation and can capture long-distance dependence to acquire, from the video sequence, auxiliary information useful for reconstructing the target frame. This avoids the two-stage processing of motion estimation and motion compensation, keeping the computational complexity and running time low. The target-guided semantic attention module takes the feature map of the target frame as guidance at the semantic level and generates a spatio-temporal feature map highly related to the feature map of the target frame. The invention realizes end-to-end video bit depth enhancement in a single stage, avoids the high computational complexity of motion compensation, and achieves better reconstruction quality.
Example 2
The scheme of Example 1 is evaluated below in combination with specific experimental data, as described in detail in the following:
301: data composition
The test set consists of 50 groups of 16 bit depth consecutive video frames randomly drawn from the Sintel database, which do not overlap with the training set, and 30 groups of 16 bit depth consecutive video frames randomly drawn from the TOS database; each group contains 5 frames.
302: evaluation criterion
The invention mainly adopts two evaluation indexes to evaluate the quality of the reconstructed high bit depth video frame:
Peak Signal-to-Noise Ratio (PSNR) is a commonly used objective method for assessing the quality of an image.
The Structural Similarity Index (SSIM) [12] is an index measuring the structural similarity of two images. It measures the similarity from the three aspects of luminance, contrast and structure, which is more consistent with the visual characteristics of the human eye and better reflects the subjective quality of an image. The index ranges from 0 to 1; the higher the score, the more similar the reconstructed high bit depth image is to the original high bit depth image, and the better the reconstruction quality.
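For reference, the standard definitions of the two indices are as follows, where MAX is the peak value determined by the bit depth, MSE is the mean square error between the two images, and c1, c2 are small stabilising constants:

```latex
\mathrm{PSNR} = 10\,\log_{10}\!\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}, \qquad
\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}
                          {(\mu_x^{2} + \mu_y^{2} + c_1)(\sigma_x^{2} + \sigma_y^{2} + c_2)}
```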
303: Comparison algorithms
The embodiment of the invention is compared with 10 bit depth enhancement algorithms, including 8 traditional image bit enhancement methods, 1 neural-network-based image bit enhancement method and 1 neural-network-based video bit enhancement method.
The 8 traditional image bit enhancement methods are: 1) the Zero Padding algorithm (ZP); 2) the Multiplication by an Ideal Gain algorithm (MIG); 3) the Bit Replication algorithm (BR) [2]; 4) the Minimum Risk based Classification algorithm (MRC) [10]; 5) the Contour Region Reconstruction algorithm (CRR) [3]; 6) the Content Adaptive image bit depth enhancement algorithm (CA) [4]; 7) the Maximum a Posteriori Estimation of AC Signal algorithm (ACDC) [14]; 8) the adaptive de-quantization algorithm using intensity potential (IPAD) [5].
The neural-network-based image bit enhancement method is the convolutional-neural-network-based image bit depth enhancement algorithm (BE-CNN) [6].
The neural-network-based video bit enhancement method is the video bit depth enhancement algorithm based on a spatio-temporal symmetric convolutional neural network (VBDE) [13].
Table 1 lists the test results of this method compared with the other ten methods on the Sintel test set and the TOS test set. On the Sintel test set, the PSNR of this method reaches 41.5293 dB and the SSIM reaches 0.9672, clearly higher than the performance of the other methods. The TOS dataset and the Sintel dataset are two distinct datasets with large content differences, and the TOS dataset contains more, and more complex, scenes and content. On the TOS test set, the PSNR of this method reaches 39.3155 dB and the SSIM reaches 0.9572, showing good generality. These tests fully demonstrate the effectiveness of the method.
TABLE 1
Reference documents
[1] Wan P, Au O C, Tang K, et al. From 2d extrapolation to 1d interpolation: Content adaptive image bit-depth expansion[C]//2012 IEEE International Conference on Multimedia and Expo. IEEE, 2012: 170-175.
[2] Ulichney R A, Cheung S. Pixel bit-depth increase by bit replication[C]//Color Imaging: Device-Independent Color, Color Hardcopy, and Graphic Arts III. International Society for Optics and Photonics, 1998, 3300: 232-241.
[3] Cheng C H, Au O C, Liu C H, et al. Bit-depth expansion by contour region reconstruction[C]//2009 IEEE International Symposium on Circuits and Systems. IEEE, 2009: 944-947.
[4] Wan P, Au O C, Tang K, et al. From 2d extrapolation to 1d interpolation: Content adaptive image bit-depth expansion[C]//2012 IEEE International Conference on Multimedia and Expo. IEEE, 2012: 170-175.
[5] Liu J, Zhai G, Liu A, et al. IPAD: Intensity potential for adaptive de-quantization[J]. IEEE Transactions on Image Processing, 2018, 27(10): 4860-4872.
[6] Liu J, Sun W, Liu Y. Bit-depth enhancement via convolutional neural network[C]//International Forum on Digital TV and Wireless Multimedia Communications. Springer, Singapore, 2017: 255-264.
[7] Liu J, Sun W, Su Y, et al. BE-CALF: bit-depth enhancement by concatenating all level features of DNN[J]. IEEE Transactions on Image Processing, 2019, 28(10): 4926-4940.
[8] Byun J, Shim K, Kim C. BitNet: Learning-Based Bit-Depth Expansion[C]//Asian Conference on Computer Vision. Springer, Cham, 2018: 67-82.
[9] Xiph.Org Foundation. https://www.xiph.org/, 2016.
[10] Mittal G, Jakhetiya V, Jaiswal S P, et al. Bit-depth expansion using minimum risk based classification[C]//2012 Visual Communications and Image Processing. IEEE, 2012: 1-5.
[11] Kingma D P, Ba J. Adam: A method for stochastic optimization[J]. arXiv preprint arXiv:1412.6980, 2014.
[12] Wang Z, Bovik A C, Sheikh H R, et al. Image quality assessment: from error visibility to structural similarity[J]. IEEE Transactions on Image Processing, 2004, 13(4): 600-612.
[13] Liu J, Liu P, Su Y, et al. Spatiotemporal symmetric convolutional neural network for video bit-depth enhancement[J]. IEEE Transactions on Multimedia, 2019, 21(9): 2397-2406.
[14] Wan P, Cheung G, Florencio D, et al. Image bit-depth enhancement via maximum a posteriori estimation of AC signal[J]. IEEE Transactions on Image Processing, 2016, 25(6): 2896-2909.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-mentioned serial numbers of the embodiments of the present invention are only for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (7)
1. A video bit enhancement method based on an attention mechanism is characterized by comprising the following steps:
1) Firstly, the bit depth of the video signal to be enhanced is called the low bit depth, the bit depth of the enhanced video signal is called the high bit depth, and the difference between the original high bit depth image and the high bit depth image obtained by applying a zero-padding algorithm to the low bit depth image is called a residual map; a video bit enhancement model based on an attention mechanism is established;
the video bit enhancement model based on the attention mechanism comprises the following components connected in sequence: a global attention alignment module (1), an encoder (2), a target-guided semantic attention module (3) and a decoder (4), wherein,
the input end of the global attention alignment module (1) receives 5 continuous video frames, is used for capturing long-distance dependence between frames and in frames, and outputs the 5 continuous video frames after implicit alignment;
the encoder (2) receives 5 continuous video frames after implicit alignment, simultaneously extracts spatial features of each frame respectively, and outputs a feature map containing intra-frame spatial information of the corresponding frame respectively;
the target-guided semantic attention module (3) receives 5 feature maps output by the encoder (2), performs space-time feature fusion to obtain a feature map containing space-time feature information, acquires feature information similar to the feature map of the intermediate frame output by the encoder (2) from the feature map and outputs the feature information to the decoder (4);
the decoder (4) gradually reconstructs the received characteristic information into a residual error map;
the global attention alignment module (1) comprises:
(1.1) cascading the 5 consecutive video frames in the channel direction to obtain a signal of dimension TC × H × W, denoted X, wherein T represents the number of consecutive frames, C represents the number of channels per frame, and H and W represent the height and width of the input video frames;
(1.2) sending X into three 1 × 1 convolution kernels respectively for linear transformation to obtain the linearly transformed signals Q, K and V, and then rearranging them into two-dimensional matrices of dimension TC × HW, denoted Q², K² and V², wherein the superscript 2 indicates that the matrix is two-dimensional;
(1.3) obtaining the similarity matrix A of Q² and K² as A = (Q²)ᵀ ⊗ K², wherein ⊗ denotes matrix multiplication and (·)ᵀ denotes matrix transposition; computing the weighted summation O = A ⊗ (V²)ᵀ, whose dimension is HW × TC; transposing O and rearranging it into a matrix of dimension TC × H × W, denoted O³;
(1.4) passing O³ through a 1 × 1 convolution kernel, rearranging it to dimension T × C × H × W, and then performing a residual connection with the input 5 consecutive video frames to obtain the 5 consecutive video frames after implicit alignment;
the target-guided semantic attention module (3) comprises:
(3.1) receiving the 5 feature maps output by the encoder (2), wherein the dimension of each feature map is Ch × H × W, Ch represents the number of channels of each feature map, and H and W represent the height and width of the feature maps; cascading the 5 feature maps in the channel direction to become a feature map of dimension 5Ch × H × W;
(3.2) then further fusing the spatio-temporal information through a 3 × 3 convolution kernel to obtain a new feature map F, whose dimension is Ch × H × W;
(3.3) rearranging the new feature map F into a two-dimensional matrix, denoted F², of dimension Ch × HW; taking the intermediate feature map among the 5 feature maps received from the encoder (2), denoted G, and likewise rearranging it into a two-dimensional matrix, denoted G², of dimension Ch × HW;
(3.4) obtaining the similarity matrix A of G² and F² as A = (G²)ᵀ ⊗ F², wherein ⊗ denotes matrix multiplication and (·)ᵀ denotes matrix transposition; computing the weighted summation O = A ⊗ (F²)ᵀ, whose dimension is HW × Ch; after transposition, rearranging it into dimension Ch × H × W, denoted F³;
(3.5) connecting F³ and F by a residual connection and then sending the result into a 3 × 3 convolution kernel to extract features;
2) Randomly selecting a set number of original video sequence groups with high bit depth from an image enhancement database to construct a training data set;
3) Training a video bit enhancement model based on an attention mechanism by using the constructed training data set;
4) Selecting a video sequence group from an image enhancement database to form a test set, and testing the trained video bit enhancement model based on the attention mechanism;
5) applying the zero-padding algorithm to the video signal to be enhanced to obtain a high bit depth video signal, inputting the high bit depth video signal obtained by the zero-padding algorithm, in groups of 5 frames, into the tested attention-based video bit enhancement model, and adding the output result of the attention-based video bit enhancement model to the intermediate frame of the correspondingly input video sequence group, so as to obtain the enhanced intermediate frames in turn.
2. The method according to claim 1, wherein the encoder (2) comprises 5 convolution branches corresponding to the 5 consecutive video frames, each convolution branch being formed by sequentially connecting 5 convolutional layers in series, and each convolutional layer comprising a 3 × 3 convolution kernel and a PReLU activation function.
3. The method according to claim 1, wherein the decoder (4) is formed by sequentially connecting 5 transposed convolutional layers in series, each transposed convolutional layer comprising a transposed convolution kernel and a PReLU activation function, wherein the input of the second transposed convolutional layer is the sum of the output of the first transposed convolutional layer and the outputs of the fourth convolutional layer in each branch of the encoder (2), and the input of the fourth transposed convolutional layer is the sum of the output of the third transposed convolutional layer and the outputs of the second convolutional layer in each branch of the encoder (2).
4. The method as claimed in claim 1, wherein step 2) comprises quantizing the original video sequence groups to the low bit depth, each video sequence group comprising 5 consecutive video frames, applying the zero-padding algorithm to the low bit depth video sequence groups to expand them into high bit depth video sequences, and subtracting the intermediate frame of the high bit depth video sequence expanded by the zero-padding algorithm from the intermediate frame of the original video sequence group to obtain the real residual map, so as to form the training data set.
5. The method according to claim 1, wherein in the training of step 3), the input of the network is a video sequence of the training data set expanded to the high bit depth by applying the zero-padding algorithm to the low bit depth video sequence group, and the output is a residual map; the mean square error between the residual map generated by the network and the real residual map is adopted as the loss function, and an Adam optimizer is used to optimize the attention-based video bit enhancement model.
6. The method as claimed in claim 1, wherein the step 4) includes quantizing the group of video sequences constituting the test set to a low bit depth, expanding the video sequences to a high bit depth by applying a zero-padding algorithm, inputting the video sequences with the high bit depth into a trained video bit enhancement model based on the attention mechanism to obtain a residual map of an intermediate frame predicted by the model, adding the residual map to the intermediate frame of the video sequences with the bit depth expanded by the zero-padding algorithm to obtain a reconstructed intermediate frame with the high bit depth, and evaluating the quality of the reconstructed intermediate frame with the high bit depth by using an evaluation method.
7. The method of claim 6, wherein the evaluation method adopts the peak signal-to-noise ratio and the structural similarity index.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011166047.7A CN112381866B (en) | 2020-10-27 | 2020-10-27 | Attention mechanism-based video bit enhancement method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011166047.7A CN112381866B (en) | 2020-10-27 | 2020-10-27 | Attention mechanism-based video bit enhancement method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112381866A CN112381866A (en) | 2021-02-19 |
CN112381866B true CN112381866B (en) | 2022-12-13 |
Family
ID=74576777
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011166047.7A Active CN112381866B (en) | 2020-10-27 | 2020-10-27 | Attention mechanism-based video bit enhancement method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112381866B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113066022B (en) * | 2021-03-17 | 2022-08-16 | 天津大学 | Video bit enhancement method based on efficient space-time information fusion |
CN113313682B (en) * | 2021-05-28 | 2023-03-21 | 西安电子科技大学 | No-reference video quality evaluation method based on space-time multi-scale analysis |
CN113507607B (en) * | 2021-06-11 | 2023-05-26 | 电子科技大学 | Compressed video multi-frame quality enhancement method without motion compensation |
CN113450280A (en) * | 2021-07-07 | 2021-09-28 | 电子科技大学 | Method for enhancing quality of compressed video by fusing space-time information from coarse to fine |
CN114582029B (en) * | 2022-05-06 | 2022-08-02 | 山东大学 | Non-professional dance motion sequence enhancement method and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111008938A (en) * | 2019-11-25 | 2020-04-14 | 天津大学 | Real-time multi-frame bit enhancement method based on content and continuity guidance |
CN111031315A (en) * | 2019-11-18 | 2020-04-17 | 复旦大学 | Compressed video quality enhancement method based on attention mechanism and time dependency |
- 2020-10-27: CN CN202011166047.7A patent/CN112381866B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111031315A (en) * | 2019-11-18 | 2020-04-17 | 复旦大学 | Compressed video quality enhancement method based on attention mechanism and time dependency |
CN111008938A (en) * | 2019-11-25 | 2020-04-14 | 天津大学 | Real-time multi-frame bit enhancement method based on content and continuity guidance |
Non-Patent Citations (1)
Title |
---|
Spatiotemporal symmetric convolutional neural network for video bit-depth enhancement; Jing Liu et al.; IEEE Transactions on Multimedia; 2019-09-30; Vol. 21, No. 9; full text *
Also Published As
Publication number | Publication date |
---|---|
CN112381866A (en) | 2021-02-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112381866B (en) | Attention mechanism-based video bit enhancement method | |
Liang et al. | Vrt: A video restoration transformer | |
CN111028150B (en) | Rapid space-time residual attention video super-resolution reconstruction method | |
CN109309834B (en) | Video compression method based on convolutional neural network and HEVC compression domain significant information | |
CN106709875B (en) | Compressed low-resolution image restoration method based on joint depth network | |
CN110751597B (en) | Video super-resolution method based on coding damage repair | |
CN111008938B (en) | Real-time multi-frame bit enhancement method based on content and continuity guidance | |
CN111260560B (en) | Multi-frame video super-resolution method fused with attention mechanism | |
CN113066022B (en) | Video bit enhancement method based on efficient space-time information fusion | |
CN110852964A (en) | Image bit enhancement method based on deep learning | |
WO2023000179A1 (en) | Video super-resolution network, and video super-resolution, encoding and decoding processing method and device | |
US20110317916A1 (en) | Method and system for spatial-temporal denoising and demosaicking for noisy color filter array videos | |
CN110889895A (en) | Face video super-resolution reconstruction method fusing single-frame reconstruction network | |
CN114066730B (en) | Video frame interpolation method based on unsupervised dual learning | |
Agustsson et al. | Extreme learned image compression with gans | |
CN114757828A (en) | Transformer-based video space-time super-resolution method | |
CN113592746A (en) | Method for enhancing quality of compressed video by fusing space-time information from coarse to fine | |
CN113850718A (en) | Video synchronization space-time super-resolution method based on inter-frame feature alignment | |
Zhang et al. | Spatial-temporal color video reconstruction from noisy CFA sequence | |
Liu et al. | Gated context model with embedded priors for deep image compression | |
CN112862675A (en) | Video enhancement method and system for space-time super-resolution | |
Li et al. | Extreme underwater image compression using physical priors | |
CN116012272A (en) | Compressed video quality enhancement method based on reconstructed flow field | |
CN112866668B (en) | Multi-view video reconstruction method based on GAN latent codes | |
CN114663306A (en) | Pyramid-based multi-level information fusion video bit depth enhancement method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |