CN112381866B - Attention mechanism-based video bit enhancement method - Google Patents

Attention mechanism-based video bit enhancement method

Info

Publication number
CN112381866B
CN112381866B
Authority
CN
China
Prior art keywords
video
bit depth
frames
multiplied
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011166047.7A
Other languages
Chinese (zh)
Other versions
CN112381866A (en)
Inventor
刘婧
杨紫雯
于洁潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011166047.7A
Publication of CN112381866A
Application granted
Publication of CN112381866B
Active legal status
Anticipated expiration legal status

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G06T7/55 - Depth or shape recovery from multiple images
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N17/00 - Diagnosis, testing or measuring for television systems or their details
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20172 - Image enhancement details

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Processing (AREA)

Abstract

A video bit enhancement method based on an attention mechanism comprises the following steps: establishing a video bit enhancement model based on an attention mechanism; randomly selecting a set number of original video sequence groups with high bit depth from an image enhancement database to construct a training data set; training the attention mechanism-based video bit enhancement model with the constructed training data set; selecting video sequence groups from the image enhancement database to form a test set, and testing the trained attention mechanism-based video bit enhancement model; and applying a zero-padding algorithm to the video signal to be enhanced to obtain a high bit depth video signal, sequentially inputting the high bit depth video signal, in groups of 5 frames, to the tested attention mechanism-based video bit enhancement model, and adding each output result to the intermediate frame of the correspondingly input video sequence group to obtain the enhanced intermediate frames in sequence. The invention generates a semantic attention matrix related to the target feature map at the feature level, thereby improving the perceptual visual quality.

Description

Attention mechanism-based video bit enhancement method
Technical Field
The invention relates to a video bit enhancement method. In particular to a video bit enhancement method based on an attention mechanism.
Background
Multimedia resources such as images and videos carry abundant information and allow people to quickly learn about things happening elsewhere. Since the advent of video recording and display devices, efforts have been directed at capturing and displaying higher quality images and videos. In pursuit of a better visual experience, High Dynamic Range (HDR) technology has been proposed, which uses a higher dynamic range and a greater bit depth (typically 10 or 12 bits) to represent each pixel. Images and videos with a high dynamic range can exhibit richer colors, finer color transitions, and more realistic texture details. With the development of technology, ultra-high definition displays and HDR displays are becoming popular choices. However, the vast majority of images and videos previously captured with older cameras have only 8 bits of bit depth, and when they are presented on HDR displays, false contours, color distortions [1] and similar artifacts appear, which harm the viewing experience. Thus, bit depth enhancement of low bit depth images and videos is of great significance and value for improving the human sensory experience.
Early bit depth enhancement methods such as Zero Padding (ZP), Ideal Gain Multiplication (MIG), and Bit Replication (BR) [2] operate on independent pixels; although simple and fast to compute, they leave obvious false contour artifacts. Later, some differential-based methods were proposed, such as the Contour Region Reconstruction algorithm (CRR) [3], the Content Adaptive image bit-depth expansion algorithm (CA) [4], and the adaptive de-quantization algorithm based on intensity potential (IPAD) [5]. These methods take the context around each pixel into account and can better remove false contours, but the reconstructed image content tends to be blurred and to lose details. In recent years, neural networks have achieved remarkable success in the field of computer vision and have demonstrated strong learning and adaptation abilities on specific tasks. Deep learning has therefore also been introduced into the field of bit depth enhancement: the convolutional-neural-network-based image bit depth enhancement algorithm (BE-CNN) [6], the bit-depth enhancement algorithm by concatenating all level features of DNN (BE-CALF) [7], and the learning-based bit-depth expansion method (BitNet) [8] all achieve good performance.
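As an illustration of how these pixel-wise methods differ, the following is a minimal NumPy sketch of ZP, MIG and BR for expanding 4-bit values to 16-bit; the bit depths and function names are chosen here for illustration and are not taken from the patent.

```python
import numpy as np

def zero_padding(x, low=4, high=16):
    # ZP: append (high - low) zero bits to each pixel value
    return x.astype(np.uint16) << (high - low)

def ideal_gain(x, low=4, high=16):
    # MIG: scale by the ratio of the maximum representable values
    gain = (2 ** high - 1) / (2 ** low - 1)
    return np.round(x.astype(np.float64) * gain).astype(np.uint16)

def bit_replication(x, low=4, high=16):
    # BR: repeat the low-bit pattern until the high bit depth is filled
    x = x.astype(np.uint32)
    out = np.zeros_like(x)
    shift = high
    while shift > 0:
        shift -= low
        if shift >= 0:
            out |= x << shift
        else:
            out |= x >> (-shift)
    return out.astype(np.uint16)

frame_4bit = np.random.randint(0, 16, size=(64, 64), dtype=np.uint8)
print(zero_padding(frame_4bit).max(), ideal_gain(frame_4bit).max(), bit_replication(frame_4bit).max())
```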
These bit enhancement methods are all image-oriented. If they are applied to a low bit depth video sequence, the redundant information in preceding and following frames cannot be well utilized, and the generated high-bit video sequence suffers from inter-frame flicker and similar artifacts.
Disclosure of Invention
The invention aims to solve the technical problem of providing a video bit enhancement method based on an attention mechanism, which can quickly reconstruct a high-bit intermediate frame with better subjective quality and objective quality.
The technical scheme adopted by the invention is as follows: a video bit enhancement method based on an attention mechanism comprises the following steps:
1) Firstly, the bit depth of the video signal to be enhanced is called the low bit depth, and the bit depth of the enhanced video signal is called the high bit depth; the difference between the original high bit depth image and the high bit depth image obtained by applying a zero-padding algorithm to the low bit depth image is called the residual map; a video bit enhancement model based on an attention mechanism is established;
2) Randomly selecting a set number of original video sequence groups with high bit depth from an image enhancement database to construct a training data set;
3) Training a video bit enhancement model based on an attention mechanism by using the constructed training data set;
4) Selecting a video sequence group from an image enhancement database to form a test set, and testing the trained video bit enhancement model based on the attention mechanism;
5) Applying a zero-padding algorithm to the video signal to be enhanced to obtain a high bit depth video signal, sequentially inputting the high bit depth video signal obtained by the zero-padding algorithm, in groups of 5 frames, to the tested attention mechanism-based video bit enhancement model, and adding each output result of the model to the intermediate frame of the correspondingly input video sequence group, thereby obtaining the enhanced intermediate frames in sequence.
The video bit enhancement method based on the attention mechanism has the advantages that:
1. The invention uses an encoder-decoder network as the backbone of the network and adds a global attention alignment module before the encoder network. This module computes the correlation among the frames of the video sequence to generate an attention map, amplifies feature points with high correlation, and performs video alignment implicitly.
2. The invention adds a target-guided semantic attention module between the encoder and decoder networks. Guided by the feature map of the target frame, this module generates a semantic attention matrix related to the target feature map at the feature level, thereby improving the perceptual visual quality.
Drawings
FIG. 1 is a block diagram of a video bit enhancement method based on an attention mechanism according to the present invention;
FIG. 2 is the overall network framework;
FIG. 3 is the global attention alignment module;
FIG. 4 is the target-guided semantic attention module.
Detailed Description
The attention mechanism-based video bit enhancement method of the present invention is described in detail below with reference to the embodiments and the accompanying drawings.
As shown in fig. 1, a video bit enhancement method based on attention mechanism of the present invention includes the following steps:
1) Firstly, the bit depth of the video signal to be enhanced is called the low bit depth, and the bit depth of the enhanced video signal is called the high bit depth; the difference between the original high bit depth image and the high bit depth image obtained by applying a zero-padding algorithm to the low bit depth image is called the residual map; a video bit enhancement model based on an attention mechanism is established;
the video bit enhancement model based on the attention mechanism comprises the following components connected in sequence: global attention alignment module 1, encoder 2, target-guided semantic attention module 3 and decoder 4, wherein,
the input end of the global attention alignment module 1 receives 5 consecutive video frames, captures long-range dependencies within and between frames, and outputs the 5 consecutive video frames after implicit alignment;
the encoder 2 receives 5 continuous video frames after implicit alignment, respectively and simultaneously extracts spatial features for each frame, and respectively outputs a feature map containing intra-frame spatial information of the corresponding frame;
the target-guided semantic attention module 3 receives the 5 feature maps output by the encoder 2, performs spatiotemporal feature fusion to obtain a feature map containing spatiotemporal feature information, and acquires feature information similar to the feature map of the intermediate frame output by the encoder 2 from the feature map and outputs the feature information to the decoder 4;
the decoder 4 reconstructs the received feature information step by step into a residual map.
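For orientation, a self-contained shape-level sketch of this four-stage pipeline in PyTorch is given below. The two attention modules are reduced to plain convolutions here; their internals are sketched separately after the corresponding descriptions below. The channel widths, the PyTorch framing, and any detail not stated in the text are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BitEnhancePipelineSketch(nn.Module):
    """Shape-level skeleton: alignment -> per-frame encoder -> spatio-temporal fusion -> decoder."""
    def __init__(self, channels=3, feat=64, num_frames=5):
        super().__init__()
        self.align = nn.Conv2d(num_frames * channels, num_frames * channels, 1)   # stands in for module 1
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(channels, feat, 3, padding=1), nn.PReLU()) for _ in range(num_frames)])
        self.fuse = nn.Conv2d(num_frames * feat, feat, 3, padding=1)               # stands in for module 3
        self.decode = nn.Sequential(nn.ConvTranspose2d(feat, feat, 3, padding=1), nn.PReLU(),
                                    nn.ConvTranspose2d(feat, channels, 3, padding=1))

    def forward(self, frames):                                   # frames: (B, 5, C, H, W) coarse high-bit group
        b, t, c, h, w = frames.shape
        aligned = self.align(frames.reshape(b, t * c, h, w)).reshape(b, t, c, h, w)
        feats = [enc(aligned[:, i]) for i, enc in enumerate(self.encoders)]
        fused = self.fuse(torch.cat(feats, dim=1))
        residual = self.decode(fused)                            # residual map of the intermediate frame
        return residual

frames = torch.randn(1, 5, 3, 32, 32)
residual = BitEnhancePipelineSketch()(frames)
enhanced_mid = frames[:, 2] + residual                           # add the residual to the intermediate frame
print(enhanced_mid.shape)                                        # torch.Size([1, 3, 32, 32])
```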
Wherein,
as shown in fig. 3, the global attention alignment module 1 includes:
(1.1) cascading the 5 consecutive video frames in the channel direction to obtain a signal of dimension TC × H × W, denoted $F$, wherein T represents the number of consecutive frames, C represents the number of channels per frame, and H, W represent the height and width of the input video frames;
(1.2) feeding $F$ into 3 separate 1 × 1 convolution kernels for linear transformation to obtain the linearly transformed signals $\theta(F)$, $\phi(F)$ and $g(F)$, and then rearranging each of them into a two-dimensional matrix of dimension TC × HW, denoted $\theta^{2}$, $\phi^{2}$ and $g^{2}$, the superscript 2 indicating that the feature map has been rearranged into 2 dimensions;
(1.3) transforming $\theta^{2}$, $\phi^{2}$ and $g^{2}$ by the following formulas:
$S = (\theta^{2})^{T} \otimes \phi^{2}$
$Y = S \otimes (g^{2})^{T}$
wherein $\otimes$ represents matrix multiplication and $(\cdot)^{T}$ represents matrix transposition; $S$ is the similarity matrix obtained from $\theta^{2}$ and $\phi^{2}$, and $Y$ represents the result after weighted summation of $g^{2}$, with dimension HW × TC; $Y$ is transposed and then rearranged into a matrix of dimension TC × H × W, denoted $Y'$;
(1.4) passing $Y'$ through a 1 × 1 convolution kernel, rearranging the result into dimension T × C × H × W, and then performing a residual connection with the input 5 consecutive video frames to obtain the 5 consecutive video frames after implicit alignment.
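A minimal PyTorch sketch of steps (1.1) to (1.4) follows. The softmax normalisation of the similarity matrix is an assumption (the text only specifies a similarity matrix followed by a weighted summation), as are the channel widths; the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class GlobalAttentionAlign(nn.Module):
    """Sketch of the global attention alignment module (non-local-style formulation assumed)."""
    def __init__(self, num_frames=5, channels=3):
        super().__init__()
        tc = num_frames * channels
        self.theta = nn.Conv2d(tc, tc, kernel_size=1)
        self.phi = nn.Conv2d(tc, tc, kernel_size=1)
        self.g = nn.Conv2d(tc, tc, kernel_size=1)
        self.out = nn.Conv2d(tc, tc, kernel_size=1)

    def forward(self, frames):                                    # frames: (B, T, C, H, W)
        b, t, c, h, w = frames.shape
        f = frames.reshape(b, t * c, h, w)                        # (1.1) concatenate along the channel direction
        theta = self.theta(f).reshape(b, t * c, h * w)            # (1.2) three 1x1 convs, flattened to TC x HW
        phi = self.phi(f).reshape(b, t * c, h * w)
        g = self.g(f).reshape(b, t * c, h * w)
        sim = torch.bmm(theta.transpose(1, 2), phi)               # (1.3) similarity matrix, HW x HW
        y = torch.bmm(sim.softmax(dim=-1), g.transpose(1, 2))     # weighted sum, HW x TC (softmax is assumed)
        y = y.transpose(1, 2).reshape(b, t * c, h, w)             # back to TC x H x W
        y = self.out(y)                                           # (1.4) 1x1 convolution
        return y.reshape(b, t, c, h, w) + frames                  # residual connection with the input frames

x = torch.randn(2, 5, 3, 32, 32)
print(GlobalAttentionAlign()(x).shape)                            # torch.Size([2, 5, 3, 32, 32])
```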
As shown in fig. 2, the encoder 2 includes 5 convolution branches corresponding to the 5 consecutive video frames; each convolution branch is formed by sequentially connecting 5 convolution layers in series, and each convolution layer includes a 3 × 3 convolution kernel and a PReLU activation function.
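A sketch of this encoder is given below, under the assumption of stride-1 convolutions and 64 feature channels (neither is stated in the text); each branch keeps its per-layer outputs because the decoder later reuses the 2nd and 4th layer features.

```python
import torch
import torch.nn as nn

class EncoderBranch(nn.Module):
    """One encoder branch: 5 serial 3x3 convolution layers, each followed by PReLU."""
    def __init__(self, in_ch=3, feat=64):
        super().__init__()
        chans = [in_ch] + [feat] * 5
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, padding=1), nn.PReLU()) for i in range(5)])

    def forward(self, x):
        outs = []
        for block in self.blocks:
            x = block(x)
            outs.append(x)           # outs[k] is the output of the (k+1)-th convolution layer
        return outs                   # outs[-1] is the spatial feature map of this frame

branches = nn.ModuleList([EncoderBranch() for _ in range(5)])     # one branch per video frame
frames = torch.randn(1, 5, 3, 32, 32)
per_frame = [branches[i](frames[:, i]) for i in range(5)]
print(len(per_frame), per_frame[0][-1].shape)                     # 5 branches, each ending in (1, 64, 32, 32)
```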
As shown in fig. 4, the target-guided semantic attention module 3 includes:
(3.1) receiving the 5 feature maps output by the encoder 2, each of dimension Ch × H × W, wherein Ch represents the number of channels of each feature map and H, W represent the height and width of the feature maps, and cascading the 5 feature maps in the channel direction into a feature map of dimension 5Ch × H × W;
(3.2) further fusing the spatio-temporal information through a 3 × 3 convolution kernel to obtain a new feature map $F_{f}$ of dimension Ch × H × W;
(3.3) rearranging the new feature map $F_{f}$ into a two-dimensional matrix, denoted $F_{f}^{2}$, of dimension Ch × HW; the intermediate one of the 5 feature maps received from the encoder 2, denoted $F_{t}$, is likewise rearranged into a two-dimensional matrix, denoted $F_{t}^{2}$, of dimension Ch × HW;
(3.4) performing the following operations on $F_{f}^{2}$ and $F_{t}^{2}$:
$S' = (F_{t}^{2})^{T} \otimes F_{f}^{2}$
$Y' = S' \otimes (F_{f}^{2})^{T}$
wherein $\otimes$ represents matrix multiplication and $(\cdot)^{T}$ represents matrix transposition; $S'$ is the similarity matrix obtained from $F_{t}^{2}$ and $F_{f}^{2}$, and $Y'$ represents the result after weighted summation of $F_{f}^{2}$, with dimension HW × Ch; $Y'$ is transposed and rearranged into dimension Ch × H × W, denoted $Y''$;
(3.5) performing a residual connection between $Y''$ and the fused feature map $F_{f}$, and then feeding the result into a 3 × 3 convolution kernel to extract features.
As shown in fig. 1, the decoder 4 is formed by sequentially connecting 5 transposed convolutional layers in series, each of which contains a transposed convolution kernel and a PReLU activation function, wherein the input of the second transposed convolutional layer is the sum of the output of the first transposed convolutional layer and the outputs of the fourth convolutional layer in each branch of the encoder 2, and the input of the fourth transposed convolutional layer is the sum of the output of the third transposed convolutional layer and the outputs of the second convolutional layer in each branch of the encoder 2.
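A sketch of this decoder, kept consistent with the encoder sketch above, is given below; stride-1 transposed convolutions, 64 channels, and summing the skip features over the 5 branches are assumptions.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the decoder: 5 transposed-convolution + PReLU layers with the two skip sums described above."""
    def __init__(self, feat=64, out_ch=3):
        super().__init__()
        out_chans = [feat, feat, feat, feat, out_ch]
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.ConvTranspose2d(feat, c, 3, padding=1), nn.PReLU()) for c in out_chans])

    def forward(self, fused, enc2_sum, enc4_sum):
        # fused: output of the target-guided semantic attention module
        # enc2_sum / enc4_sum: outputs of the 2nd / 4th convolution layer summed over the 5 encoder branches
        x1 = self.layers[0](fused)
        x2 = self.layers[1](x1 + enc4_sum)    # 2nd layer input: 1st layer output + 4th-layer encoder features
        x3 = self.layers[2](x2)
        x4 = self.layers[3](x3 + enc2_sum)    # 4th layer input: 3rd layer output + 2nd-layer encoder features
        return self.layers[4](x4)             # reconstructed residual map

fused = torch.randn(1, 64, 32, 32)
skip = torch.randn(1, 64, 32, 32)
print(Decoder()(fused, skip, skip).shape)     # torch.Size([1, 3, 32, 32])
```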
2) Randomly selecting a set number of original video sequence groups with high bit depth from an image enhancement database to construct the training data set; specifically, the original video sequence groups are quantized to the low bit depth, wherein each video sequence group contains 5 consecutive video frames, the zero-padding algorithm is applied to the low bit depth video sequence groups to expand them to the high bit depth, and the intermediate frame of the video sequence expanded by the zero-padding algorithm is subtracted from the intermediate frame of the original video sequence group to obtain the real residual map, thereby forming the training data set.
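The following NumPy sketch illustrates this construction for one 5-frame group; the 16-bit originals quantized to 4 bits match the experiments below, while the array layout and function name are illustrative.

```python
import numpy as np

def make_training_sample(group_16bit, low=4, high=16):
    """group_16bit: (5, H, W) uint16 original frames. Returns (network input, target residual)."""
    shift = high - low
    low_bit = (group_16bit >> shift).astype(np.uint16)        # quantize to the low bit depth
    coarse = (low_bit << shift).astype(np.uint16)             # zero-padding back to the high bit depth
    mid = len(group_16bit) // 2
    residual = group_16bit[mid].astype(np.int32) - coarse[mid].astype(np.int32)   # real residual map
    return coarse, residual

group = np.random.randint(0, 2 ** 16, size=(5, 64, 64), dtype=np.uint16)
coarse, residual = make_training_sample(group)
print(coarse.dtype, residual.min(), residual.max())
```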
3) Training the attention mechanism-based video bit enhancement model with the constructed training data set; during training, the input of the network is a video sequence obtained by applying the zero-padding algorithm to a low bit depth video sequence group in the training data set to expand it to the high bit depth, and the output is a residual map; the mean square error (MSE) between the residual map generated by the network and the real residual map is used as the loss function, and an Adam optimizer is used to optimize the attention mechanism-based video bit enhancement model.
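A minimal PyTorch training step under these choices (MSE loss, Adam) could look as follows; the learning rate and the model interface are assumptions.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, coarse_group, true_residual):
    """One optimization step: the model maps a coarse high-bit 5-frame group to a residual map,
    and the MSE against the real residual map is minimized."""
    optimizer.zero_grad()
    pred_residual = model(coarse_group)                        # (B, C, H, W)
    loss = nn.functional.mse_loss(pred_residual, true_residual)
    loss.backward()
    optimizer.step()
    return loss.item()

# usage with any nn.Module exposing the (B, 5, C, H, W) -> (B, C, H, W) interface (lr is illustrative):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = train_step(model, optimizer, coarse_group, true_residual)
```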
4) Selecting video sequence groups from the image enhancement database to form a test set, and testing the trained attention mechanism-based video bit enhancement model; specifically, the video sequence groups forming the test set are quantized to the low bit depth, expanded to the high bit depth with the zero-padding algorithm, and input into the trained attention-based video bit enhancement model to obtain the residual map of the intermediate frame predicted by the model; this residual map is added to the intermediate frame of the high bit depth video sequence expanded by the zero-padding algorithm to obtain the reconstructed high bit depth intermediate frame, whose quality is then evaluated. Two evaluation metrics are adopted: peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM).
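For reference, the two metrics can be computed, for example, with scikit-image; the 16-bit data range matches the experiments, and SSIM settings such as window size are left at the library defaults.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_frame(reconstructed, original, bit_depth=16):
    """PSNR / SSIM of a reconstructed high bit depth intermediate frame against the original."""
    data_range = 2 ** bit_depth - 1
    psnr = peak_signal_noise_ratio(original, reconstructed, data_range=data_range)
    ssim = structural_similarity(original, reconstructed, data_range=data_range)
    return psnr, ssim

orig = np.random.randint(0, 2 ** 16, size=(64, 64), dtype=np.uint16)
rec = np.clip(orig.astype(np.int32) + np.random.randint(-100, 100, orig.shape), 0, 65535).astype(np.uint16)
print(evaluate_frame(rec, orig))
```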
5) Applying a zero-padding algorithm to the video signal to be enhanced to obtain a high bit depth video signal, sequentially inputting the high bit depth video signal obtained by the zero-padding algorithm, in groups of 5 frames, to the tested attention mechanism-based video bit enhancement model, and adding each output result of the model to the intermediate frame of the correspondingly input video sequence group, thereby obtaining the enhanced intermediate frames in sequence.
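Putting the pieces together, inference over a full video could proceed as in the sketch below; a grayscale (T, H, W) layout, [0, 1] normalisation and the 4-to-16-bit setting are assumptions, and any model exposing the group-of-5 to residual interface described above can be plugged in.

```python
import numpy as np
import torch

def enhance_video(model, low_bit_video, low=4, high=16):
    """Enhance a low bit depth video: zero-pad each frame to the high bit depth, feed sliding groups
    of 5 coarse frames to the model, and add each predicted residual to the group's intermediate frame."""
    shift = high - low
    coarse = low_bit_video.astype(np.uint16) << shift             # zero-padding algorithm
    peak = 2 ** high - 1
    enhanced = []
    model.eval()
    with torch.no_grad():
        for t in range(2, len(coarse) - 2):                       # every frame with two neighbours on each side
            group = coarse[t - 2:t + 3].astype(np.float32) / peak
            x = torch.from_numpy(group)[None, :, None]            # (1, 5, 1, H, W)
            residual = model(x)[0, 0].cpu().numpy() * peak
            frame = np.clip(coarse[t].astype(np.float64) + residual, 0, peak)
            enhanced.append(frame.astype(np.uint16))
    return enhanced
```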
Example 1
The embodiment of the invention comprises the following steps:
101: from a 16-bit Sintel database [9] Randomly selecting 1000 original groups of video sequences, quantizing each group of 5 frame video frames to 4 bit depth, applying a zero-padding algorithm to the 4 bit depth video sequence to expand the 4 bit video sequence into a 16 bit depth video sequence, and calling the 16 bit depth video frame expanded by the zero-padding algorithm as a rough high bit depth video frame;
102: in the embodiment, a coder and a decoder are used as a network basic framework, a global attention alignment module is added at the head of an encoder, and the module can capture long-distance dependence by calculating the correlation between the video sequence frame and perform implicit Motion Estimation and Motion Compensation (ME & MC); and adding a semantic attention module of target guidance at the joint of the encoder and the decoder, fusing the spatial features extracted by the encoder by the semantic attention module, and then, taking the intermediate frame as a guidance feature to be related to the fused feature map filled with the space-time semantic features to obtain a semantic attention matrix. And carrying out matrix multiplication on the semantic attention matrix and the feature map filled with the space-time semantic features to obtain a transformed feature map. The module can help the network to focus more on the information related to the target frame at the semantic level, and the perception quality is improved.
103: the coarse high bit depth video sequence is input into the network, and a residual map is generated. For original high bit depth video sequence intermediate frame and coarse high bit depth videoAnd performing difference on the inter frames to obtain a real residual error image. Using Mean Square Error loss (MSE) as a loss function for network generated residual maps and true residual maps, using an Adam optimizer [11] And optimizing the video bit enhancement model based on the attention mechanism.
104: in the testing phase, 50 sets of video sequences with 16-bit depth different from the training set are randomly selected from the Sintel dataset, and from the Tears of Steel (TOS) dataset [9] 30 groups of video sequences with 16 bit depths are selected. The test set is quantized to 4 bit depth and then back quantized to a coarse high bit depth video sequence using a zero-padding algorithm. And loading the trained model parameters to a video bit enhancement model based on an attention mechanism, then transmitting the rough high bit depth video sequence to the model to generate a residual map, and adding the residual map and the intermediate frame of the rough high bit depth video to obtain a reconstructed high bit depth map. Peak signal-to-noise Ratio (PSNR) and Structural Similarity (SSIM) are used [12] These two objective evaluation criteria evaluate the test results to verify the effectiveness of the invention.
In summary, the embodiment of the present invention designs a video bit depth enhancement method based on an attention mechanism through steps 101 to 104. A global attention alignment module is introduced into a classical encoder-decoder network, and a target-guided semantic attention module is added. The global attention module plays the same role as motion estimation and motion compensation: it captures long-range dependencies to acquire, from the video sequence, auxiliary information useful for reconstructing the target frame. This avoids the two-stage processing of motion estimation and motion compensation and keeps computational complexity and running time low. The target-guided semantic attention module, guided by the feature map of the target frame at the semantic level, generates a spatio-temporal feature map highly related to the feature map of the target frame. The invention thus realizes end-to-end video bit depth enhancement in a single stage, avoids the high computational complexity of motion compensation, and achieves better reconstruction quality.
Example 2
The following example 1 protocol was evaluated for efficacy in combination with specific experimental data, as described in detail below:
301: data composition
The test set consists of 50 groups of 16-bit depth consecutive video frames randomly drawn from the Sintel database, which do not overlap with the training set, and 30 groups of 16-bit depth consecutive video frames randomly drawn from the TOS database; each group contains 5 frames.
302: evaluation criterion
The invention mainly adopts two evaluation indexes to evaluate the quality of the reconstructed high bit depth video frame:
Peak signal-to-noise ratio (PSNR) is a commonly used objective metric for evaluating image quality.
The structural similarity index (SSIM) [12] measures the structural similarity of two images. It compares two images from three aspects: luminance, contrast and structure. This metric is more consistent with the characteristics of human vision and can reflect the subjective quality of an image. Its value ranges from 0 to 1; the higher the score, the more similar the reconstructed high-bit image is to the original high-bit image and the better the reconstruction quality.
303: Comparison algorithms
The embodiment of the invention is compared with 10 bit depth enhancement algorithms, which comprise 8 traditional image bit enhancement methods, 1 image bit enhancement method based on a neural network and 1 video bit enhancement method based on the neural network.
The 8 conventional image bit enhancement methods include: 1) the Zero Padding algorithm (ZP); 2) the Ideal Gain Multiplication algorithm (MIG); 3) the Bit Replication algorithm (BR) [2]; 4) the Minimum Risk based Classification algorithm (MRC) [10]; 5) the Contour Region Reconstruction algorithm (CRR) [3]; 6) the Content Adaptive image bit depth enhancement algorithm (CA) [4]; 7) the Maximum a Posteriori Estimation of AC Signal algorithm (ACDC) [14]; and 8) the adaptive de-quantization algorithm based on intensity potential (IPAD) [5].
The neural-network-based image bit enhancement method is the convolutional-neural-network-based image bit depth enhancement algorithm (BE-CNN) [6].
The neural-network-based video bit enhancement method is the spatiotemporal symmetric convolutional neural network based video bit depth enhancement algorithm (VBDE) [13].
Table 1 lists the test results of this method on the Sintel test set and the TOS test set, compared with the ten other methods. On the Sintel test set, the PSNR of this method reaches 41.5293 dB and the SSIM reaches 0.9672, which is clearly higher than the performance of the other methods. Compared with the Sintel dataset, the TOS dataset differs considerably in content and contains more numerous and more complex scenes. On the TOS test set, the PSNR of this method reaches 39.3155 dB and the SSIM reaches 0.9572, showing good generality. These tests fully demonstrate the effectiveness of the method.
TABLE 1: PSNR and SSIM comparison of the proposed method with the ten reference methods on the Sintel and TOS test sets.
Reference documents
[1] Wan P, Au O C, Tang K, et al. From 2D extrapolation to 1D interpolation: Content adaptive image bit-depth expansion [C]// 2012 IEEE International Conference on Multimedia and Expo. IEEE, 2012: 170-175.
[2] Ulichney R A, Cheung S. Pixel bit-depth increase by bit replication [C]// Color Imaging: Device-Independent Color, Color Hardcopy, and Graphic Arts III. International Society for Optics and Photonics, 1998, 3300: 232-241.
[3] Cheng C H, Au O C, Liu C H, et al. Bit-depth expansion by contour region reconstruction [C]// 2009 IEEE International Symposium on Circuits and Systems. IEEE, 2009: 944-947.
[4] Wan P, Au O C, Tang K, et al. From 2D extrapolation to 1D interpolation: Content adaptive image bit-depth expansion [C]// 2012 IEEE International Conference on Multimedia and Expo. IEEE, 2012: 170-175.
[5] Liu J, Zhai G, Liu A, et al. IPAD: Intensity potential for adaptive de-quantization [J]. IEEE Transactions on Image Processing, 2018, 27(10): 4860-4872.
[6] Liu J, Sun W, Liu Y. Bit-depth enhancement via convolutional neural network [C]// International Forum on Digital TV and Wireless Multimedia Communications. Springer, Singapore, 2017: 255-264.
[7] Liu J, Sun W, Su Y, et al. BE-CALF: Bit-depth enhancement by concatenating all level features of DNN [J]. IEEE Transactions on Image Processing, 2019, 28(10): 4926-4940.
[8] Byun J, Shim K, Kim C. BitNet: Learning-based bit-depth expansion [C]// Asian Conference on Computer Vision. Springer, Cham, 2018: 67-82.
[9] Xiph.Org Foundation, https://www.xiph.org/, 2016.
[10] Mittal G, Jakhetiya V, Jaiswal S P, et al. Bit-depth expansion using minimum risk based classification [C]// 2012 Visual Communications and Image Processing. IEEE, 2012: 1-5.
[11] Kingma D P, Ba J. Adam: A method for stochastic optimization [J]. arXiv preprint arXiv:1412.6980, 2014.
[12] Wang Z, Bovik A C, Sheikh H R, et al. Image quality assessment: from error visibility to structural similarity [J]. IEEE Transactions on Image Processing, 2004, 13(4): 600-612.
[13] Liu J, Liu P, Su Y, et al. Spatiotemporal symmetric convolutional neural network for video bit-depth enhancement [J]. IEEE Transactions on Multimedia, 2019, 21(9): 2397-2406.
[14] Wan P, Cheung G, Florencio D, et al. Image bit-depth enhancement via maximum a posteriori estimation of AC signal [J]. IEEE Transactions on Image Processing, 2016, 25(6): 2896-2909.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-mentioned serial numbers of the embodiments of the present invention are only for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A video bit enhancement method based on an attention mechanism is characterized by comprising the following steps:
1) Firstly, the bit depth of the video signal to be enhanced is called the low bit depth, and the bit depth of the enhanced video signal is called the high bit depth; the difference between the original high bit depth image and the high bit depth image obtained by applying a zero-padding algorithm to the low bit depth image is called the residual map; a video bit enhancement model based on an attention mechanism is established;
the video bit enhancement model based on the attention mechanism comprises the following components connected in sequence: a global attention alignment module (1), an encoder (2), a target-guided semantic attention module (3) and a decoder (4), wherein,
the input end of the global attention alignment module (1) receives 5 continuous video frames, is used for capturing long-distance dependence between frames and in frames, and outputs the 5 continuous video frames after implicit alignment;
the encoder (2) receives 5 continuous video frames after implicit alignment, simultaneously extracts spatial features of each frame respectively, and outputs a feature map containing intra-frame spatial information of the corresponding frame respectively;
the target-guided semantic attention module (3) receives 5 feature maps output by the encoder (2), performs space-time feature fusion to obtain a feature map containing space-time feature information, acquires feature information similar to the feature map of the intermediate frame output by the encoder (2) from the feature map and outputs the feature information to the decoder (4);
the decoder (4) gradually reconstructs the received characteristic information into a residual error map;
the global attention alignment module (1) comprises:
(1.1) cascading the 5 consecutive video frames in the channel direction to obtain a signal of dimension TC × H × W, denoted $F$, wherein T represents the number of consecutive frames, C represents the number of channels per frame, and H, W represent the height and width of the input video frames;
(1.2) feeding $F$ into 3 separate 1 × 1 convolution kernels for linear transformation to obtain the linearly transformed signals $\theta(F)$, $\phi(F)$ and $g(F)$, and then rearranging each of them into a two-dimensional matrix of dimension TC × HW, denoted $\theta^{2}$, $\phi^{2}$ and $g^{2}$, the superscript 2 indicating that the matrix is two-dimensional;
(1.3) transforming $\theta^{2}$, $\phi^{2}$ and $g^{2}$ by the following formulas:
$S = (\theta^{2})^{T} \otimes \phi^{2}$
$Y = S \otimes (g^{2})^{T}$
wherein $\otimes$ represents matrix multiplication and $(\cdot)^{T}$ represents matrix transposition; $S$ is the similarity matrix obtained from $\theta^{2}$ and $\phi^{2}$, and $Y$ represents the result after weighted summation of $g^{2}$, with dimension HW × TC; $Y$ is transposed and then rearranged into a matrix of dimension TC × H × W, denoted $Y'$;
(1.4) passing $Y'$ through a 1 × 1 convolution kernel, rearranging the result into dimension T × C × H × W, and then performing a residual connection with the input 5 consecutive video frames to obtain the 5 consecutive video frames after implicit alignment;
the target-guided semantic attention module (3) comprises:
(3.1) receiving the 5 feature maps output by the encoder (2), each of dimension Ch × H × W, wherein Ch represents the number of channels of each feature map and H, W represent the height and width of the feature maps, and cascading the 5 feature maps in the channel direction into a feature map of dimension 5Ch × H × W;
(3.2) further fusing the spatio-temporal information through a 3 × 3 convolution kernel to obtain a new feature map $F_{f}$ of dimension Ch × H × W;
(3.3) rearranging the new feature map $F_{f}$ into a two-dimensional matrix, denoted $F_{f}^{2}$, of dimension Ch × HW; the intermediate one of the 5 feature maps received from the encoder (2), denoted $F_{t}$, is likewise rearranged into a two-dimensional matrix, denoted $F_{t}^{2}$, of dimension Ch × HW;
(3.4) performing the following operations on $F_{f}^{2}$ and $F_{t}^{2}$:
$S' = (F_{t}^{2})^{T} \otimes F_{f}^{2}$
$Y' = S' \otimes (F_{f}^{2})^{T}$
wherein $\otimes$ represents matrix multiplication and $(\cdot)^{T}$ represents matrix transposition; $S'$ is the similarity matrix obtained from $F_{t}^{2}$ and $F_{f}^{2}$, and $Y'$ represents the result after weighted summation of $F_{f}^{2}$, with dimension HW × Ch; $Y'$ is transposed and rearranged into dimension Ch × H × W, denoted $Y''$;
(3.5) performing a residual connection between $Y''$ and the fused feature map $F_{f}$, and then feeding the result into a 3 × 3 convolution kernel to extract features;
2) Randomly selecting a set number of original video sequence groups with high bit depth from an image enhancement database to construct a training data set;
3) Training a video bit enhancement model based on an attention mechanism by using the constructed training data set;
4) Selecting a video sequence group from an image enhancement database to form a test set, and testing the trained video bit enhancement model based on the attention mechanism;
5) Applying a zero-padding algorithm to the video signal to be enhanced to obtain a high bit depth video signal, sequentially inputting the high bit depth video signal obtained by the zero-padding algorithm, in groups of 5 frames, to the tested attention mechanism-based video bit enhancement model, and adding each output result of the model to the intermediate frame of the correspondingly input video sequence group, thereby obtaining the enhanced intermediate frames in sequence.
2. The method according to claim 1, wherein the encoder (2) comprises 5 convolution branches corresponding to 5 consecutive video frames, each convolution branch comprising 5 convolutional layers connected in series, each convolutional layer comprising a 3 x 3 convolutional kernel and a PReLU activation function connected in series.
3. The method according to claim 1, wherein the decoder (4) is formed by sequentially concatenating 5 transposed convolutional layers, each of which comprises a transposed convolutional kernel and a PReLU activation function, wherein the input of the second transposed convolutional layer is the sum of the output of the first transposed convolutional layer and the output of the fourth convolutional layer in each branch of the encoder (2), and the input of the fourth transposed convolutional layer is the sum of the output of the third convolutional layer and the output of the second convolutional layer in each branch of the encoder (2).
4. The method as claimed in claim 1, wherein the step 2) comprises quantizing the original video sequence groups to the low bit depth, wherein each video sequence group comprises 5 consecutive video frames, applying the zero-padding algorithm to the low bit depth video sequence groups to expand them to the high bit depth, and subtracting the intermediate frame of the video sequence expanded by the zero-padding algorithm from the intermediate frame of the original video sequence group to obtain the real residual map, thereby forming the training data set.
5. The method according to claim 1, wherein in the training of step 3), the input of the network is a video sequence in the training data set that is extended to the high bit depth by applying the zero-padding algorithm to the low bit depth video sequence group, and the output is a residual map; the mean square error between the residual map generated by the network and the real residual map is adopted as the loss function, and an Adam optimizer is used to optimize the attention mechanism-based video bit enhancement model.
6. The method as claimed in claim 1, wherein the step 4) includes quantizing the group of video sequences constituting the test set to a low bit depth, expanding the video sequences to a high bit depth by applying a zero-padding algorithm, inputting the video sequences with the high bit depth into a trained video bit enhancement model based on the attention mechanism to obtain a residual map of an intermediate frame predicted by the model, adding the residual map to the intermediate frame of the video sequences with the bit depth expanded by the zero-padding algorithm to obtain a reconstructed intermediate frame with the high bit depth, and evaluating the quality of the reconstructed intermediate frame with the high bit depth by using an evaluation method.
7. The method of claim 6, wherein the evaluation method adopts peak signal-to-noise ratio and the structural similarity index.
CN202011166047.7A 2020-10-27 2020-10-27 Attention mechanism-based video bit enhancement method Active CN112381866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011166047.7A CN112381866B (en) 2020-10-27 2020-10-27 Attention mechanism-based video bit enhancement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011166047.7A CN112381866B (en) 2020-10-27 2020-10-27 Attention mechanism-based video bit enhancement method

Publications (2)

Publication Number Publication Date
CN112381866A CN112381866A (en) 2021-02-19
CN112381866B true CN112381866B (en) 2022-12-13

Family

ID=74576777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011166047.7A Active CN112381866B (en) 2020-10-27 2020-10-27 Attention mechanism-based video bit enhancement method

Country Status (1)

Country Link
CN (1) CN112381866B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066022B (en) * 2021-03-17 2022-08-16 天津大学 Video bit enhancement method based on efficient space-time information fusion
CN113313682B (en) * 2021-05-28 2023-03-21 西安电子科技大学 No-reference video quality evaluation method based on space-time multi-scale analysis
CN113507607B (en) * 2021-06-11 2023-05-26 电子科技大学 Compressed video multi-frame quality enhancement method without motion compensation
CN113450280A (en) * 2021-07-07 2021-09-28 电子科技大学 Method for enhancing quality of compressed video by fusing space-time information from coarse to fine
CN114582029B (en) * 2022-05-06 2022-08-02 山东大学 Non-professional dance motion sequence enhancement method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008938A (en) * 2019-11-25 2020-04-14 天津大学 Real-time multi-frame bit enhancement method based on content and continuity guidance
CN111031315A (en) * 2019-11-18 2020-04-17 复旦大学 Compressed video quality enhancement method based on attention mechanism and time dependency

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111031315A (en) * 2019-11-18 2020-04-17 复旦大学 Compressed video quality enhancement method based on attention mechanism and time dependency
CN111008938A (en) * 2019-11-25 2020-04-14 天津大学 Real-time multi-frame bit enhancement method based on content and continuity guidance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Spatiotemporal symmetric convolutional neural network for video bit-depth enhancement; Jing Liu et al.; IEEE Transactions on Multimedia; 2019-09-30; Vol. 21, No. 9; full text *

Also Published As

Publication number Publication date
CN112381866A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN112381866B (en) Attention mechanism-based video bit enhancement method
Liang et al. Vrt: A video restoration transformer
CN111028150B (en) Rapid space-time residual attention video super-resolution reconstruction method
CN109309834B (en) Video compression method based on convolutional neural network and HEVC compression domain significant information
CN106709875B (en) Compressed low-resolution image restoration method based on joint depth network
CN111008938B (en) Real-time multi-frame bit enhancement method based on content and continuity guidance
CN111260560B (en) Multi-frame video super-resolution method fused with attention mechanism
CN113066022B (en) Video bit enhancement method based on efficient space-time information fusion
CN110751597B (en) Video super-resolution method based on coding damage repair
CN111784578A (en) Image processing method, image processing device, model training method, model training device, image processing equipment and storage medium
CN110852964A (en) Image bit enhancement method based on deep learning
US20110317916A1 (en) Method and system for spatial-temporal denoising and demosaicking for noisy color filter array videos
WO2023000179A1 (en) Video super-resolution network, and video super-resolution, encoding and decoding processing method and device
CN110889895A (en) Face video super-resolution reconstruction method fusing single-frame reconstruction network
CN110796622A (en) Image bit enhancement method based on multi-layer characteristics of series neural network
Agustsson et al. Extreme learned image compression with gans
CN114757828A (en) Transformer-based video space-time super-resolution method
CN113850718A (en) Video synchronization space-time super-resolution method based on inter-frame feature alignment
CN114066730B (en) Video frame interpolation method based on unsupervised dual learning
CN113592746A (en) Method for enhancing quality of compressed video by fusing space-time information from coarse to fine
CN112862675A (en) Video enhancement method and system for space-time super-resolution
Liu et al. Gated context model with embedded priors for deep image compression
Li et al. Extreme underwater image compression using physical priors
CN116012272A (en) Compressed video quality enhancement method based on reconstructed flow field
Zhang et al. SPQE: Structure-and-Perception-Based Quality Evaluation for Image Super-Resolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant