CN112381866B - Attention mechanism-based video bit enhancement method - Google Patents

Attention mechanism-based video bit enhancement method

Info

Publication number
CN112381866B
CN112381866B
Authority
CN
China
Prior art keywords
video
bit depth
frames
multiplied
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011166047.7A
Other languages
Chinese (zh)
Other versions
CN112381866A (en)
Inventor
刘婧
杨紫雯
于洁潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011166047.7A
Publication of CN112381866A
Application granted
Publication of CN112381866B
Active legal status
Anticipated expiration legal status

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G06T7/55 - Depth or shape recovery from multiple images
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N17/00 - Diagnosis, testing or measuring for television systems or their details
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20172 - Image enhancement details

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Processing (AREA)

Abstract

A video bit enhancement method based on an attention mechanism comprises the following steps: establishing a video bit enhancement model based on an attention mechanism; randomly selecting a set number of original video sequence groups with high bit depth from an image enhancement database to construct a training data set; training the attention mechanism-based video bit enhancement model with the constructed training data set; selecting video sequence groups from the image enhancement database to form a test set, and testing the trained attention mechanism-based video bit enhancement model; and applying a zero-padding algorithm to the video signal to be enhanced to obtain a high bit depth video signal, sequentially inputting the high bit depth video signal, in groups of 5 frames, to the tested attention mechanism-based video bit enhancement model, and adding each output result to the intermediate frame of the correspondingly input video sequence group to obtain the enhanced intermediate frames in sequence. The invention generates a semantic attention matrix related to the target feature map at the feature level, thereby improving the perceptual visual quality.

Description

Attention mechanism-based video bit enhancement method
Technical Field
The invention relates to a video bit enhancement method. In particular to a video bit enhancement method based on an attention mechanism.
Background
Multimedia resources such as images and videos carry abundant information and allow people to quickly learn about things happening elsewhere. Since the advent of video recording and display devices, efforts have been directed at capturing and displaying higher quality images and videos. In pursuit of a better visual experience, High Dynamic Range (HDR) technology has been proposed, which uses a higher dynamic range and a greater bit depth (typically 10 or 12 bits) to represent each pixel. Images and videos with a high dynamic range can exhibit richer colors, finer color transitions, and more realistic texture details. With the development of technology, ultra-high definition displays and HDR displays are becoming popular choices. However, the vast majority of images and videos previously captured with older cameras have only 8 bits of bit depth, and when they are presented on HDR displays, false contours, color distortions [1] and similar artifacts appear, which harm the viewing experience. Thus, bit depth enhancement of low bit depth images and videos is of great significance and value for improving the human sensory experience.
Early bit depth enhancement methods such as Zero Padding (ZP), Ideal Gain Multiplication (MIG), and Bit Replication (BR) [2] operate on independent pixels; although simple and fast to compute, they leave obvious false contour artifacts. Later, some differential-based methods were proposed, such as the Contour Region Reconstruction algorithm (CRR) [3], the Content Adaptive image bit-depth expansion algorithm (CA) [4], and the adaptive de-quantization algorithm based on intensity potential (IPAD) [5]. These methods take the context around each pixel into account and can better remove false contours, but the reconstructed image content tends to be blurred and to lose details. In recent years, neural networks have achieved remarkable success in the field of computer vision and have demonstrated strong learning and adaptation abilities on specific tasks. Deep learning has therefore also been introduced into the field of bit depth enhancement: the convolutional-neural-network-based image bit depth enhancement algorithm (BE-CNN) [6], the bit-depth enhancement algorithm by concatenating all level features of DNN (BE-CALF) [7], and the learning-based bit-depth expansion method (BitNet) [8] all achieve good performance.
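As an illustration of how these pixel-wise methods differ, the following is a minimal NumPy sketch of ZP, MIG and BR for expanding 4-bit values to 16-bit; the bit depths and function names are chosen here for illustration and are not taken from the patent.

```python
import numpy as np

def zero_padding(x, low=4, high=16):
    # ZP: append (high - low) zero bits to each pixel value
    return x.astype(np.uint16) << (high - low)

def ideal_gain(x, low=4, high=16):
    # MIG: scale by the ratio of the maximum representable values
    gain = (2 ** high - 1) / (2 ** low - 1)
    return np.round(x.astype(np.float64) * gain).astype(np.uint16)

def bit_replication(x, low=4, high=16):
    # BR: repeat the low-bit pattern until the high bit depth is filled
    x = x.astype(np.uint32)
    out = np.zeros_like(x)
    shift = high
    while shift > 0:
        shift -= low
        if shift >= 0:
            out |= x << shift
        else:
            out |= x >> (-shift)
    return out.astype(np.uint16)

frame_4bit = np.random.randint(0, 16, size=(64, 64), dtype=np.uint8)
print(zero_padding(frame_4bit).max(), ideal_gain(frame_4bit).max(), bit_replication(frame_4bit).max())
```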
These bit enhancement methods are all image-oriented. If they are applied to a low bit depth video sequence, the redundant information in preceding and following frames cannot be well utilized, and the generated high-bit video sequence suffers from inter-frame flicker and similar artifacts.
Disclosure of Invention
The invention aims to solve the technical problem of providing a video bit enhancement method based on an attention mechanism, which can quickly reconstruct a high-bit intermediate frame with better subjective quality and objective quality.
The technical scheme adopted by the invention is as follows: a video bit enhancement method based on an attention mechanism comprises the following steps:
1) Firstly, the bit depth of the video signal to be enhanced is called the low bit depth, and the bit depth of the enhanced video signal is called the high bit depth; the difference between the original high bit depth image and the high bit depth image obtained by applying a zero-padding algorithm to the low bit depth image is called the residual map; a video bit enhancement model based on an attention mechanism is established;
2) Randomly selecting a set number of original video sequence groups with high bit depth from an image enhancement database to construct a training data set;
3) Training a video bit enhancement model based on an attention mechanism by using the constructed training data set;
4) Selecting a video sequence group from an image enhancement database to form a test set, and testing the trained video bit enhancement model based on the attention mechanism;
5) Applying a zero-padding algorithm to the video signal to be enhanced to obtain a high bit depth video signal, sequentially inputting the high bit depth video signal obtained by the zero-padding algorithm, in groups of 5 frames, to the tested attention mechanism-based video bit enhancement model, and adding each output result of the model to the intermediate frame of the correspondingly input video sequence group, thereby obtaining the enhanced intermediate frames in sequence.
The video bit enhancement method based on the attention mechanism has the advantages that:
1. The invention uses an encoder-decoder network as the backbone of the network and adds a global attention alignment module before the encoder network. This module computes the correlation among the frames of the video sequence to generate an attention map, amplifies feature points with high correlation, and performs video alignment implicitly.
2. The invention adds a target-guided semantic attention module between the encoder and decoder networks. Guided by the feature map of the target frame, this module generates a semantic attention matrix related to the target feature map at the feature level, thereby improving the perceptual visual quality.
Drawings
FIG. 1 is a block diagram of a video bit enhancement method based on an attention mechanism according to the present invention;
FIG. 2 is the overall network framework;
FIG. 3 is the global attention alignment module;
FIG. 4 is the target-guided semantic attention module.
Detailed Description
The attention mechanism-based video bit enhancement method of the present invention is described in detail below with reference to the embodiments and the accompanying drawings.
As shown in fig. 1, a video bit enhancement method based on attention mechanism of the present invention includes the following steps:
1) Firstly, the bit depth of the video signal to be enhanced is called the low bit depth, and the bit depth of the enhanced video signal is called the high bit depth; the difference between the original high bit depth image and the high bit depth image obtained by applying a zero-padding algorithm to the low bit depth image is called the residual map; a video bit enhancement model based on an attention mechanism is established;
the video bit enhancement model based on the attention mechanism comprises the following components connected in sequence: global attention alignment module 1, encoder 2, target-guided semantic attention module 3 and decoder 4, wherein,
the input end of the global attention alignment module 1 receives 5 consecutive video frames, captures long-range dependencies within and between frames, and outputs the 5 consecutive video frames after implicit alignment;
the encoder 2 receives 5 continuous video frames after implicit alignment, respectively and simultaneously extracts spatial features for each frame, and respectively outputs a feature map containing intra-frame spatial information of the corresponding frame;
the target-guided semantic attention module 3 receives the 5 feature maps output by the encoder 2, performs spatiotemporal feature fusion to obtain a feature map containing spatiotemporal feature information, and acquires feature information similar to the feature map of the intermediate frame output by the encoder 2 from the feature map and outputs the feature information to the decoder 4;
the decoder 4 reconstructs the received feature information step by step into a residual map.
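For orientation, a self-contained shape-level sketch of this four-stage pipeline in PyTorch is given below. The two attention modules are reduced to plain convolutions here; their internals are sketched separately after the corresponding descriptions below. The channel widths, the PyTorch framing, and any detail not stated in the text are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BitEnhancePipelineSketch(nn.Module):
    """Shape-level skeleton: alignment -> per-frame encoder -> spatio-temporal fusion -> decoder."""
    def __init__(self, channels=3, feat=64, num_frames=5):
        super().__init__()
        self.align = nn.Conv2d(num_frames * channels, num_frames * channels, 1)   # stands in for module 1
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(channels, feat, 3, padding=1), nn.PReLU()) for _ in range(num_frames)])
        self.fuse = nn.Conv2d(num_frames * feat, feat, 3, padding=1)               # stands in for module 3
        self.decode = nn.Sequential(nn.ConvTranspose2d(feat, feat, 3, padding=1), nn.PReLU(),
                                    nn.ConvTranspose2d(feat, channels, 3, padding=1))

    def forward(self, frames):                                   # frames: (B, 5, C, H, W) coarse high-bit group
        b, t, c, h, w = frames.shape
        aligned = self.align(frames.reshape(b, t * c, h, w)).reshape(b, t, c, h, w)
        feats = [enc(aligned[:, i]) for i, enc in enumerate(self.encoders)]
        fused = self.fuse(torch.cat(feats, dim=1))
        residual = self.decode(fused)                            # residual map of the intermediate frame
        return residual

frames = torch.randn(1, 5, 3, 32, 32)
residual = BitEnhancePipelineSketch()(frames)
enhanced_mid = frames[:, 2] + residual                           # add the residual to the intermediate frame
print(enhanced_mid.shape)                                        # torch.Size([1, 3, 32, 32])
```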
Wherein,
as shown in fig. 3, the global attention alignment module 1 includes:
(1.1) cascading the 5 consecutive video frames in the channel direction to obtain a signal of dimension TC × H × W, denoted $F$, wherein T represents the number of consecutive frames, C represents the number of channels per frame, and H, W represent the height and width of the input video frames;
(1.2) feeding $F$ into 3 separate 1 × 1 convolution kernels for linear transformation to obtain the linearly transformed signals $\theta(F)$, $\phi(F)$ and $g(F)$, and then rearranging each of them into a two-dimensional matrix of dimension TC × HW, denoted $\theta^{2}$, $\phi^{2}$ and $g^{2}$, the superscript 2 indicating that the feature map has been rearranged into 2 dimensions;
(1.3) transforming $\theta^{2}$, $\phi^{2}$ and $g^{2}$ by the following formulas:
$S = (\theta^{2})^{T} \otimes \phi^{2}$
$Y = S \otimes (g^{2})^{T}$
wherein $\otimes$ represents matrix multiplication and $(\cdot)^{T}$ represents matrix transposition; $S$ is the similarity matrix obtained from $\theta^{2}$ and $\phi^{2}$, and $Y$ represents the result after weighted summation of $g^{2}$, with dimension HW × TC; $Y$ is transposed and then rearranged into a matrix of dimension TC × H × W, denoted $Y'$;
(1.4) passing $Y'$ through a 1 × 1 convolution kernel, rearranging the result into dimension T × C × H × W, and then performing a residual connection with the input 5 consecutive video frames to obtain the 5 consecutive video frames after implicit alignment.
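A minimal PyTorch sketch of steps (1.1) to (1.4) follows. The softmax normalisation of the similarity matrix is an assumption (the text only specifies a similarity matrix followed by a weighted summation), as are the channel widths; the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class GlobalAttentionAlign(nn.Module):
    """Sketch of the global attention alignment module (non-local-style formulation assumed)."""
    def __init__(self, num_frames=5, channels=3):
        super().__init__()
        tc = num_frames * channels
        self.theta = nn.Conv2d(tc, tc, kernel_size=1)
        self.phi = nn.Conv2d(tc, tc, kernel_size=1)
        self.g = nn.Conv2d(tc, tc, kernel_size=1)
        self.out = nn.Conv2d(tc, tc, kernel_size=1)

    def forward(self, frames):                                    # frames: (B, T, C, H, W)
        b, t, c, h, w = frames.shape
        f = frames.reshape(b, t * c, h, w)                        # (1.1) concatenate along the channel direction
        theta = self.theta(f).reshape(b, t * c, h * w)            # (1.2) three 1x1 convs, flattened to TC x HW
        phi = self.phi(f).reshape(b, t * c, h * w)
        g = self.g(f).reshape(b, t * c, h * w)
        sim = torch.bmm(theta.transpose(1, 2), phi)               # (1.3) similarity matrix, HW x HW
        y = torch.bmm(sim.softmax(dim=-1), g.transpose(1, 2))     # weighted sum, HW x TC (softmax is assumed)
        y = y.transpose(1, 2).reshape(b, t * c, h, w)             # back to TC x H x W
        y = self.out(y)                                           # (1.4) 1x1 convolution
        return y.reshape(b, t, c, h, w) + frames                  # residual connection with the input frames

x = torch.randn(2, 5, 3, 32, 32)
print(GlobalAttentionAlign()(x).shape)                            # torch.Size([2, 5, 3, 32, 32])
```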
As shown in fig. 2, the encoder 2 includes 5 convolution branches corresponding to the 5 consecutive video frames; each convolution branch is formed by sequentially connecting 5 convolution layers in series, and each convolution layer includes a 3 × 3 convolution kernel and a PReLU activation function.
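A sketch of this encoder is given below, under the assumption of stride-1 convolutions and 64 feature channels (neither is stated in the text); each branch keeps its per-layer outputs because the decoder later reuses the 2nd and 4th layer features.

```python
import torch
import torch.nn as nn

class EncoderBranch(nn.Module):
    """One encoder branch: 5 serial 3x3 convolution layers, each followed by PReLU."""
    def __init__(self, in_ch=3, feat=64):
        super().__init__()
        chans = [in_ch] + [feat] * 5
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, padding=1), nn.PReLU()) for i in range(5)])

    def forward(self, x):
        outs = []
        for block in self.blocks:
            x = block(x)
            outs.append(x)           # outs[k] is the output of the (k+1)-th convolution layer
        return outs                   # outs[-1] is the spatial feature map of this frame

branches = nn.ModuleList([EncoderBranch() for _ in range(5)])     # one branch per video frame
frames = torch.randn(1, 5, 3, 32, 32)
per_frame = [branches[i](frames[:, i]) for i in range(5)]
print(len(per_frame), per_frame[0][-1].shape)                     # 5 branches, each ending in (1, 64, 32, 32)
```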
As shown in fig. 4, the target-guided semantic attention module 3 includes:
(3.1) receiving the 5 feature maps output by the encoder 2, each of dimension Ch × H × W, wherein Ch represents the number of channels of each feature map and H, W represent the height and width of the feature maps, and cascading the 5 feature maps in the channel direction into a feature map of dimension 5Ch × H × W;
(3.2) further fusing the spatio-temporal information through a 3 × 3 convolution kernel to obtain a new feature map $F_{f}$ of dimension Ch × H × W;
(3.3) rearranging the new feature map $F_{f}$ into a two-dimensional matrix, denoted $F_{f}^{2}$, of dimension Ch × HW; the intermediate one of the 5 feature maps received from the encoder 2, denoted $F_{t}$, is likewise rearranged into a two-dimensional matrix, denoted $F_{t}^{2}$, of dimension Ch × HW;
(3.4) performing the following operations on $F_{f}^{2}$ and $F_{t}^{2}$:
$S' = (F_{t}^{2})^{T} \otimes F_{f}^{2}$
$Y' = S' \otimes (F_{f}^{2})^{T}$
wherein $\otimes$ represents matrix multiplication and $(\cdot)^{T}$ represents matrix transposition; $S'$ is the similarity matrix obtained from $F_{t}^{2}$ and $F_{f}^{2}$, and $Y'$ represents the result after weighted summation of $F_{f}^{2}$, with dimension HW × Ch; $Y'$ is transposed and rearranged into dimension Ch × H × W, denoted $Y''$;
(3.5) performing a residual connection between $Y''$ and the fused feature map $F_{f}$, and then feeding the result into a 3 × 3 convolution kernel to extract features.
As shown in fig. 1, the decoder 4 is formed by sequentially connecting 5 transposed convolutional layers in series, each of which contains a transposed convolution kernel and a PReLU activation function, wherein the input of the second transposed convolutional layer is the sum of the output of the first transposed convolutional layer and the outputs of the fourth convolutional layer in each branch of the encoder 2, and the input of the fourth transposed convolutional layer is the sum of the output of the third transposed convolutional layer and the outputs of the second convolutional layer in each branch of the encoder 2.
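A sketch of this decoder, kept consistent with the encoder sketch above, is given below; stride-1 transposed convolutions, 64 channels, and summing the skip features over the 5 branches are assumptions.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the decoder: 5 transposed-convolution + PReLU layers with the two skip sums described above."""
    def __init__(self, feat=64, out_ch=3):
        super().__init__()
        out_chans = [feat, feat, feat, feat, out_ch]
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.ConvTranspose2d(feat, c, 3, padding=1), nn.PReLU()) for c in out_chans])

    def forward(self, fused, enc2_sum, enc4_sum):
        # fused: output of the target-guided semantic attention module
        # enc2_sum / enc4_sum: outputs of the 2nd / 4th convolution layer summed over the 5 encoder branches
        x1 = self.layers[0](fused)
        x2 = self.layers[1](x1 + enc4_sum)    # 2nd layer input: 1st layer output + 4th-layer encoder features
        x3 = self.layers[2](x2)
        x4 = self.layers[3](x3 + enc2_sum)    # 4th layer input: 3rd layer output + 2nd-layer encoder features
        return self.layers[4](x4)             # reconstructed residual map

fused = torch.randn(1, 64, 32, 32)
skip = torch.randn(1, 64, 32, 32)
print(Decoder()(fused, skip, skip).shape)     # torch.Size([1, 3, 32, 32])
```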
2) Randomly selecting a set number of original video sequence groups with high bit depth from an image enhancement database to construct the training data set; specifically, the original video sequence groups are quantized to the low bit depth, wherein each video sequence group contains 5 consecutive video frames, the zero-padding algorithm is applied to the low bit depth video sequence groups to expand them to the high bit depth, and the intermediate frame of the video sequence expanded by the zero-padding algorithm is subtracted from the intermediate frame of the original video sequence group to obtain the real residual map, thereby forming the training data set.
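The following NumPy sketch illustrates this construction for one 5-frame group; the 16-bit originals quantized to 4 bits match the experiments below, while the array layout and function name are illustrative.

```python
import numpy as np

def make_training_sample(group_16bit, low=4, high=16):
    """group_16bit: (5, H, W) uint16 original frames. Returns (network input, target residual)."""
    shift = high - low
    low_bit = (group_16bit >> shift).astype(np.uint16)        # quantize to the low bit depth
    coarse = (low_bit << shift).astype(np.uint16)             # zero-padding back to the high bit depth
    mid = len(group_16bit) // 2
    residual = group_16bit[mid].astype(np.int32) - coarse[mid].astype(np.int32)   # real residual map
    return coarse, residual

group = np.random.randint(0, 2 ** 16, size=(5, 64, 64), dtype=np.uint16)
coarse, residual = make_training_sample(group)
print(coarse.dtype, residual.min(), residual.max())
```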
3) Training the attention mechanism-based video bit enhancement model with the constructed training data set; during training, the input of the network is a video sequence obtained by applying the zero-padding algorithm to a low bit depth video sequence group in the training data set to expand it to the high bit depth, and the output is a residual map; the mean square error (MSE) between the residual map generated by the network and the real residual map is used as the loss function, and an Adam optimizer is used to optimize the attention mechanism-based video bit enhancement model.
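A minimal PyTorch training step under these choices (MSE loss, Adam) could look as follows; the learning rate and the model interface are assumptions.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, coarse_group, true_residual):
    """One optimization step: the model maps a coarse high-bit 5-frame group to a residual map,
    and the MSE against the real residual map is minimized."""
    optimizer.zero_grad()
    pred_residual = model(coarse_group)                        # (B, C, H, W)
    loss = nn.functional.mse_loss(pred_residual, true_residual)
    loss.backward()
    optimizer.step()
    return loss.item()

# usage with any nn.Module exposing the (B, 5, C, H, W) -> (B, C, H, W) interface (lr is illustrative):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = train_step(model, optimizer, coarse_group, true_residual)
```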
4) Selecting video sequence groups from the image enhancement database to form a test set, and testing the trained attention mechanism-based video bit enhancement model; specifically, the video sequence groups forming the test set are quantized to the low bit depth, expanded to the high bit depth with the zero-padding algorithm, and input into the trained attention-based video bit enhancement model to obtain the residual map of the intermediate frame predicted by the model; this residual map is added to the intermediate frame of the high bit depth video sequence expanded by the zero-padding algorithm to obtain the reconstructed high bit depth intermediate frame, whose quality is then evaluated. Two evaluation metrics are adopted: peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM).
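For reference, the two metrics can be computed, for example, with scikit-image; the 16-bit data range matches the experiments, and SSIM settings such as window size are left at the library defaults.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_frame(reconstructed, original, bit_depth=16):
    """PSNR / SSIM of a reconstructed high bit depth intermediate frame against the original."""
    data_range = 2 ** bit_depth - 1
    psnr = peak_signal_noise_ratio(original, reconstructed, data_range=data_range)
    ssim = structural_similarity(original, reconstructed, data_range=data_range)
    return psnr, ssim

orig = np.random.randint(0, 2 ** 16, size=(64, 64), dtype=np.uint16)
rec = np.clip(orig.astype(np.int32) + np.random.randint(-100, 100, orig.shape), 0, 65535).astype(np.uint16)
print(evaluate_frame(rec, orig))
```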
5) Applying a zero-padding algorithm to the video signal to be enhanced to obtain a high bit depth video signal, sequentially inputting the high bit depth video signal obtained by the zero-padding algorithm, in groups of 5 frames, to the tested attention mechanism-based video bit enhancement model, and adding each output result of the model to the intermediate frame of the correspondingly input video sequence group, thereby obtaining the enhanced intermediate frames in sequence.
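Putting the pieces together, inference over a full video could proceed as in the sketch below; a grayscale (T, H, W) layout, [0, 1] normalisation and the 4-to-16-bit setting are assumptions, and any model exposing the group-of-5 to residual interface described above can be plugged in.

```python
import numpy as np
import torch

def enhance_video(model, low_bit_video, low=4, high=16):
    """Enhance a low bit depth video: zero-pad each frame to the high bit depth, feed sliding groups
    of 5 coarse frames to the model, and add each predicted residual to the group's intermediate frame."""
    shift = high - low
    coarse = low_bit_video.astype(np.uint16) << shift             # zero-padding algorithm
    peak = 2 ** high - 1
    enhanced = []
    model.eval()
    with torch.no_grad():
        for t in range(2, len(coarse) - 2):                       # every frame with two neighbours on each side
            group = coarse[t - 2:t + 3].astype(np.float32) / peak
            x = torch.from_numpy(group)[None, :, None]            # (1, 5, 1, H, W)
            residual = model(x)[0, 0].cpu().numpy() * peak
            frame = np.clip(coarse[t].astype(np.float64) + residual, 0, peak)
            enhanced.append(frame.astype(np.uint16))
    return enhanced
```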
Example 1
The embodiment of the invention comprises the following steps:
101: from a 16-bit Sintel database [9] Randomly selecting 1000 original groups of video sequences, quantizing each group of 5 frame video frames to 4 bit depth, applying a zero-padding algorithm to the 4 bit depth video sequence to expand the 4 bit video sequence into a 16 bit depth video sequence, and calling the 16 bit depth video frame expanded by the zero-padding algorithm as a rough high bit depth video frame;
102: in the embodiment, a coder and a decoder are used as a network basic framework, a global attention alignment module is added at the head of an encoder, and the module can capture long-distance dependence by calculating the correlation between the video sequence frame and perform implicit Motion Estimation and Motion Compensation (ME & MC); and adding a semantic attention module of target guidance at the joint of the encoder and the decoder, fusing the spatial features extracted by the encoder by the semantic attention module, and then, taking the intermediate frame as a guidance feature to be related to the fused feature map filled with the space-time semantic features to obtain a semantic attention matrix. And carrying out matrix multiplication on the semantic attention matrix and the feature map filled with the space-time semantic features to obtain a transformed feature map. The module can help the network to focus more on the information related to the target frame at the semantic level, and the perception quality is improved.
103: the coarse high bit depth video sequence is input into the network, and a residual map is generated. For original high bit depth video sequence intermediate frame and coarse high bit depth videoAnd performing difference on the inter frames to obtain a real residual error image. Using Mean Square Error loss (MSE) as a loss function for network generated residual maps and true residual maps, using an Adam optimizer [11] And optimizing the video bit enhancement model based on the attention mechanism.
104: in the testing phase, 50 sets of video sequences with 16-bit depth different from the training set are randomly selected from the Sintel dataset, and from the Tears of Steel (TOS) dataset [9] 30 groups of video sequences with 16 bit depths are selected. The test set is quantized to 4 bit depth and then back quantized to a coarse high bit depth video sequence using a zero-padding algorithm. And loading the trained model parameters to a video bit enhancement model based on an attention mechanism, then transmitting the rough high bit depth video sequence to the model to generate a residual map, and adding the residual map and the intermediate frame of the rough high bit depth video to obtain a reconstructed high bit depth map. Peak signal-to-noise Ratio (PSNR) and Structural Similarity (SSIM) are used [12] These two objective evaluation criteria evaluate the test results to verify the effectiveness of the invention.
In summary, the embodiment of the present invention designs a video bit depth enhancement method based on an attention mechanism through steps 101 to 104. A global attention alignment module is introduced into a classical encoder-decoder network, and a target-guided semantic attention module is added. The global attention module plays the same role as motion estimation and motion compensation: it captures long-range dependencies to acquire, from the video sequence, auxiliary information useful for reconstructing the target frame. This avoids the two-stage processing of motion estimation and motion compensation and keeps computational complexity and running time low. The target-guided semantic attention module, guided by the feature map of the target frame at the semantic level, generates a spatio-temporal feature map highly related to the feature map of the target frame. The invention thus realizes end-to-end video bit depth enhancement in a single stage, avoids the high computational complexity of motion compensation, and achieves better reconstruction quality.
Example 2
The following example 1 protocol was evaluated for efficacy in combination with specific experimental data, as described in detail below:
301: data composition
The test set consists of 50 groups of 16-bit depth consecutive video frames randomly drawn from the Sintel database, which do not overlap with the training set, and 30 groups of 16-bit depth consecutive video frames randomly drawn from the TOS database; each group contains 5 frames.
302: evaluation criterion
The invention mainly adopts two evaluation indexes to evaluate the quality of the reconstructed high bit depth video frame:
Peak signal-to-noise ratio (PSNR) is a commonly used objective metric for evaluating image quality.
The structural similarity index (SSIM) [12] measures the structural similarity of two images. It compares two images from three aspects: luminance, contrast and structure. This metric is more consistent with the characteristics of human vision and can reflect the subjective quality of an image. Its value ranges from 0 to 1; the higher the score, the more similar the reconstructed high-bit image is to the original high-bit image and the better the reconstruction quality.
303: Comparison algorithms
The embodiment of the invention is compared with 10 bit depth enhancement algorithms, which comprise 8 traditional image bit enhancement methods, 1 image bit enhancement method based on a neural network and 1 video bit enhancement method based on the neural network.
The 8 conventional image bit enhancement methods include: 1) the Zero Padding algorithm (ZP); 2) the Ideal Gain Multiplication algorithm (MIG); 3) the Bit Replication algorithm (BR) [2]; 4) the Minimum Risk based Classification algorithm (MRC) [10]; 5) the Contour Region Reconstruction algorithm (CRR) [3]; 6) the Content Adaptive image bit depth enhancement algorithm (CA) [4]; 7) the Maximum a Posteriori Estimation of AC Signal algorithm (ACDC) [14]; and 8) the adaptive de-quantization algorithm based on intensity potential (IPAD) [5].
The neural-network-based image bit enhancement method is the convolutional-neural-network-based image bit depth enhancement algorithm (BE-CNN) [6].
The neural-network-based video bit enhancement method is the spatiotemporal symmetric convolutional neural network based video bit depth enhancement algorithm (VBDE) [13].
Table 1 lists the test results of this method on the Sintel test set and the TOS test set, compared with the ten other methods. On the Sintel test set, the PSNR of this method reaches 41.5293 dB and the SSIM reaches 0.9672, which is clearly higher than the performance of the other methods. Compared with the Sintel dataset, the TOS dataset differs considerably in content and contains more numerous and more complex scenes. On the TOS test set, the PSNR of this method reaches 39.3155 dB and the SSIM reaches 0.9572, showing good generality. These tests fully demonstrate the effectiveness of the method.
TABLE 1: PSNR and SSIM comparison of the proposed method with the ten reference methods on the Sintel and TOS test sets.
Reference documents
[1] Wan P, Au O C, Tang K, et al. From 2D extrapolation to 1D interpolation: Content adaptive image bit-depth expansion [C]// 2012 IEEE International Conference on Multimedia and Expo. IEEE, 2012: 170-175.
[2] Ulichney R A, Cheung S. Pixel bit-depth increase by bit replication [C]// Color Imaging: Device-Independent Color, Color Hardcopy, and Graphic Arts III. International Society for Optics and Photonics, 1998, 3300: 232-241.
[3] Cheng C H, Au O C, Liu C H, et al. Bit-depth expansion by contour region reconstruction [C]// 2009 IEEE International Symposium on Circuits and Systems. IEEE, 2009: 944-947.
[4] Wan P, Au O C, Tang K, et al. From 2D extrapolation to 1D interpolation: Content adaptive image bit-depth expansion [C]// 2012 IEEE International Conference on Multimedia and Expo. IEEE, 2012: 170-175.
[5] Liu J, Zhai G, Liu A, et al. IPAD: Intensity potential for adaptive de-quantization [J]. IEEE Transactions on Image Processing, 2018, 27(10): 4860-4872.
[6] Liu J, Sun W, Liu Y. Bit-depth enhancement via convolutional neural network [C]// International Forum on Digital TV and Wireless Multimedia Communications. Springer, Singapore, 2017: 255-264.
[7] Liu J, Sun W, Su Y, et al. BE-CALF: Bit-depth enhancement by concatenating all level features of DNN [J]. IEEE Transactions on Image Processing, 2019, 28(10): 4926-4940.
[8] Byun J, Shim K, Kim C. BitNet: Learning-based bit-depth expansion [C]// Asian Conference on Computer Vision. Springer, Cham, 2018: 67-82.
[9] Xiph.Org Foundation, https://www.xiph.org/, 2016.
[10] Mittal G, Jakhetiya V, Jaiswal S P, et al. Bit-depth expansion using minimum risk based classification [C]// 2012 Visual Communications and Image Processing. IEEE, 2012: 1-5.
[11] Kingma D P, Ba J. Adam: A method for stochastic optimization [J]. arXiv preprint arXiv:1412.6980, 2014.
[12] Wang Z, Bovik A C, Sheikh H R, et al. Image quality assessment: from error visibility to structural similarity [J]. IEEE Transactions on Image Processing, 2004, 13(4): 600-612.
[13] Liu J, Liu P, Su Y, et al. Spatiotemporal symmetric convolutional neural network for video bit-depth enhancement [J]. IEEE Transactions on Multimedia, 2019, 21(9): 2397-2406.
[14] Wan P, Cheung G, Florencio D, et al. Image bit-depth enhancement via maximum a posteriori estimation of AC signal [J]. IEEE Transactions on Image Processing, 2016, 25(6): 2896-2909.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-mentioned serial numbers of the embodiments of the present invention are only for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A video bit enhancement method based on an attention mechanism is characterized by comprising the following steps:
1) Firstly, the bit depth of the video signal to be enhanced is called the low bit depth, and the bit depth of the enhanced video signal is called the high bit depth; the difference between the original high bit depth image and the high bit depth image obtained by applying a zero-padding algorithm to the low bit depth image is called the residual map; a video bit enhancement model based on an attention mechanism is established;
the video bit enhancement model based on the attention mechanism comprises the following components connected in sequence: a global attention alignment module (1), an encoder (2), a target-guided semantic attention module (3) and a decoder (4), wherein,
the input end of the global attention alignment module (1) receives 5 continuous video frames, is used for capturing long-distance dependence between frames and in frames, and outputs the 5 continuous video frames after implicit alignment;
the encoder (2) receives 5 continuous video frames after implicit alignment, simultaneously extracts spatial features of each frame respectively, and outputs a feature map containing intra-frame spatial information of the corresponding frame respectively;
the target-guided semantic attention module (3) receives 5 feature maps output by the encoder (2), performs space-time feature fusion to obtain a feature map containing space-time feature information, acquires feature information similar to the feature map of the intermediate frame output by the encoder (2) from the feature map and outputs the feature information to the decoder (4);
the decoder (4) gradually reconstructs the received characteristic information into a residual error map;
the global attention alignment module (1) comprises:
(1.1) cascading the 5 consecutive video frames in the channel direction to obtain a signal of dimension TC × H × W, denoted $F$, wherein T represents the number of consecutive frames, C represents the number of channels per frame, and H, W represent the height and width of the input video frames;
(1.2) feeding $F$ into 3 separate 1 × 1 convolution kernels for linear transformation to obtain the linearly transformed signals $\theta(F)$, $\phi(F)$ and $g(F)$, and then rearranging each of them into a two-dimensional matrix of dimension TC × HW, denoted $\theta^{2}$, $\phi^{2}$ and $g^{2}$, the superscript 2 indicating that the matrix is two-dimensional;
(1.3) transforming $\theta^{2}$, $\phi^{2}$ and $g^{2}$ by the following formulas:
$S = (\theta^{2})^{T} \otimes \phi^{2}$
$Y = S \otimes (g^{2})^{T}$
wherein $\otimes$ represents matrix multiplication and $(\cdot)^{T}$ represents matrix transposition; $S$ is the similarity matrix obtained from $\theta^{2}$ and $\phi^{2}$, and $Y$ represents the result after weighted summation of $g^{2}$, with dimension HW × TC; $Y$ is transposed and then rearranged into a matrix of dimension TC × H × W, denoted $Y'$;
(1.4) passing $Y'$ through a 1 × 1 convolution kernel, rearranging the result into dimension T × C × H × W, and then performing a residual connection with the input 5 consecutive video frames to obtain the 5 consecutive video frames after implicit alignment;
the target-guided semantic attention module (3) comprises:
(3.1) receiving the 5 feature maps output by the encoder (2), each of dimension Ch × H × W, wherein Ch represents the number of channels of each feature map and H, W represent the height and width of the feature maps, and cascading the 5 feature maps in the channel direction into a feature map of dimension 5Ch × H × W;
(3.2) further fusing the spatio-temporal information through a 3 × 3 convolution kernel to obtain a new feature map $F_{f}$ of dimension Ch × H × W;
(3.3) rearranging the new feature map $F_{f}$ into a two-dimensional matrix, denoted $F_{f}^{2}$, of dimension Ch × HW; the intermediate one of the 5 feature maps received from the encoder (2), denoted $F_{t}$, is likewise rearranged into a two-dimensional matrix, denoted $F_{t}^{2}$, of dimension Ch × HW;
(3.4) performing the following operations on $F_{f}^{2}$ and $F_{t}^{2}$:
$S' = (F_{t}^{2})^{T} \otimes F_{f}^{2}$
$Y' = S' \otimes (F_{f}^{2})^{T}$
wherein $\otimes$ represents matrix multiplication and $(\cdot)^{T}$ represents matrix transposition; $S'$ is the similarity matrix obtained from $F_{t}^{2}$ and $F_{f}^{2}$, and $Y'$ represents the result after weighted summation of $F_{f}^{2}$, with dimension HW × Ch; $Y'$ is transposed and rearranged into dimension Ch × H × W, denoted $Y''$;
(3.5) performing a residual connection between $Y''$ and the fused feature map $F_{f}$, and then feeding the result into a 3 × 3 convolution kernel to extract features;
2) Randomly selecting a set number of original video sequence groups with high bit depth from an image enhancement database to construct a training data set;
3) Training a video bit enhancement model based on an attention mechanism by using the constructed training data set;
4) Selecting a video sequence group from an image enhancement database to form a test set, and testing the trained video bit enhancement model based on the attention mechanism;
5) Applying a zero-padding algorithm to the video signal to be enhanced to obtain a high bit depth video signal, sequentially inputting the high bit depth video signal obtained by the zero-padding algorithm, in groups of 5 frames, to the tested attention mechanism-based video bit enhancement model, and adding each output result of the model to the intermediate frame of the correspondingly input video sequence group, thereby obtaining the enhanced intermediate frames in sequence.
2. The method according to claim 1, wherein the encoder (2) comprises 5 convolution branches corresponding to 5 consecutive video frames, each convolution branch comprising 5 convolutional layers connected in series, each convolutional layer comprising a 3 x 3 convolutional kernel and a PReLU activation function connected in series.
3. The method according to claim 1, wherein the decoder (4) is formed by sequentially concatenating 5 transposed convolutional layers, each of which comprises a transposed convolutional kernel and a PReLU activation function, wherein the input of the second transposed convolutional layer is the sum of the output of the first transposed convolutional layer and the output of the fourth convolutional layer in each branch of the encoder (2), and the input of the fourth transposed convolutional layer is the sum of the output of the third convolutional layer and the output of the second convolutional layer in each branch of the encoder (2).
4. The method as claimed in claim 1, wherein the step 2) comprises quantizing the original video sequence groups to the low bit depth, wherein each video sequence group comprises 5 consecutive video frames, applying the zero-padding algorithm to the low bit depth video sequence groups to expand them to the high bit depth, and subtracting the intermediate frame of the video sequence expanded by the zero-padding algorithm from the intermediate frame of the original video sequence group to obtain the real residual map, thereby forming the training data set.
5. The method according to claim 1, wherein in the training of step 3), the input of the network is a video sequence in the training data set that is extended to the high bit depth by applying the zero-padding algorithm to the low bit depth video sequence group, and the output is a residual map; the mean square error between the residual map generated by the network and the real residual map is adopted as the loss function, and an Adam optimizer is used to optimize the attention mechanism-based video bit enhancement model.
6. The method as claimed in claim 1, wherein the step 4) includes quantizing the group of video sequences constituting the test set to a low bit depth, expanding the video sequences to a high bit depth by applying a zero-padding algorithm, inputting the video sequences with the high bit depth into a trained video bit enhancement model based on the attention mechanism to obtain a residual map of an intermediate frame predicted by the model, adding the residual map to the intermediate frame of the video sequences with the bit depth expanded by the zero-padding algorithm to obtain a reconstructed intermediate frame with the high bit depth, and evaluating the quality of the reconstructed intermediate frame with the high bit depth by using an evaluation method.
7. The method of claim 6, wherein the evaluation method adopts peak signal-to-noise ratio and the structural similarity index.
CN202011166047.7A 2020-10-27 2020-10-27 Attention mechanism-based video bit enhancement method Active CN112381866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011166047.7A CN112381866B (en) 2020-10-27 2020-10-27 Attention mechanism-based video bit enhancement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011166047.7A CN112381866B (en) 2020-10-27 2020-10-27 Attention mechanism-based video bit enhancement method

Publications (2)

Publication Number Publication Date
CN112381866A CN112381866A (en) 2021-02-19
CN112381866B true CN112381866B (en) 2022-12-13

Family

ID=74576777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011166047.7A Active CN112381866B (en) 2020-10-27 2020-10-27 Attention mechanism-based video bit enhancement method

Country Status (1)

Country Link
CN (1) CN112381866B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113066022B (en) * 2021-03-17 2022-08-16 天津大学 Video bit enhancement method based on efficient space-time information fusion
CN113313682B (en) * 2021-05-28 2023-03-21 西安电子科技大学 No-reference video quality evaluation method based on space-time multi-scale analysis
CN113507607B (en) * 2021-06-11 2023-05-26 电子科技大学 Compressed video multi-frame quality enhancement method without motion compensation
CN113450280A (en) * 2021-07-07 2021-09-28 电子科技大学 Method for enhancing quality of compressed video by fusing space-time information from coarse to fine
CN114582029B (en) * 2022-05-06 2022-08-02 山东大学 Non-professional dance motion sequence enhancement method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008938A (en) * 2019-11-25 2020-04-14 天津大学 Real-time multi-frame bit enhancement method based on content and continuity guidance
CN111031315A (en) * 2019-11-18 2020-04-17 复旦大学 Compressed video quality enhancement method based on attention mechanism and time dependency

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111031315A (en) * 2019-11-18 2020-04-17 复旦大学 Compressed video quality enhancement method based on attention mechanism and time dependency
CN111008938A (en) * 2019-11-25 2020-04-14 天津大学 Real-time multi-frame bit enhancement method based on content and continuity guidance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Spatiotemporal symmetric convolutional neural network for video bit-depth enhancement; Jing Liu et al.; IEEE Transactions on Multimedia; 2019-09-30; Vol. 21, No. 9; full text *

Also Published As

Publication number Publication date
CN112381866A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN112381866B (en) Attention mechanism-based video bit enhancement method
Liang et al. Vrt: A video restoration transformer
CN111028150B (en) Rapid space-time residual attention video super-resolution reconstruction method
CN109309834B (en) Video compression method based on convolutional neural network and HEVC compression domain significant information
CN106709875B (en) Compressed low-resolution image restoration method based on joint depth network
CN111008938B (en) Real-time multi-frame bit enhancement method based on content and continuity guidance
CN111260560B (en) Multi-frame video super-resolution method fused with attention mechanism
CN113066022B (en) Video bit enhancement method based on efficient space-time information fusion
CN110751597B (en) Video super-resolution method based on coding damage repair
CN111784578A (en) Image processing method, image processing device, model training method, model training device, image processing equipment and storage medium
CN110852964A (en) Image bit enhancement method based on deep learning
US20110317916A1 (en) Method and system for spatial-temporal denoising and demosaicking for noisy color filter array videos
WO2023000179A1 (en) Video super-resolution network, and video super-resolution, encoding and decoding processing method and device
CN110889895A (en) Face video super-resolution reconstruction method fusing single-frame reconstruction network
CN110796622A (en) Image bit enhancement method based on multi-layer characteristics of series neural network
Agustsson et al. Extreme learned image compression with gans
CN114757828A (en) Transformer-based video space-time super-resolution method
CN113850718A (en) Video synchronization space-time super-resolution method based on inter-frame feature alignment
CN114066730B (en) Video frame interpolation method based on unsupervised dual learning
CN113592746A (en) Method for enhancing quality of compressed video by fusing space-time information from coarse to fine
CN112862675A (en) Video enhancement method and system for space-time super-resolution
Liu et al. Gated context model with embedded priors for deep image compression
Li et al. Extreme underwater image compression using physical priors
CN116012272A (en) Compressed video quality enhancement method based on reconstructed flow field
Zhang et al. SPQE: Structure-and-Perception-Based Quality Evaluation for Image Super-Resolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant