CN115209119B - Video automatic coloring method based on deep neural network - Google Patents

Video automatic coloring method based on deep neural network

Info

Publication number
CN115209119B
CN115209119B (application CN202210678884.0A)
Authority
CN
China
Prior art keywords
coloring
frame
network
preliminary
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210678884.0A
Other languages
Chinese (zh)
Other versions
CN115209119A (en)
Inventor
晋建秀
杨镒彰
郭锴凌
徐向民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210678884.0A priority Critical patent/CN115209119B/en
Publication of CN115209119A publication Critical patent/CN115209119A/en
Application granted granted Critical
Publication of CN115209119B publication Critical patent/CN115209119B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00 Details of colour television systems
    • H04N9/64 Circuits for processing colour signals
    • H04N9/648 Video amplifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00 Details of colour television systems
    • H04N9/64 Circuits for processing colour signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00 Details of colour television systems
    • H04N9/64 Circuits for processing colour signals
    • H04N9/73 Colour balance circuits, e.g. white balance circuits or colour temperature control
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video automatic coloring method based on a deep neural network, which comprises the following steps: acquiring an original color video data set and converting the color video into black-and-white video to obtain a black-and-white video frame sequence and a color video frame sequence for network training; calculating the forward and backward optical flows between adjacent frames in the color video frame sequence and the black-and-white video frame sequence; selecting three adjacent frames from the data set and inputting them into a feature extraction network to extract feature information maps; calculating the adjacent similar region of each pixel in the target image; inputting the feature information maps into a preliminary coloring network to obtain several preliminary coloring maps for each frame; inputting the three frames of preliminary coloring maps into an optical flow alignment module and constraining the preliminary coloring network with a temporal loss function; and inputting the outputs of the preliminary coloring network and of the optical flow alignment module into the enhanced coloring network to obtain the final output, whose error with respect to the ground truth is computed with the L1 norm.

Description

Video automatic coloring method based on deep neural network
Technical Field
The invention relates to the technical field of video processing, in particular to an automatic video coloring method based on a deep neural network.
Background
Most of the video we watch today is in color; color is an indispensable element from shooting through processing to viewing. In the past, however, shooting technology could not produce color video, yet many excellent film works were still created. These works are, without exception, black-and-white videos, which can feel unnatural to viewers accustomed to color. To preserve these classic film works and make them more accessible to modern audiences, black-and-white video colorization techniques have been proposed. In image coloring methods, research focuses on the spatial correlation within a frame: the mapping from black-and-white to color is learned from the spatial information inside the frame to be colored. Video coloring, however, must also take into account the temporal correlation between frames in order to generate coherent colors across frames. Thanks to the progress of deep learning in computer vision, methods that automatically color video with deep neural networks are now widely used, and they generally achieve better results than traditional coloring methods.
Existing video coloring algorithms can be broadly divided into reference-frame-based coloring methods and direct coloring methods without reference frames. Reference-frame-based methods first obtain one or more high-quality color reference frames, usually by manual coloring, and then propagate the color information to the other frames of the video sequence through a similarity matrix. Reference-free methods color the video sequence directly, without any reference frame.
Existing video coloring methods face the following problems. 1) The most important issue is temporal incoherence of the video sequence: the same object may receive different pixel values in different frames, producing flicker or artifacts. A common remedy is optical flow alignment, but most such approaches only consider unidirectional optical flow. 2) Without a colored reference frame, coloring is a one-to-many problem: a single input frame can correspond to several plausible colorizations, so the result is uncertain. In video coloring, the same object in two frames that are far apart may therefore end up with clearly different colors.
Deep neural networks have made great progress on video coloring, and most existing video coloring methods are built on them. In "Learning Blind Video Temporal Consistency" (ECCV 2018), Wei-Sheng Lai et al. first color each frame with an image coloring method and then improve the temporal consistency of the video using forward optical flow between adjacent frames and between frames farther apart. Chenyang Lei et al., in "Fully Automatic Video Colorization with Self-Regularization and Diversity" (2019), propose a diversity-aware loss function that drives the coloring results toward consistency and use forward optical flow between adjacent frames to improve temporal consistency. However, these methods only exploit forward optical flow and do not use backward optical flow information.
Disclosure of Invention
In view of the above problems, the invention provides a video automatic coloring method based on a deep neural network. An optical flow alignment module uses a bidirectional alignment strategy that combines forward and backward optical flow to address the temporal incoherence of colored video sequences, and a diversity loss function is used to address the uncertainty of video coloring results.
The invention is realized at least by one of the following technical schemes.
A video automatic coloring method based on a deep neural network comprises the following steps:
S1, acquiring an original color video data set and converting the color video into black-and-white video to obtain a black-and-white video frame sequence and a color video frame sequence for network training;
S2, respectively calculating the forward and backward optical flows between adjacent frames in the color video frame sequence and the black-and-white video frame sequence;
S3, selecting three adjacent frames from the black-and-white data set and inputting them into a feature extraction network to extract feature information maps;
S4, calculating the adjacent similar region of each pixel in the target image, and constraining the preliminary coloring network with a consistency loss function within the adjacent similar regions;
S5, inputting the feature information maps into a preliminary coloring network to obtain several preliminary coloring maps for each frame, and constraining the preliminary coloring network with a diversity loss function on the preliminary coloring maps;
S6, inputting the three frames of preliminary coloring maps into an optical flow alignment module while constraining the preliminary coloring network with a temporal loss function;
S7, inputting the output of the preliminary coloring network and the output of the optical flow alignment module into the enhanced coloring network to obtain the final output, and computing the error between the final output and the ground truth with the L1 norm.
Further, the adjacent similar region in step S4 refers to a neighborhood in a bilateral space computed in the ground-truth picture from red, green, blue, lateral-distance and longitudinal-distance information; it indicates the region of nearby pixels whose color information is similar to that of the given pixel.
Further, the bilateral space first calculates the bilateral distance between the target pixel point and the other pixel points:
[equation: bilateral distance between the target pixel and another pixel, combining the color differences r, g, b with the λ-weighted spatial offsets w and h]
where r, g and b are the differences between the pixel values of the target pixel and the other pixel in the R, G and B channels of the RGB color space, w and h are the lateral and longitudinal distances between the two pixels, and λ is a weight balancing the spatial distance against the color distance.
Further, the consistency loss function is as follows:
$$\ell_b=\frac{1}{n}\sum_{t=1}^{n}\sum_{p}\sum_{q\in\Omega_{Y_t}(p)}\left\|f_p(X_t)-f_q(X_t)\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, p and q pixel positions, Y_t the ground-truth image of the t-th frame, Ω_{Y_t}(p) the adjacent similar region of pixel p obtained from Y_t, X_t the input original image of the t-th frame, f_p(X_t) the pixel value at position p of the coloring map obtained by feeding X_t into the preliminary coloring network f, and f_q(X_t) the pixel value at position q of that coloring map.
Further, in step S5 an arbitrary number of preliminary coloring maps is obtained for each frame, and the picture with the highest average pixel saturation among the preliminary coloring pictures of each frame is selected as the input of the optical flow alignment module.
Further, the diversity loss function described in step S5 is:
$$\ell_d=\frac{1}{n}\sum_{t=1}^{n}\sum_{i=1}^{d}\alpha_i\left\|\Phi\big(C_t(i)\big)-\Phi\big(Y_t\big)\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, C_t(i) the i-th preliminary coloring picture output for the t-th frame by the preliminary coloring network, Y_t the ground-truth image of the t-th frame, α_i a decreasing sequence of weights, Φ the feature maps extracted by the pre-trained VGG-19 network, and d the number of preliminary coloring pictures generated.
Further, the temporal loss function in step S6 is a forward-backward temporal consistency loss function:

$$\ell_t=\frac{1}{n}\sum_{t=1}^{n}\Big(\big\|M_{t-1\to t}\odot\big(f(X_t)-\omega_{t-1\to t}(f(X_{t-1}))\big)\big\|_1+\big\|M_{t+1\to t}\odot\big(f(X_t)-\omega_{t+1\to t}(f(X_{t+1}))\big)\big\|_1\Big)$$

where t denotes the current frame, n the number of frames in the video sequence, X_t the input original image of the t-th frame, f(X_t) the coloring map obtained by feeding X_t into the preliminary coloring network f, ω_{t-1→t} the warp alignment operation using the forward optical flow from frame t-1 to frame t, ω_{t+1→t} the warp alignment operation using the backward optical flow from frame t+1 to frame t, M_{t-1→t} and M_{t+1→t} binary masks consisting of 0s and 1s, and ⊙ the element-wise product.
Further, the output of the optical flow alignment module in step S6 includes the preliminary coloring map C_t, the warped maps ω_{t-1→t}(C_{t-1}) and ω_{t+1→t}(C_{t+1}) obtained by optical flow warping, the confidence map A_{t-1→t} between ω_{t-1→t}(C_{t-1}) and C_t, and the confidence map B_{t+1→t} between ω_{t+1→t}(C_{t+1}) and C_t. The confidence maps are calculated as follows:

[equation: confidence map A_{t-1→t}, computed from the difference between ω_{t-1→t}(C_{t-1}) and C_t and the parameter β]

[equation: confidence map B_{t+1→t}, computed from the difference between ω_{t+1→t}(C_{t+1}) and C_t and the parameter β]

where β is a parameter that adjusts the values of the confidence maps; C_t is the preliminary coloring map of the t-th frame; ω_{t-1→t}(C_{t-1}) is the result of applying the warp alignment operation to C_{t-1} with the forward optical flow from frame t-1 to frame t, and ω_{t+1→t}(C_{t+1}) is the result of applying the warp alignment operation to C_{t+1} with the backward optical flow from frame t+1 to frame t.
Further, the L1 norm loss function in step S7 is:

$$\ell_1=\frac{1}{n}\sum_{t=1}^{n}\left\|g\big(f(X_{t-1}),f(X_t),f(X_{t+1})\big)-Y_t\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, g(f(X_{t-1}), f(X_t), f(X_{t+1})) the output O_t of the enhanced coloring network g, and Y_t the ground truth of the t-th frame.
Further, the preliminary coloring network and the enhanced coloring network use the same network structure: the input data first passes through a convolution layer, then through several consecutive downsampling blocks, then through a convolution layer, and finally through several consecutive upsampling blocks, where the output of the convolution layers of each downsampling block is concatenated, in the channel dimension, with the input of the upsampling block convolution layers at the same scale.
Compared with the prior art, the invention has the following beneficial effects:
By computing adjacent similar regions and keeping the pixels within them the same or similar, a spatially smooth coloring picture is obtained; by generating several preliminary coloring maps and adding a diversity loss function, the non-uniqueness of video coloring results is addressed and the results tend toward consistency. Through the optical flow relations among the three frames, the temporal consistency of the coloring results is strengthened and artifacts and jitter are reduced: by warping the first frame to the second frame with the forward optical flow and the third frame to the second frame with the backward optical flow, the colors of the colored video are kept coherent and uniform.
Drawings
FIG. 1 is a flow chart of an embodiment of a video automatic coloring method based on a deep neural network;
FIG. 2 is a schematic diagram of a video automatic coloring method based on a deep neural network according to an embodiment;
FIG. 3 is a schematic diagram of the input and output architecture of a preliminary coloring network;
FIG. 4 is a schematic diagram of an optical flow alignment module;
FIG. 5 is a schematic diagram of the structure of a preliminary coloring network and an enhanced coloring network.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not to be construed as limiting the present invention.
Example 1
Step S1, an original color video data set is obtained, color video is converted into black-and-white video, and a black-and-white video frame sequence and a color video frame sequence for network training are obtained.
Specifically, this embodiment uses the DAVIS data set and the VIDEVO data set to train and test the preliminary coloring network and the enhanced coloring network; in addition, the ImageNet data set is used to train the preliminary coloring network. Each frame of every video in the video data sets is converted from the RGB (red (R), green (G), blue (B)) color space to the YCbCr color space (Y is the luminance component, Cb the blue chrominance component and Cr the red chrominance component), and only the Y channel of the YCbCr image is kept and used as the original input of the model.
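For illustration, the conversion of step S1 could be implemented as in the following sketch; the ITU-R BT.601 luma coefficients and the NumPy array layout are assumptions, since the embodiment only specifies the RGB-to-YCbCr conversion and the retained Y channel.

```python
import numpy as np

def rgb_to_y(frame_rgb: np.ndarray) -> np.ndarray:
    """Return the Y (luma) channel of an RGB frame (H, W, 3) with values in [0, 1].

    The ITU-R BT.601 weights commonly used for the YCbCr conversion are assumed
    here; the embodiment does not state the exact coefficients.
    """
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b  # (H, W) luminance map

# Example: build the black-and-white training input from a decoded color clip.
color_clip = np.random.rand(8, 256, 256, 3)            # stand-in for decoded frames
bw_clip = np.stack([rgb_to_y(f) for f in color_clip])  # (T, H, W) model input
```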
Step S2, the forward and backward optical flows between adjacent frames are calculated for the color video frame sequence and the black-and-white video frame sequence respectively. Specifically, a pre-trained optical flow model is used; it predicts the pixel motion between any two adjacent video frames and outputs the result as a feature map with two channels, which represent the lateral and longitudinal displacements in the two-dimensional plane, i.e. the optical flow between the two adjacent frames. The color and black-and-white video frame sequences of the data set are fed to a pre-trained PWC-Net optical flow model in the forward and the reverse direction to obtain the forward and backward optical flow between each pair of adjacent frames.
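A possible way to obtain the bidirectional flows is sketched below, using the pre-trained RAFT model shipped with torchvision purely as a stand-in for the PWC-Net model named above; the tensor layout and the [0, 1] value range are assumptions.

```python
import torch
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights

weights = Raft_Small_Weights.DEFAULT          # RAFT used here as a stand-in for PWC-Net
flow_net = raft_small(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def bidirectional_flow(frame_a: torch.Tensor, frame_b: torch.Tensor):
    """frame_a, frame_b: (N, 3, H, W) in [0, 1], H and W divisible by 8.

    Returns (forward_flow, backward_flow), each (N, 2, H, W); the two channels
    hold the lateral and longitudinal pixel displacements between the frames.
    """
    a, b = preprocess(frame_a, frame_b)
    forward = flow_net(a, b)[-1]     # most refined flow estimate, frame_a -> frame_b
    backward = flow_net(b, a)[-1]    # flow estimate, frame_b -> frame_a
    return forward, backward
```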
Step S3, three adjacent frames are selected from the black-and-white data set as input data and fed to the feature extraction network to extract feature information maps, which serve as the input of the preliminary coloring network.
Specifically, three adjacent frames X_{t-1}, X_t, X_{t+1} are randomly selected from the data set as an input group. The feature information maps in step S3 are extracted with a VGG-19 network pre-trained for the object classification task on the ImageNet data set, which captures both low-level and high-level feature information of the picture. The VGG-19 network has five consecutive convolution blocks, each containing several convolution layers followed by a max-pooling layer, and ends with three fully connected layers. Concretely, the picture is fed into the pre-trained VGG-19 network, the output of the second convolution layer of each convolution block is extracted, giving five feature maps in total, and each of them is upsampled bilinearly to the resolution of the input picture, yielding the output feature map.
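The feature extraction of step S3 might look like the sketch below; which VGG-19 activations are tapped (here the second convolution of each of the five blocks) and the use of torchvision's ImageNet weights are assumptions drawn from the description above, and the single-channel Y input is simply repeated to three channels.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Indices assumed to correspond to conv1_2, conv2_2, conv3_2, conv4_2, conv5_2.
TAP_LAYERS = (2, 7, 12, 21, 30)

class VGGFeatureExtractor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)   # frozen, pre-trained feature extractor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (N, 3, H, W). Returns the five tapped feature maps, each upsampled
        bilinearly to the input resolution and concatenated along channels."""
        h, w = x.shape[-2:]
        maps = []
        for idx, layer in enumerate(self.features):
            x = layer(x)
            if idx in TAP_LAYERS:
                maps.append(F.interpolate(x, size=(h, w), mode="bilinear",
                                          align_corners=False))
            if idx == TAP_LAYERS[-1]:
                break
        return torch.cat(maps, dim=1)  # feature information map for the coloring network

# Usage: feats = VGGFeatureExtractor()(y_frame.repeat(1, 3, 1, 1))
```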
Step S4, the adjacent similar region of each pixel in the target image is calculated, and the preliminary coloring network is constrained with a consistency loss function within the adjacent similar regions.
Specifically, the adjacent similar region in step S4 refers to a neighborhood in a bilateral space computed in the ground-truth picture from red, green, blue, lateral-distance and longitudinal-distance information; it indicates the region of nearby pixels whose color information is similar to that of the given pixel. The adjacent similar region can be computed in various ways, including but not limited to the method used in this example.
In this example, the bilateral distance between the target pixel and each of the other pixels is first calculated:

[equation: bilateral distance between the target pixel and another pixel, combining the color differences r, g, b with the λ-weighted spatial offsets w and h]

where r, g and b are the differences between the pixel values of the target pixel and the other pixel in the R, G and B channels of the RGB color space, w and h are the lateral and longitudinal distances between the two pixels, and λ is a weight balancing the spatial distance against the color distance; in this example λ = 200.
After the bilateral distances between the target pixel and the other pixels have been obtained, the J pixels with the smallest bilateral distance to the target pixel are taken to form its adjacent similar region; in this example J = 10.
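A sketch of how the adjacent similar region of a single target pixel could be assembled is given below; the exact bilateral-distance formula is not reproduced in the text, so the squared color differences combined with λ-weighted spatial offsets used here are only an assumption.

```python
import numpy as np

def adjacent_similar_region(truth_rgb: np.ndarray, py: int, px: int,
                            lam: float = 200.0, j: int = 10) -> np.ndarray:
    """Return the (row, col) coordinates of the J pixels of the ground-truth
    frame (H, W, 3), values in [0, 1], that are closest to pixel (py, px) in
    bilateral (color + position) space. The distance form below is an assumed
    stand-in for the patent's undisclosed formula.
    """
    h, w, _ = truth_rgb.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dr, dg, db = (truth_rgb - truth_rgb[py, px]).transpose(2, 0, 1)
    dist = dr**2 + dg**2 + db**2 + lam * np.abs(xs - px) + lam * np.abs(ys - py)
    order = np.argsort(dist.ravel())[1:j + 1]                  # skip the pixel itself
    return np.stack(np.unravel_index(order, (h, w)), axis=1)   # (J, 2) coordinates
```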
Since adjacent pixels of the colored picture should have the same or similar colors as far as possible, a consistency loss function l_b is used within the adjacent similar regions to measure the pixel differences there, and the preliminary coloring network is constrained with l_b so that pixels inside an adjacent similar region tend toward consistent, similar values and the coloring result of the picture is as smooth as possible.
The consistency loss function is as follows:
$$\ell_b=\frac{1}{n}\sum_{t=1}^{n}\sum_{p}\sum_{q\in\Omega_{Y_t}(p)}\left\|f_p(X_t)-f_q(X_t)\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, p and q pixel positions, Y_t the ground-truth image of the t-th frame, Ω_{Y_t}(p) the adjacent similar region of pixel p obtained from Y_t, X_t the input original image of the t-th frame, f_p(X_t) the pixel value at position p of the coloring map obtained by feeding X_t into the preliminary coloring network f, and f_q(X_t) the pixel value at position q of that coloring map.
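The constraint above might be implemented roughly as follows; reducing by a plain mean and representing the adjacent similar regions as precomputed flat indices are assumptions made for illustration.

```python
import torch

def consistency_loss(pred: torch.Tensor, neighbor_idx: torch.Tensor) -> torch.Tensor:
    """Consistency loss l_b for one frame.

    pred:         (C, H, W) preliminary coloring map f(X_t).
    neighbor_idx: (H*W, J) flat indices of every pixel's adjacent similar
                  region, precomputed on the ground-truth frame Y_t.
    """
    c, h, w = pred.shape
    flat = pred.reshape(c, h * w)          # f_p(X_t) for every pixel p
    p_vals = flat.unsqueeze(2)             # (C, H*W, 1)
    q_vals = flat[:, neighbor_idx]         # (C, H*W, J), i.e. f_q(X_t) for q in the region
    return (p_vals - q_vals).abs().mean()  # averaging over frames happens outside
```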
Step S5, the feature information maps are fed into the preliminary coloring network to obtain several preliminary coloring maps for each frame, and the preliminary coloring network is constrained with a diversity loss function on these preliminary coloring maps.
Specifically, as shown in FIG. 3, the feature information maps of the inputs X_{t-1}, X_t, X_{t+1} are fed separately into the preliminary coloring network, and each yields several preliminary coloring maps:

C_{t-1}(1), C_{t-1}(2), ..., C_{t-1}(d), C_t(1), C_t(2), ..., C_t(d), C_{t+1}(1), C_{t+1}(2), ..., C_{t+1}(d).
In this example, the number of generated preliminary coloring pictures is d = 4, and the decreasing sequence α_i takes the values 0.08, 0.04, 0.02 and 0.01. The preliminary coloring network is constrained with the diversity loss function so that its results tend to be consistent and the diversity of the coloring results is reduced.
The diversity loss function is:
$$\ell_d=\frac{1}{n}\sum_{t=1}^{n}\sum_{i=1}^{d}\alpha_i\left\|\Phi\big(C_t(i)\big)-\Phi\big(Y_t\big)\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, C_t(i) the i-th preliminary coloring picture output for the t-th frame by the preliminary coloring network, Y_t the ground-truth image of the t-th frame, α_i a decreasing sequence of weights, Φ the feature maps extracted by the pre-trained VGG-19 network, and d the number of preliminary coloring pictures generated.
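One possible reading of this loss is sketched below; how the d candidates are paired with the decreasing weights α_i (here: sorted by perceptual error, smallest first) is an assumption, since the text only states that the sequence decreases.

```python
import torch

def diversity_loss(candidates, target_feat: torch.Tensor, phi,
                   alphas=(0.08, 0.04, 0.02, 0.01)) -> torch.Tensor:
    """Diversity loss l_d for one frame.

    candidates:  list of d preliminary coloring maps C_t(1..d), each (N, 3, H, W).
    target_feat: phi(Y_t), features of the ground-truth frame.
    phi:         feature extractor, e.g. the VGGFeatureExtractor sketched earlier.
    alphas:      the decreasing weights alpha_i from this example.
    """
    errors = [torch.mean(torch.abs(phi(c) - target_feat)) for c in candidates]
    errors = torch.stack(sorted(errors, key=lambda e: e.item()))     # smallest error first
    weights = torch.tensor(alphas[: len(errors)], device=errors.device)
    return (weights * errors).sum()
```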
Step S6, the three frames of preliminary coloring maps are fed into the optical flow alignment module, while the preliminary coloring network is constrained with the temporal loss function.
Specifically, for each frame the picture with the highest average pixel saturation among its preliminary coloring pictures is selected, giving C_{t-1}, C_t, C_{t+1}, and these three pictures are fed into the optical flow alignment module.
As shown in FIG. 4, the optical flow alignment module applies the warp alignment operation to C_{t-1} with the forward optical flow from frame t-1 to frame t, obtaining the warped map ω_{t-1→t}(C_{t-1}), and applies the warp alignment operation to C_{t+1} with the backward optical flow from frame t+1 to frame t, obtaining the warped map ω_{t+1→t}(C_{t+1}). The preliminary coloring network is constrained with the forward-warping and backward-warping terms of the temporal loss function, which strengthens the temporal consistency of its coloring maps, makes the video coloring effect smooth overall, and suppresses artifacts.
Specifically, the warp alignment operation works as follows: the input optical flow indicates the displacement each pixel of frame A needs in order to align with frame B; adding these displacements to the pixel coordinates of frame A yields the result of aligning frame A to frame B. ω_{t-1→t}(C_{t-1}) is the result of warping C_{t-1} with the forward optical flow from frame t-1 to frame t and should be consistent with C_t; ω_{t+1→t}(C_{t+1}) is the result of warping C_{t+1} with the backward optical flow from frame t+1 to frame t and should likewise be consistent with C_t.
The temporal loss function is a forward-backward temporal consistency loss function:

$$\ell_t=\frac{1}{n}\sum_{t=1}^{n}\Big(\big\|M_{t-1\to t}\odot\big(f(X_t)-\omega_{t-1\to t}(f(X_{t-1}))\big)\big\|_1+\big\|M_{t+1\to t}\odot\big(f(X_t)-\omega_{t+1\to t}(f(X_{t+1}))\big)\big\|_1\Big)$$

where t denotes the current frame, n the number of frames in the video sequence, X_t the input original image of the t-th frame, f(X_t) the coloring map obtained by feeding X_t into the preliminary coloring network f, ω_{t-1→t} the warp alignment operation using the forward optical flow from frame t-1 to frame t, ω_{t+1→t} the warp alignment operation using the backward optical flow from frame t+1 to frame t, M_{t-1→t} and M_{t+1→t} binary masks consisting of 0s and 1s, and ⊙ the element-wise product.
Specifically, M_{t-1→t} and M_{t+1→t} in step S6 are binary masks consisting of 0s and 1s; they remove the regions where the warped picture differs too much from the original picture and thus reduce the errors that optical flow alignment may introduce. Concretely, the absolute value of the per-pixel difference between the warped picture and the original picture is computed, a threshold m is set, positions whose difference is smaller than m are set to 1 and positions whose difference is larger than m are set to 0, giving the binary mask. In this example m = 0.05.
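The warp alignment operation, the binary mask and the temporal loss could be sketched as follows; the grid_sample-based warping is standard practice, while computing the mask from the warped and current coloring maps is an assumption, since the text does not say which pair of pictures the mask is derived from.

```python
import torch
import torch.nn.functional as F

def warp(img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp img (N, C, H, W) with a pixel-displacement flow (N, 2, H, W)."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W) pixel grid
    coords = base + flow                                       # shifted sampling positions
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0              # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                       # (N, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def binary_mask(warped: torch.Tensor, reference: torch.Tensor, m: float = 0.05) -> torch.Tensor:
    """M: 1 where the per-pixel absolute difference stays below the threshold m, else 0."""
    return ((warped - reference).abs().max(dim=1, keepdim=True).values < m).float()

def temporal_loss(c_prev, c_cur, c_next, flow_fwd, flow_bwd):
    """Forward-backward temporal consistency loss l_t for one frame triplet.

    c_prev, c_cur, c_next: preliminary coloring maps f(X_{t-1}), f(X_t), f(X_{t+1}).
    flow_fwd, flow_bwd:    optical flow from frame t-1 to t and from frame t+1 to t.
    """
    w_prev, w_next = warp(c_prev, flow_fwd), warp(c_next, flow_bwd)
    m_prev, m_next = binary_mask(w_prev, c_cur), binary_mask(w_next, c_cur)
    return (m_prev * (c_cur - w_prev).abs()).mean() + \
           (m_next * (c_cur - w_next).abs()).mean()
```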
Step S7, the output of the preliminary coloring network and the output of the optical flow alignment module are fed together into the enhanced coloring network to obtain the final output, and the error between the final output and the ground truth is computed with the L1 norm.
Specifically, as shown in FIG. 4, the confidence map A_{t-1→t} between ω_{t-1→t}(C_{t-1}) and C_t and the confidence map B_{t+1→t} between ω_{t+1→t}(C_{t+1}) and C_t are computed; in this example β = 1000. The preliminary coloring map C_t, the warped maps ω_{t-1→t}(C_{t-1}) and ω_{t+1→t}(C_{t+1}) obtained by optical flow warping, and the confidence maps A and B are concatenated and fed together into the enhanced coloring network to obtain the final output O_t of frame t. In this process the L1 norm is used to compute the error between the ground truth Y and the final output O_t.
The confidence maps are calculated as follows:

[equation: confidence map A_{t-1→t}, computed from the difference between ω_{t-1→t}(C_{t-1}) and C_t and the parameter β]

[equation: confidence map B_{t+1→t}, computed from the difference between ω_{t+1→t}(C_{t+1}) and C_t and the parameter β]

where β is a parameter that adjusts the values of the confidence maps, ω_{t-1→t}(C_{t-1}) is the result of applying the warp alignment operation to C_{t-1} with the forward optical flow from frame t-1 to frame t, and C_t is the preliminary coloring map of the t-th frame.
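The exact confidence-map formula is not reproduced above, so the sketch below assumes an exponential of the squared difference scaled by β and shows how the input of the enhanced coloring network could be assembled by channel-wise concatenation.

```python
import torch

def confidence_map(warped: torch.Tensor, reference: torch.Tensor,
                   beta: float = 1000.0) -> torch.Tensor:
    """A plausible confidence map between a warped preliminary coloring map and
    C_t: close to 1 where the two agree, close to 0 where they differ. The
    exponential form and the mean squared difference are assumptions."""
    sq_err = ((warped - reference) ** 2).mean(dim=1, keepdim=True)
    return torch.exp(-beta * sq_err)

def enhancement_input(c_cur, warped_prev, warped_next):
    """Concatenate C_t, the two warped maps and their confidence maps along the
    channel axis as the input of the enhanced coloring network."""
    a = confidence_map(warped_prev, c_cur)   # A_{t-1 -> t}
    b = confidence_map(warped_next, c_cur)   # B_{t+1 -> t}
    return torch.cat([c_cur, warped_prev, warped_next, a, b], dim=1)
```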
specifically, the preliminary coloring network and the enhanced coloring network employ the same network structure, as shown in fig. 5. The input first passes through a convolution layer with a convolution kernel 1*1 to change the latitude of the input feature map to match the latitude of the network input. Then, there are consecutive 5 downsampled blocks, each downsampled block comprising two convolutionally layers of convolution kernel 3*3 and one maximally pooled layer. And then passing through two convolution layers with a convolution kernel of 3*3, and then passing through 5 continuous upsampling blocks, wherein each upsampling block comprises an upsampling layer and two convolution layers with a convolution kernel of 3*3, and finally obtaining output. Wherein the output of each downsampled block convolutional layer is spliced in the channel dimension into the input of an upsampled block convolutional layer of the same dimension. The activation function in the network is uniformly set to LeakyReLu.
The L1 norm loss function is:

$$\ell_1=\frac{1}{n}\sum_{t=1}^{n}\left\|g\big(f(X_{t-1}),f(X_t),f(X_{t+1})\big)-Y_t\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, g(f(X_{t-1}), f(X_t), f(X_{t+1})) the output O_t of the enhanced coloring network g, and Y_t the ground truth of the t-th frame.
During training, the preliminary coloring network is trained first. In each training epoch, 5000 pictures are randomly sampled from the ImageNet data set and used for training with the consistency loss function l_b and the diversity loss function l_d; then 1000 groups of three adjacent frames are randomly sampled from DAVIS and used for training with the temporal consistency loss function l_t. When training the enhanced coloring network, its input is first generated by the preliminary coloring network and the optical flow alignment module, and the network is then trained by computing the L1-norm error between the final output and the ground truth. Many data sets can be used, including but not limited to those used in this example.
During testing, the black-and-white test set and its black-and-white optical flow sequences are fed into the network to obtain the colored video. The output is compared with the original video, and the PSNR (Peak Signal-to-Noise Ratio) and LPIPS (Learned Perceptual Image Patch Similarity, a deep-feature measure of image similarity) metrics are calculated; the three-frame-input video coloring method achieves good results.
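For the evaluation step, PSNR can be computed directly as below, assuming pixel values in [0, 1]; LPIPS is typically taken from a third-party implementation and is therefore not sketched here.

```python
import torch

def psnr(output: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio between the colored output and the original color video."""
    mse = torch.mean((output - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```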
Example 2
The pre-trained optical flow model of this embodiment may also be another optical flow model, such as FlowNet or SPyNet.
Example 3
In this example, the bilateral distance between the target pixel and the other pixels is calculated as:

r³ + g³ + b³ + λ·w + λ·h

where r, g and b are the differences between the pixel values of the target pixel and the other pixel in the R, G and B channels of the RGB color space, w and h are the lateral and longitudinal distances between the two pixels, and λ is a weight balancing the spatial distance against the color distance; in this example λ = 0.01.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (4)

1. The automatic video coloring method based on the deep neural network is characterized by comprising the following steps of:
S1, acquiring an original color video data set and converting the color video into black-and-white video to obtain a black-and-white video frame sequence and a color video frame sequence for network training;
S2, respectively calculating the forward and backward optical flows between adjacent frames in the color video frame sequence and the black-and-white video frame sequence;
S3, selecting three adjacent frames from the black-and-white data set and inputting them into a feature extraction network to extract feature information maps;
S4, calculating the adjacent similar region of each pixel in the target image, and constraining the preliminary coloring network with a consistency loss function within the adjacent similar regions; the consistency loss function is as follows:
$$\ell_b=\frac{1}{n}\sum_{t=1}^{n}\sum_{p}\sum_{q\in\Omega_{Y_t}(p)}\left\|f_p(X_t)-f_q(X_t)\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, p and q pixel positions, Y_t the ground-truth image of the t-th frame, Ω_{Y_t}(p) the adjacent similar region of pixel p obtained from Y_t, X_t the input original image of the t-th frame, f_p(X_t) the pixel value at position p of the coloring map obtained by feeding X_t into the preliminary coloring network f, and f_q(X_t) the pixel value at position q of that coloring map;
S5, inputting the feature information maps into a preliminary coloring network to obtain several preliminary coloring maps for each frame, constraining the preliminary coloring network on the preliminary coloring maps with a diversity loss function, and selecting, from the preliminary coloring pictures of each frame, the picture with the highest average pixel saturation as the input of the optical flow alignment module;
the diversity loss function is as follows:
$$\ell_d=\frac{1}{n}\sum_{t=1}^{n}\sum_{i=1}^{d}\alpha_i\left\|\Phi\big(C_t(i)\big)-\Phi\big(Y_t\big)\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, C_t(i) the i-th preliminary coloring picture output for the t-th frame by the preliminary coloring network, Y_t the ground-truth image of the t-th frame, α_i a decreasing sequence of weights, Φ the feature maps extracted by the pre-trained VGG-19 network, and d the number of preliminary coloring pictures generated;
S6, inputting the three frames of preliminary coloring maps into an optical flow alignment module while constraining the preliminary coloring network with a temporal loss function; the temporal loss function is a forward-backward temporal consistency loss function:

$$\ell_t=\frac{1}{n}\sum_{t=1}^{n}\Big(\big\|M_{t-1\to t}\odot\big(f(X_t)-\omega_{t-1\to t}(f(X_{t-1}))\big)\big\|_1+\big\|M_{t+1\to t}\odot\big(f(X_t)-\omega_{t+1\to t}(f(X_{t+1}))\big)\big\|_1\Big)$$

where t denotes the current frame, n the number of frames in the video sequence, X_t the input original image of the t-th frame, f(X_t) the coloring map obtained by feeding X_t into the preliminary coloring network f, ω_{t-1→t} the warp alignment operation using the forward optical flow from frame t-1 to frame t, ω_{t+1→t} the warp alignment operation using the backward optical flow from frame t+1 to frame t, M_{t-1→t} and M_{t+1→t} binary masks consisting of 0s and 1s, and ⊙ the element-wise product;
the output of the optical flow alignment module includes the preliminary coloring map C_t, the warped maps ω_{t-1→t}(C_{t-1}) and ω_{t+1→t}(C_{t+1}) obtained by optical flow warping, the confidence map A_{t-1→t} between ω_{t-1→t}(C_{t-1}) and C_t, and the confidence map B_{t+1→t} between ω_{t+1→t}(C_{t+1}) and C_t; the confidence maps are calculated as follows:

[equation: confidence map A_{t-1→t}, computed from the difference between ω_{t-1→t}(C_{t-1}) and C_t and the parameter β]

[equation: confidence map B_{t+1→t}, computed from the difference between ω_{t+1→t}(C_{t+1}) and C_t and the parameter β]

where β is a parameter that adjusts the values of the confidence maps; C_t is the preliminary coloring map of the t-th frame; ω_{t-1→t}(C_{t-1}) is the result of applying the warp alignment operation to C_{t-1} with the forward optical flow from frame t-1 to frame t, and ω_{t+1→t}(C_{t+1}) is the result of applying the warp alignment operation to C_{t+1} with the backward optical flow from frame t+1 to frame t;
S7, inputting the output of the preliminary coloring network and the output of the optical flow alignment module into the enhanced coloring network to obtain the final output, and computing the error between the final output and the ground truth with the L1 norm; the L1 norm loss function is:

$$\ell_1=\frac{1}{n}\sum_{t=1}^{n}\left\|g\big(f(X_{t-1}),f(X_t),f(X_{t+1})\big)-Y_t\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, g(f(X_{t-1}), f(X_t), f(X_{t+1})) the output O_t of the enhanced coloring network g, and Y_t the ground truth of the t-th frame.
2. The method according to claim 1, wherein the adjacent similar region in step S4 refers to a bilateral space calculated by using red, green, blue, lateral distance and longitudinal distance information in the true value picture, and is used for indicating a pixel region having similar color information to a pixel in the adjacent region.
3. The method for automatically coloring video based on a deep neural network according to claim 2, wherein the bilateral space first calculates the bilateral distance between the target pixel and the other pixels:
[equation: bilateral distance between the target pixel and another pixel, combining the color differences r, g, b with the λ-weighted spatial offsets w and h]

where r, g and b are the differences between the pixel values of the target pixel and the other pixel in the R, G and B channels of the RGB color space, w and h are the lateral and longitudinal distances between the two pixels, and λ is a weight balancing the spatial distance against the color distance.
4. A video automatic coloring method based on deep neural network according to any one of claims 1 to 3, wherein the preliminary coloring network and the enhanced coloring network adopt the same network structure, and input data passes through one convolution layer, then passes through a plurality of continuous downsampling blocks, then passes through the convolution layer, and then passes through a plurality of continuous upsampling blocks, wherein the output of each downsampling block convolution layer is spliced into the input of the upsampling block convolution layer of the same scale in the channel dimension.
CN202210678884.0A 2022-06-15 2022-06-15 Video automatic coloring method based on deep neural network Active CN115209119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210678884.0A CN115209119B (en) 2022-06-15 2022-06-15 Video automatic coloring method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210678884.0A CN115209119B (en) 2022-06-15 2022-06-15 Video automatic coloring method based on deep neural network

Publications (2)

Publication Number Publication Date
CN115209119A CN115209119A (en) 2022-10-18
CN115209119B true CN115209119B (en) 2023-06-23

Family

ID=83576116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210678884.0A Active CN115209119B (en) 2022-06-15 2022-06-15 Video automatic coloring method based on deep neural network

Country Status (1)

Country Link
CN (1) CN115209119B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116823973B (en) * 2023-08-25 2023-11-21 湖南快乐阳光互动娱乐传媒有限公司 Black-white video coloring method, black-white video coloring device and computer readable medium
CN117876279B (en) * 2024-03-11 2024-05-28 浙江荷湖科技有限公司 Method and system for removing motion artifact based on scanned light field sequence image

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673307A (en) * 2021-07-05 2021-11-19 浙江工业大学 Light-weight video motion recognition method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10776688B2 (en) * 2017-11-06 2020-09-15 Nvidia Corporation Multi-frame video interpolation using optical flow

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673307A (en) * 2021-07-05 2021-11-19 浙江工业大学 Light-weight video motion recognition method

Also Published As

Publication number Publication date
CN115209119A (en) 2022-10-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant