CN115209119B - Video automatic coloring method based on deep neural network - Google Patents

Video automatic coloring method based on deep neural network

Info

Publication number
CN115209119B
CN115209119B (application CN202210678884.0A)
Authority
CN
China
Prior art keywords
coloring
frame
network
preliminary
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210678884.0A
Other languages
Chinese (zh)
Other versions
CN115209119A (en)
Inventor
晋建秀
杨镒彰
郭锴凌
徐向民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210678884.0A priority Critical patent/CN115209119B/en
Publication of CN115209119A publication Critical patent/CN115209119A/en
Application granted granted Critical
Publication of CN115209119B publication Critical patent/CN115209119B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00 Details of colour television systems
    • H04N9/64 Circuits for processing colour signals
    • H04N9/648 Video amplifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00 Details of colour television systems
    • H04N9/64 Circuits for processing colour signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00 Details of colour television systems
    • H04N9/64 Circuits for processing colour signals
    • H04N9/73 Colour balance circuits, e.g. white balance circuits or colour temperature control
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video automatic coloring method based on a deep neural network, which comprises the following steps: acquiring an original color video data set and converting the color video into black-and-white video to obtain a black-and-white video frame sequence and a color video frame sequence for network training; calculating the forward and backward optical flows between adjacent frames in the color video frame sequence and the black-and-white video frame sequence; selecting three adjacent frames from the data set and inputting them into a feature extraction network to extract feature information maps; calculating the adjacent similar region of each pixel in the target image; inputting the feature information maps into a preliminary coloring network to obtain several preliminary coloring maps for each frame; inputting the three frames of preliminary coloring maps into an optical flow alignment module and constraining the preliminary coloring network with a temporal loss function; and inputting the outputs of the preliminary coloring network and of the optical flow alignment module into the enhanced coloring network to obtain the final output, whose error with respect to the ground truth is computed with the L1 norm.

Description

Video automatic coloring method based on deep neural network
Technical Field
The invention relates to the technical field of video processing, in particular to an automatic video coloring method based on a deep neural network.
Background
Most of the video we watch today is in color; color is an indispensable element from shooting through processing to viewing. In the past, however, shooting technology could not produce color video, yet many excellent film works were still created. These works are, without exception, black-and-white videos, which can feel unnatural to viewers accustomed to color. To preserve these classic film works and make them more accessible to modern audiences, black-and-white video colorization techniques have been proposed. In image coloring methods, research focuses on the spatial correlation within a frame: the mapping from black-and-white to color is learned from the spatial information inside the frame to be colored. Video coloring, however, must also take into account the temporal correlation between frames in order to generate coherent colors across frames. Thanks to the progress of deep learning in computer vision, methods that automatically color video with deep neural networks are now widely used, and they generally achieve better results than traditional coloring methods.
Existing video coloring algorithms can be broadly divided into reference-frame-based coloring methods and direct coloring methods without reference frames. Reference-frame-based methods first obtain one or more high-quality color reference frames, usually by manual coloring, and then propagate the color information to the other frames of the video sequence through a similarity matrix. Reference-free methods color the video sequence directly, without any reference frame.
Existing video coloring methods face the following problems. 1) The most important issue is temporal incoherence of the video sequence: the same object may receive different pixel values in different frames, producing flicker or artifacts. A common remedy is optical flow alignment, but most such approaches only consider unidirectional optical flow. 2) Without a colored reference frame, coloring is a one-to-many problem: a single input frame can correspond to several plausible colorizations, so the result is uncertain. In video coloring, the same object in two frames that are far apart may therefore end up with clearly different colors.
Deep neural networks have made great progress on video coloring, and most existing video coloring methods are built on them. In "Learning Blind Video Temporal Consistency" (ECCV 2018), Wei-Sheng Lai et al. first color each frame with an image coloring method and then improve the temporal consistency of the video using forward optical flow between adjacent frames and between frames farther apart. Chenyang Lei et al., in "Fully Automatic Video Colorization with Self-Regularization and Diversity" (2019), propose a diversity-aware loss function that drives the coloring results toward consistency and use forward optical flow between adjacent frames to improve temporal consistency. However, these methods only exploit forward optical flow and do not use backward optical flow information.
Disclosure of Invention
In view of the above problems, the invention provides a video automatic coloring method based on a deep neural network. An optical flow alignment module uses a bidirectional alignment strategy that combines forward and backward optical flow to address the temporal incoherence of colored video sequences, and a diversity loss function is used to address the uncertainty of video coloring results.
The invention is realized at least by one of the following technical schemes.
A video automatic coloring method based on a deep neural network comprises the following steps:
S1, acquiring an original color video data set and converting the color video into black-and-white video to obtain a black-and-white video frame sequence and a color video frame sequence for network training;
S2, respectively calculating the forward and backward optical flows between adjacent frames in the color video frame sequence and the black-and-white video frame sequence;
S3, selecting three adjacent frames from the black-and-white data set and inputting them into a feature extraction network to extract feature information maps;
S4, calculating the adjacent similar region of each pixel in the target image, and constraining the preliminary coloring network with a consistency loss function within the adjacent similar regions;
S5, inputting the feature information maps into a preliminary coloring network to obtain several preliminary coloring maps for each frame, and constraining the preliminary coloring network with a diversity loss function on the preliminary coloring maps;
S6, inputting the three frames of preliminary coloring maps into an optical flow alignment module while constraining the preliminary coloring network with a temporal loss function;
S7, inputting the output of the preliminary coloring network and the output of the optical flow alignment module into the enhanced coloring network to obtain the final output, and computing the error between the final output and the ground truth with the L1 norm.
Further, the adjacent similar region in step S4 refers to a neighborhood in a bilateral space computed in the ground-truth picture from red, green, blue, lateral-distance and longitudinal-distance information; it indicates the region of nearby pixels whose color information is similar to that of the given pixel.
Further, the bilateral space first calculates the bilateral distance between the target pixel point and the other pixel points:
[equation: bilateral distance between the target pixel and another pixel, combining the color differences r, g, b with the λ-weighted spatial offsets w and h]
where r, g and b are the differences between the pixel values of the target pixel and the other pixel in the R, G and B channels of the RGB color space, w and h are the lateral and longitudinal distances between the two pixels, and λ is a weight balancing the spatial distance against the color distance.
Further, the consistency loss function is as follows:
$$\ell_b=\frac{1}{n}\sum_{t=1}^{n}\sum_{p}\sum_{q\in\Omega_{Y_t}(p)}\left\|f_p(X_t)-f_q(X_t)\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, p and q pixel positions, Y_t the ground-truth image of the t-th frame, Ω_{Y_t}(p) the adjacent similar region of pixel p obtained from Y_t, X_t the input original image of the t-th frame, f_p(X_t) the pixel value at position p of the coloring map obtained by feeding X_t into the preliminary coloring network f, and f_q(X_t) the pixel value at position q of that coloring map.
Further, in step S5 an arbitrary number of preliminary coloring maps is obtained for each frame, and the picture with the highest average pixel saturation among the preliminary coloring pictures of each frame is selected as the input of the optical flow alignment module.
Further, the diversity loss function described in step S5 is:
$$\ell_d=\frac{1}{n}\sum_{t=1}^{n}\sum_{i=1}^{d}\alpha_i\left\|\Phi\big(C_t(i)\big)-\Phi\big(Y_t\big)\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, C_t(i) the i-th preliminary coloring picture output for the t-th frame by the preliminary coloring network, Y_t the ground-truth image of the t-th frame, α_i a decreasing sequence of weights, Φ the feature maps extracted by the pre-trained VGG-19 network, and d the number of preliminary coloring pictures generated.
Further, the temporal loss function in step S6 is a forward-backward temporal consistency loss function:

$$\ell_t=\frac{1}{n}\sum_{t=1}^{n}\Big(\big\|M_{t-1\to t}\odot\big(f(X_t)-\omega_{t-1\to t}(f(X_{t-1}))\big)\big\|_1+\big\|M_{t+1\to t}\odot\big(f(X_t)-\omega_{t+1\to t}(f(X_{t+1}))\big)\big\|_1\Big)$$

where t denotes the current frame, n the number of frames in the video sequence, X_t the input original image of the t-th frame, f(X_t) the coloring map obtained by feeding X_t into the preliminary coloring network f, ω_{t-1→t} the warp alignment operation using the forward optical flow from frame t-1 to frame t, ω_{t+1→t} the warp alignment operation using the backward optical flow from frame t+1 to frame t, M_{t-1→t} and M_{t+1→t} binary masks consisting of 0s and 1s, and ⊙ the element-wise product.
Further, the output of the optical flow alignment module in step S6 includes the preliminary coloring map C_t, the warped maps ω_{t-1→t}(C_{t-1}) and ω_{t+1→t}(C_{t+1}) obtained by optical flow warping, the confidence map A_{t-1→t} between ω_{t-1→t}(C_{t-1}) and C_t, and the confidence map B_{t+1→t} between ω_{t+1→t}(C_{t+1}) and C_t. The confidence maps are calculated as follows:

[equation: confidence map A_{t-1→t}, computed from the difference between ω_{t-1→t}(C_{t-1}) and C_t and the parameter β]

[equation: confidence map B_{t+1→t}, computed from the difference between ω_{t+1→t}(C_{t+1}) and C_t and the parameter β]

where β is a parameter that adjusts the values of the confidence maps; C_t is the preliminary coloring map of the t-th frame; ω_{t-1→t}(C_{t-1}) is the result of applying the warp alignment operation to C_{t-1} with the forward optical flow from frame t-1 to frame t, and ω_{t+1→t}(C_{t+1}) is the result of applying the warp alignment operation to C_{t+1} with the backward optical flow from frame t+1 to frame t.
Further, the L1 norm loss function in step S7 is:

$$\ell_1=\frac{1}{n}\sum_{t=1}^{n}\left\|g\big(f(X_{t-1}),f(X_t),f(X_{t+1})\big)-Y_t\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, g(f(X_{t-1}), f(X_t), f(X_{t+1})) the output O_t of the enhanced coloring network g, and Y_t the ground truth of the t-th frame.
Further, the preliminary coloring network and the enhanced coloring network use the same network structure: the input data first passes through a convolution layer, then through several consecutive downsampling blocks, then through a convolution layer, and finally through several consecutive upsampling blocks, where the output of the convolution layers of each downsampling block is concatenated, in the channel dimension, with the input of the upsampling block convolution layers at the same scale.
Compared with the prior art, the invention has the following beneficial effects:
By computing adjacent similar regions and keeping the pixels within them the same or similar, a spatially smooth coloring picture is obtained; by generating several preliminary coloring maps and adding a diversity loss function, the non-uniqueness of video coloring results is addressed and the results tend toward consistency. Through the optical flow relations among the three frames, the temporal consistency of the coloring results is strengthened and artifacts and jitter are reduced: by warping the first frame to the second frame with the forward optical flow and the third frame to the second frame with the backward optical flow, the colors of the colored video are kept coherent and uniform.
Drawings
FIG. 1 is a flow chart of an embodiment of a video automatic coloring method based on a deep neural network;
FIG. 2 is a schematic diagram of a video automatic coloring method based on a deep neural network according to an embodiment;
FIG. 3 is a schematic diagram of the input and output architecture of a preliminary coloring network;
FIG. 4 is a schematic diagram of an optical flow alignment module;
FIG. 5 is a schematic diagram of the structure of a preliminary coloring network and an enhanced coloring network.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not to be construed as limiting the present invention.
Example 1
Step S1, an original color video data set is obtained, color video is converted into black-and-white video, and a black-and-white video frame sequence and a color video frame sequence for network training are obtained.
Specifically, this embodiment uses the DAVIS data set and the VIDEVO data set to train and test the preliminary coloring network and the enhanced coloring network; in addition, the ImageNet data set is used to train the preliminary coloring network. Each frame of every video in the video data sets is converted from the RGB (red (R), green (G), blue (B)) color space to the YCbCr color space (Y is the luminance component, Cb the blue chrominance component and Cr the red chrominance component), and only the Y channel of the YCbCr image is kept and used as the original input of the model.
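For illustration, the conversion of step S1 could be implemented as in the following sketch; the ITU-R BT.601 luma coefficients and the NumPy array layout are assumptions, since the embodiment only specifies the RGB-to-YCbCr conversion and the retained Y channel.

```python
import numpy as np

def rgb_to_y(frame_rgb: np.ndarray) -> np.ndarray:
    """Return the Y (luma) channel of an RGB frame (H, W, 3) with values in [0, 1].

    The ITU-R BT.601 weights commonly used for the YCbCr conversion are assumed
    here; the embodiment does not state the exact coefficients.
    """
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b  # (H, W) luminance map

# Example: build the black-and-white training input from a decoded color clip.
color_clip = np.random.rand(8, 256, 256, 3)            # stand-in for decoded frames
bw_clip = np.stack([rgb_to_y(f) for f in color_clip])  # (T, H, W) model input
```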
Step S2, the forward and backward optical flows between adjacent frames are calculated for the color video frame sequence and the black-and-white video frame sequence respectively. Specifically, a pre-trained optical flow model is used; it predicts the pixel motion between any two adjacent video frames and outputs the result as a feature map with two channels, which represent the lateral and longitudinal displacements in the two-dimensional plane, i.e. the optical flow between the two adjacent frames. The color and black-and-white video frame sequences of the data set are fed to a pre-trained PWC-Net optical flow model in the forward and the reverse direction to obtain the forward and backward optical flow between each pair of adjacent frames.
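A possible way to obtain the bidirectional flows is sketched below, using the pre-trained RAFT model shipped with torchvision purely as a stand-in for the PWC-Net model named above; the tensor layout and the [0, 1] value range are assumptions.

```python
import torch
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights

weights = Raft_Small_Weights.DEFAULT          # RAFT used here as a stand-in for PWC-Net
flow_net = raft_small(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def bidirectional_flow(frame_a: torch.Tensor, frame_b: torch.Tensor):
    """frame_a, frame_b: (N, 3, H, W) in [0, 1], H and W divisible by 8.

    Returns (forward_flow, backward_flow), each (N, 2, H, W); the two channels
    hold the lateral and longitudinal pixel displacements between the frames.
    """
    a, b = preprocess(frame_a, frame_b)
    forward = flow_net(a, b)[-1]     # most refined flow estimate, frame_a -> frame_b
    backward = flow_net(b, a)[-1]    # flow estimate, frame_b -> frame_a
    return forward, backward
```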
Step S3, three adjacent frames are selected from the black-and-white data set as input data and fed to the feature extraction network to extract feature information maps, which serve as the input of the preliminary coloring network.
Specifically, three adjacent frames X_{t-1}, X_t, X_{t+1} are randomly selected from the data set as an input group. The feature information maps in step S3 are extracted with a VGG-19 network pre-trained for the object classification task on the ImageNet data set, which captures both low-level and high-level feature information of the picture. The VGG-19 network has five consecutive convolution blocks, each containing several convolution layers followed by a max-pooling layer, and ends with three fully connected layers. Concretely, the picture is fed into the pre-trained VGG-19 network, the output of the second convolution layer of each convolution block is extracted, giving five feature maps in total, and each of them is upsampled bilinearly to the resolution of the input picture, yielding the output feature map.
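The feature extraction of step S3 might look like the sketch below; which VGG-19 activations are tapped (here the second convolution of each of the five blocks) and the use of torchvision's ImageNet weights are assumptions drawn from the description above, and the single-channel Y input is simply repeated to three channels.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Indices assumed to correspond to conv1_2, conv2_2, conv3_2, conv4_2, conv5_2.
TAP_LAYERS = (2, 7, 12, 21, 30)

class VGGFeatureExtractor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)   # frozen, pre-trained feature extractor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (N, 3, H, W). Returns the five tapped feature maps, each upsampled
        bilinearly to the input resolution and concatenated along channels."""
        h, w = x.shape[-2:]
        maps = []
        for idx, layer in enumerate(self.features):
            x = layer(x)
            if idx in TAP_LAYERS:
                maps.append(F.interpolate(x, size=(h, w), mode="bilinear",
                                          align_corners=False))
            if idx == TAP_LAYERS[-1]:
                break
        return torch.cat(maps, dim=1)  # feature information map for the coloring network

# Usage: feats = VGGFeatureExtractor()(y_frame.repeat(1, 3, 1, 1))
```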
Step S4, the adjacent similar region of each pixel in the target image is calculated, and the preliminary coloring network is constrained with a consistency loss function within the adjacent similar regions.
Specifically, the adjacent similar region in step S4 refers to a neighborhood in a bilateral space computed in the ground-truth picture from red, green, blue, lateral-distance and longitudinal-distance information; it indicates the region of nearby pixels whose color information is similar to that of the given pixel. The adjacent similar region can be computed in various ways, including but not limited to the method used in this example.
In this example, the bilateral distance between the target pixel and each of the other pixels is first calculated:

[equation: bilateral distance between the target pixel and another pixel, combining the color differences r, g, b with the λ-weighted spatial offsets w and h]

where r, g and b are the differences between the pixel values of the target pixel and the other pixel in the R, G and B channels of the RGB color space, w and h are the lateral and longitudinal distances between the two pixels, and λ is a weight balancing the spatial distance against the color distance; in this example λ = 200.
After the bilateral distances between the target pixel and the other pixels have been obtained, the J pixels with the smallest bilateral distance to the target pixel are taken to form its adjacent similar region; in this example J = 10.
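A sketch of how the adjacent similar region of a single target pixel could be assembled is given below; the exact bilateral-distance formula is not reproduced in the text, so the squared color differences combined with λ-weighted spatial offsets used here are only an assumption.

```python
import numpy as np

def adjacent_similar_region(truth_rgb: np.ndarray, py: int, px: int,
                            lam: float = 200.0, j: int = 10) -> np.ndarray:
    """Return the (row, col) coordinates of the J pixels of the ground-truth
    frame (H, W, 3), values in [0, 1], that are closest to pixel (py, px) in
    bilateral (color + position) space. The distance form below is an assumed
    stand-in for the patent's undisclosed formula.
    """
    h, w, _ = truth_rgb.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dr, dg, db = (truth_rgb - truth_rgb[py, px]).transpose(2, 0, 1)
    dist = dr**2 + dg**2 + db**2 + lam * np.abs(xs - px) + lam * np.abs(ys - py)
    order = np.argsort(dist.ravel())[1:j + 1]                  # skip the pixel itself
    return np.stack(np.unravel_index(order, (h, w)), axis=1)   # (J, 2) coordinates
```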
Since adjacent pixels of the colored picture should have the same or similar colors as far as possible, a consistency loss function l_b is used within the adjacent similar regions to measure the pixel differences there, and the preliminary coloring network is constrained with l_b so that pixels inside an adjacent similar region tend toward consistent, similar values and the coloring result of the picture is as smooth as possible.
The consistency loss function is as follows:
$$\ell_b=\frac{1}{n}\sum_{t=1}^{n}\sum_{p}\sum_{q\in\Omega_{Y_t}(p)}\left\|f_p(X_t)-f_q(X_t)\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, p and q pixel positions, Y_t the ground-truth image of the t-th frame, Ω_{Y_t}(p) the adjacent similar region of pixel p obtained from Y_t, X_t the input original image of the t-th frame, f_p(X_t) the pixel value at position p of the coloring map obtained by feeding X_t into the preliminary coloring network f, and f_q(X_t) the pixel value at position q of that coloring map.
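The constraint above might be implemented roughly as follows; reducing by a plain mean and representing the adjacent similar regions as precomputed flat indices are assumptions made for illustration.

```python
import torch

def consistency_loss(pred: torch.Tensor, neighbor_idx: torch.Tensor) -> torch.Tensor:
    """Consistency loss l_b for one frame.

    pred:         (C, H, W) preliminary coloring map f(X_t).
    neighbor_idx: (H*W, J) flat indices of every pixel's adjacent similar
                  region, precomputed on the ground-truth frame Y_t.
    """
    c, h, w = pred.shape
    flat = pred.reshape(c, h * w)          # f_p(X_t) for every pixel p
    p_vals = flat.unsqueeze(2)             # (C, H*W, 1)
    q_vals = flat[:, neighbor_idx]         # (C, H*W, J), i.e. f_q(X_t) for q in the region
    return (p_vals - q_vals).abs().mean()  # averaging over frames happens outside
```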
Step S5, the feature information maps are fed into the preliminary coloring network to obtain several preliminary coloring maps for each frame, and the preliminary coloring network is constrained with a diversity loss function on these preliminary coloring maps.
Specifically, as shown in FIG. 3, the feature information maps of the inputs X_{t-1}, X_t, X_{t+1} are fed separately into the preliminary coloring network, and each yields several preliminary coloring maps:

C_{t-1}(1), C_{t-1}(2), ..., C_{t-1}(d), C_t(1), C_t(2), ..., C_t(d), C_{t+1}(1), C_{t+1}(2), ..., C_{t+1}(d).
In this example, the number of generated preliminary coloring pictures is d = 4, and the decreasing sequence α_i takes the values 0.08, 0.04, 0.02 and 0.01. The preliminary coloring network is constrained with the diversity loss function so that its results tend to be consistent and the diversity of the coloring results is reduced.
The diversity loss function is:
$$\ell_d=\frac{1}{n}\sum_{t=1}^{n}\sum_{i=1}^{d}\alpha_i\left\|\Phi\big(C_t(i)\big)-\Phi\big(Y_t\big)\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, C_t(i) the i-th preliminary coloring picture output for the t-th frame by the preliminary coloring network, Y_t the ground-truth image of the t-th frame, α_i a decreasing sequence of weights, Φ the feature maps extracted by the pre-trained VGG-19 network, and d the number of preliminary coloring pictures generated.
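One possible reading of this loss is sketched below; how the d candidates are paired with the decreasing weights α_i (here: sorted by perceptual error, smallest first) is an assumption, since the text only states that the sequence decreases.

```python
import torch

def diversity_loss(candidates, target_feat: torch.Tensor, phi,
                   alphas=(0.08, 0.04, 0.02, 0.01)) -> torch.Tensor:
    """Diversity loss l_d for one frame.

    candidates:  list of d preliminary coloring maps C_t(1..d), each (N, 3, H, W).
    target_feat: phi(Y_t), features of the ground-truth frame.
    phi:         feature extractor, e.g. the VGGFeatureExtractor sketched earlier.
    alphas:      the decreasing weights alpha_i from this example.
    """
    errors = [torch.mean(torch.abs(phi(c) - target_feat)) for c in candidates]
    errors = torch.stack(sorted(errors, key=lambda e: e.item()))     # smallest error first
    weights = torch.tensor(alphas[: len(errors)], device=errors.device)
    return (weights * errors).sum()
```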
Step S6, the three frames of preliminary coloring maps are fed into the optical flow alignment module, while the preliminary coloring network is constrained with the temporal loss function.
Specifically, for each frame the picture with the highest average pixel saturation among its preliminary coloring pictures is selected, giving C_{t-1}, C_t, C_{t+1}, and these three pictures are fed into the optical flow alignment module.
As shown in FIG. 4, the optical flow alignment module applies the warp alignment operation to C_{t-1} with the forward optical flow from frame t-1 to frame t, obtaining the warped map ω_{t-1→t}(C_{t-1}), and applies the warp alignment operation to C_{t+1} with the backward optical flow from frame t+1 to frame t, obtaining the warped map ω_{t+1→t}(C_{t+1}). The preliminary coloring network is constrained with the forward-warping and backward-warping terms of the temporal loss function, which strengthens the temporal consistency of its coloring maps, makes the video coloring effect smooth overall, and suppresses artifacts.
Specifically, the warp alignment operation works as follows: the input optical flow indicates the displacement each pixel of frame A needs in order to align with frame B; adding these displacements to the pixel coordinates of frame A yields the result of aligning frame A to frame B. ω_{t-1→t}(C_{t-1}) is the result of warping C_{t-1} with the forward optical flow from frame t-1 to frame t and should be consistent with C_t; ω_{t+1→t}(C_{t+1}) is the result of warping C_{t+1} with the backward optical flow from frame t+1 to frame t and should likewise be consistent with C_t.
The temporal loss function is a forward-backward temporal consistency loss function:

$$\ell_t=\frac{1}{n}\sum_{t=1}^{n}\Big(\big\|M_{t-1\to t}\odot\big(f(X_t)-\omega_{t-1\to t}(f(X_{t-1}))\big)\big\|_1+\big\|M_{t+1\to t}\odot\big(f(X_t)-\omega_{t+1\to t}(f(X_{t+1}))\big)\big\|_1\Big)$$

where t denotes the current frame, n the number of frames in the video sequence, X_t the input original image of the t-th frame, f(X_t) the coloring map obtained by feeding X_t into the preliminary coloring network f, ω_{t-1→t} the warp alignment operation using the forward optical flow from frame t-1 to frame t, ω_{t+1→t} the warp alignment operation using the backward optical flow from frame t+1 to frame t, M_{t-1→t} and M_{t+1→t} binary masks consisting of 0s and 1s, and ⊙ the element-wise product.
Specifically, M_{t-1→t} and M_{t+1→t} in step S6 are binary masks consisting of 0s and 1s; they remove the regions where the warped picture differs too much from the original picture and thus reduce the errors that optical flow alignment may introduce. Concretely, the absolute value of the per-pixel difference between the warped picture and the original picture is computed, a threshold m is set, positions whose difference is smaller than m are set to 1 and positions whose difference is larger than m are set to 0, giving the binary mask. In this example m = 0.05.
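The warp alignment operation, the binary mask and the temporal loss could be sketched as follows; the grid_sample-based warping is standard practice, while computing the mask from the warped and current coloring maps is an assumption, since the text does not say which pair of pictures the mask is derived from.

```python
import torch
import torch.nn.functional as F

def warp(img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp img (N, C, H, W) with a pixel-displacement flow (N, 2, H, W)."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W) pixel grid
    coords = base + flow                                       # shifted sampling positions
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0              # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                       # (N, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def binary_mask(warped: torch.Tensor, reference: torch.Tensor, m: float = 0.05) -> torch.Tensor:
    """M: 1 where the per-pixel absolute difference stays below the threshold m, else 0."""
    return ((warped - reference).abs().max(dim=1, keepdim=True).values < m).float()

def temporal_loss(c_prev, c_cur, c_next, flow_fwd, flow_bwd):
    """Forward-backward temporal consistency loss l_t for one frame triplet.

    c_prev, c_cur, c_next: preliminary coloring maps f(X_{t-1}), f(X_t), f(X_{t+1}).
    flow_fwd, flow_bwd:    optical flow from frame t-1 to t and from frame t+1 to t.
    """
    w_prev, w_next = warp(c_prev, flow_fwd), warp(c_next, flow_bwd)
    m_prev, m_next = binary_mask(w_prev, c_cur), binary_mask(w_next, c_cur)
    return (m_prev * (c_cur - w_prev).abs()).mean() + \
           (m_next * (c_cur - w_next).abs()).mean()
```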
Step S7, the output of the preliminary coloring network and the output of the optical flow alignment module are fed together into the enhanced coloring network to obtain the final output, and the error between the final output and the ground truth is computed with the L1 norm.
Specifically, as shown in FIG. 4, the confidence map A_{t-1→t} between ω_{t-1→t}(C_{t-1}) and C_t and the confidence map B_{t+1→t} between ω_{t+1→t}(C_{t+1}) and C_t are computed; in this example β = 1000. The preliminary coloring map C_t, the warped maps ω_{t-1→t}(C_{t-1}) and ω_{t+1→t}(C_{t+1}) obtained by optical flow warping, and the confidence maps A and B are concatenated and fed together into the enhanced coloring network to obtain the final output O_t of frame t. In this process the L1 norm is used to compute the error between the ground truth Y and the final output O_t.
The confidence maps are calculated as follows:

[equation: confidence map A_{t-1→t}, computed from the difference between ω_{t-1→t}(C_{t-1}) and C_t and the parameter β]

[equation: confidence map B_{t+1→t}, computed from the difference between ω_{t+1→t}(C_{t+1}) and C_t and the parameter β]

where β is a parameter that adjusts the values of the confidence maps, ω_{t-1→t}(C_{t-1}) is the result of applying the warp alignment operation to C_{t-1} with the forward optical flow from frame t-1 to frame t, and C_t is the preliminary coloring map of the t-th frame.
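The exact confidence-map formula is not reproduced above, so the sketch below assumes an exponential of the squared difference scaled by β and shows how the input of the enhanced coloring network could be assembled by channel-wise concatenation.

```python
import torch

def confidence_map(warped: torch.Tensor, reference: torch.Tensor,
                   beta: float = 1000.0) -> torch.Tensor:
    """A plausible confidence map between a warped preliminary coloring map and
    C_t: close to 1 where the two agree, close to 0 where they differ. The
    exponential form and the mean squared difference are assumptions."""
    sq_err = ((warped - reference) ** 2).mean(dim=1, keepdim=True)
    return torch.exp(-beta * sq_err)

def enhancement_input(c_cur, warped_prev, warped_next):
    """Concatenate C_t, the two warped maps and their confidence maps along the
    channel axis as the input of the enhanced coloring network."""
    a = confidence_map(warped_prev, c_cur)   # A_{t-1 -> t}
    b = confidence_map(warped_next, c_cur)   # B_{t+1 -> t}
    return torch.cat([c_cur, warped_prev, warped_next, a, b], dim=1)
```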
specifically, the preliminary coloring network and the enhanced coloring network employ the same network structure, as shown in fig. 5. The input first passes through a convolution layer with a convolution kernel 1*1 to change the latitude of the input feature map to match the latitude of the network input. Then, there are consecutive 5 downsampled blocks, each downsampled block comprising two convolutionally layers of convolution kernel 3*3 and one maximally pooled layer. And then passing through two convolution layers with a convolution kernel of 3*3, and then passing through 5 continuous upsampling blocks, wherein each upsampling block comprises an upsampling layer and two convolution layers with a convolution kernel of 3*3, and finally obtaining output. Wherein the output of each downsampled block convolutional layer is spliced in the channel dimension into the input of an upsampled block convolutional layer of the same dimension. The activation function in the network is uniformly set to LeakyReLu.
The L1 norm loss function is:

$$\ell_1=\frac{1}{n}\sum_{t=1}^{n}\left\|g\big(f(X_{t-1}),f(X_t),f(X_{t+1})\big)-Y_t\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, g(f(X_{t-1}), f(X_t), f(X_{t+1})) the output O_t of the enhanced coloring network g, and Y_t the ground truth of the t-th frame.
During training, the preliminary coloring network is trained first. In each training epoch, 5000 pictures are randomly sampled from the ImageNet data set and used for training with the consistency loss function l_b and the diversity loss function l_d; then 1000 groups of three adjacent frames are randomly sampled from DAVIS and used for training with the temporal consistency loss function l_t. When training the enhanced coloring network, its input is first generated by the preliminary coloring network and the optical flow alignment module, and the network is then trained by computing the L1-norm error between the final output and the ground truth. Many data sets can be used, including but not limited to those used in this example.
During testing, the black-and-white test set and its black-and-white optical flow sequences are fed into the network to obtain the colored video. The output is compared with the original video, and the PSNR (Peak Signal-to-Noise Ratio) and LPIPS (Learned Perceptual Image Patch Similarity, a deep-feature measure of image similarity) metrics are calculated; the three-frame-input video coloring method achieves good results.
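For the evaluation step, PSNR can be computed directly as below, assuming pixel values in [0, 1]; LPIPS is typically taken from a third-party implementation and is therefore not sketched here.

```python
import torch

def psnr(output: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio between the colored output and the original color video."""
    mse = torch.mean((output - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```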
Example 2
The pre-trained optical flow model of this embodiment may also be another optical flow model, such as FlowNet or SPyNet.
Example 3
In this example, the bilateral distance between the target pixel and the other pixels is calculated as:

r³ + g³ + b³ + λ·w + λ·h

where r, g and b are the differences between the pixel values of the target pixel and the other pixel in the R, G and B channels of the RGB color space, w and h are the lateral and longitudinal distances between the two pixels, and λ is a weight balancing the spatial distance against the color distance; in this example λ = 0.01.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (4)

1. The automatic video coloring method based on the deep neural network is characterized by comprising the following steps of:
S1, acquiring an original color video data set and converting the color video into black-and-white video to obtain a black-and-white video frame sequence and a color video frame sequence for network training;
S2, respectively calculating the forward and backward optical flows between adjacent frames in the color video frame sequence and the black-and-white video frame sequence;
S3, selecting three adjacent frames from the black-and-white data set and inputting them into a feature extraction network to extract feature information maps;
S4, calculating the adjacent similar region of each pixel in the target image, and constraining the preliminary coloring network with a consistency loss function within the adjacent similar regions; the consistency loss function is as follows:
$$\ell_b=\frac{1}{n}\sum_{t=1}^{n}\sum_{p}\sum_{q\in\Omega_{Y_t}(p)}\left\|f_p(X_t)-f_q(X_t)\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, p and q pixel positions, Y_t the ground-truth image of the t-th frame, Ω_{Y_t}(p) the adjacent similar region of pixel p obtained from Y_t, X_t the input original image of the t-th frame, f_p(X_t) the pixel value at position p of the coloring map obtained by feeding X_t into the preliminary coloring network f, and f_q(X_t) the pixel value at position q of that coloring map;
S5, inputting the feature information maps into a preliminary coloring network to obtain several preliminary coloring maps for each frame, constraining the preliminary coloring network on the preliminary coloring maps with a diversity loss function, and selecting, from the preliminary coloring pictures of each frame, the picture with the highest average pixel saturation as the input of the optical flow alignment module;
the diversity loss function is as follows:
$$\ell_d=\frac{1}{n}\sum_{t=1}^{n}\sum_{i=1}^{d}\alpha_i\left\|\Phi\big(C_t(i)\big)-\Phi\big(Y_t\big)\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, C_t(i) the i-th preliminary coloring picture output for the t-th frame by the preliminary coloring network, Y_t the ground-truth image of the t-th frame, α_i a decreasing sequence of weights, Φ the feature maps extracted by the pre-trained VGG-19 network, and d the number of preliminary coloring pictures generated;
S6, inputting the three frames of preliminary coloring maps into an optical flow alignment module while constraining the preliminary coloring network with a temporal loss function; the temporal loss function is a forward-backward temporal consistency loss function:

$$\ell_t=\frac{1}{n}\sum_{t=1}^{n}\Big(\big\|M_{t-1\to t}\odot\big(f(X_t)-\omega_{t-1\to t}(f(X_{t-1}))\big)\big\|_1+\big\|M_{t+1\to t}\odot\big(f(X_t)-\omega_{t+1\to t}(f(X_{t+1}))\big)\big\|_1\Big)$$

where t denotes the current frame, n the number of frames in the video sequence, X_t the input original image of the t-th frame, f(X_t) the coloring map obtained by feeding X_t into the preliminary coloring network f, ω_{t-1→t} the warp alignment operation using the forward optical flow from frame t-1 to frame t, ω_{t+1→t} the warp alignment operation using the backward optical flow from frame t+1 to frame t, M_{t-1→t} and M_{t+1→t} binary masks consisting of 0s and 1s, and ⊙ the element-wise product;
the output of the optical flow alignment module includes the preliminary coloring map C_t, the warped maps ω_{t-1→t}(C_{t-1}) and ω_{t+1→t}(C_{t+1}) obtained by optical flow warping, the confidence map A_{t-1→t} between ω_{t-1→t}(C_{t-1}) and C_t, and the confidence map B_{t+1→t} between ω_{t+1→t}(C_{t+1}) and C_t; the confidence maps are calculated as follows:

[equation: confidence map A_{t-1→t}, computed from the difference between ω_{t-1→t}(C_{t-1}) and C_t and the parameter β]

[equation: confidence map B_{t+1→t}, computed from the difference between ω_{t+1→t}(C_{t+1}) and C_t and the parameter β]

where β is a parameter that adjusts the values of the confidence maps; C_t is the preliminary coloring map of the t-th frame; ω_{t-1→t}(C_{t-1}) is the result of applying the warp alignment operation to C_{t-1} with the forward optical flow from frame t-1 to frame t, and ω_{t+1→t}(C_{t+1}) is the result of applying the warp alignment operation to C_{t+1} with the backward optical flow from frame t+1 to frame t;
S7, inputting the output of the preliminary coloring network and the output of the optical flow alignment module into the enhanced coloring network to obtain the final output, and computing the error between the final output and the ground truth with the L1 norm; the L1 norm loss function is:

$$\ell_1=\frac{1}{n}\sum_{t=1}^{n}\left\|g\big(f(X_{t-1}),f(X_t),f(X_{t+1})\big)-Y_t\right\|_1$$

where t denotes the current frame, n the number of frames in the video sequence, g(f(X_{t-1}), f(X_t), f(X_{t+1})) the output O_t of the enhanced coloring network g, and Y_t the ground truth of the t-th frame.
2. The method according to claim 1, wherein the adjacent similar region in step S4 refers to a bilateral space calculated by using red, green, blue, lateral distance and longitudinal distance information in the true value picture, and is used for indicating a pixel region having similar color information to a pixel in the adjacent region.
3. The method for automatically coloring video based on a deep neural network according to claim 2, wherein the bilateral space first calculates the bilateral distance between the target pixel and the other pixels:
[equation: bilateral distance between the target pixel and another pixel, combining the color differences r, g, b with the λ-weighted spatial offsets w and h]

where r, g and b are the differences between the pixel values of the target pixel and the other pixel in the R, G and B channels of the RGB color space, w and h are the lateral and longitudinal distances between the two pixels, and λ is a weight balancing the spatial distance against the color distance.
4. A video automatic coloring method based on deep neural network according to any one of claims 1 to 3, wherein the preliminary coloring network and the enhanced coloring network adopt the same network structure, and input data passes through one convolution layer, then passes through a plurality of continuous downsampling blocks, then passes through the convolution layer, and then passes through a plurality of continuous upsampling blocks, wherein the output of each downsampling block convolution layer is spliced into the input of the upsampling block convolution layer of the same scale in the channel dimension.
CN202210678884.0A 2022-06-15 2022-06-15 Video automatic coloring method based on deep neural network Active CN115209119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210678884.0A CN115209119B (en) 2022-06-15 2022-06-15 Video automatic coloring method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210678884.0A CN115209119B (en) 2022-06-15 2022-06-15 Video automatic coloring method based on deep neural network

Publications (2)

Publication Number Publication Date
CN115209119A CN115209119A (en) 2022-10-18
CN115209119B true CN115209119B (en) 2023-06-23

Family

ID=83576116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210678884.0A Active CN115209119B (en) 2022-06-15 2022-06-15 Video automatic coloring method based on deep neural network

Country Status (1)

Country Link
CN (1) CN115209119B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116823973B (en) * 2023-08-25 2023-11-21 湖南快乐阳光互动娱乐传媒有限公司 Black-white video coloring method, black-white video coloring device and computer readable medium
CN117876279B (en) * 2024-03-11 2024-05-28 浙江荷湖科技有限公司 Method and system for removing motion artifact based on scanned light field sequence image

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673307A (en) * 2021-07-05 2021-11-19 浙江工业大学 Light-weight video motion recognition method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10776688B2 (en) * 2017-11-06 2020-09-15 Nvidia Corporation Multi-frame video interpolation using optical flow

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673307A (en) * 2021-07-05 2021-11-19 浙江工业大学 Light-weight video motion recognition method

Also Published As

Publication number Publication date
CN115209119A (en) 2022-10-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant