CN113132655A - HDR video synthesis method based on deep learning - Google Patents

HDR video synthesis method based on deep learning

Info

Publication number
CN113132655A
Authority
CN
China
Prior art keywords
layer
image
hdr
format
hdr video
Prior art date
Legal status
Pending
Application number
CN202110252970.0A
Other languages
Chinese (zh)
Inventor
侯向辉
蔡泽永
李昕虎
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110252970.0A
Publication of CN113132655A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/265 Mixing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding
    • G06T 9/002 Image coding using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44012 Processing of video elementary streams involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Abstract

The invention relates to an HDR video synthesis method based on deep learning. The method combines an encoder-decoder structure with a U-Net structure, uses a ResNet in the middle of the network for further optimization, abstracts image features at multiple levels, and finally obtains an HDR image. The method is of great significance for generating high-quality HDR video from multi-exposure LDR video recorded by an ordinary camera.

Description

HDR video synthesis method based on deep learning
Technical Field
The invention relates to the technical field of video synthesis, and in particular to an HDR video synthesis method based on deep learning.
Background
Most digital image and video material currently available captures only a small portion of the visual information visible to the human eye and is not of sufficient quality for reproduction by next-generation display devices. The limiting factors are the limited color gamut and the limited dynamic range (contrast) captured by cameras and stored by most image and video formats. Conventional imaging with a low contrast range and limited gamut (LDR imaging) is restricted to three 8-bit integer color channels and does not provide the precision required by recent developments in image capture, processing, storage and display technology.
High Dynamic Range (HDR) techniques have emerged to increase the usable luminance range of images. After the dynamic range of a picture is increased, much information that would otherwise be impossible to display, because it is too dark or too bright, can be shown in the image. Compared with HDR picture processing, HDR video processing is more difficult: unlike the processing of a single HDR picture, HDR video must also achieve consistency between frames on top of single-picture processing. Because of this higher difficulty, HDR video processing has received less attention than HDR picture processing. However, HDR video processing technology has many advantages, including reducing the cost and hardware burden of current mobile phones and improving video quality, so its development is a necessary, if laborious, task.
To solve the HDR video processing problem better and faster, researchers in China and abroad have developed a number of algorithms, but the results are not very satisfactory. The first HDR video reconstruction algorithm, designed specifically for alternating-exposure sequences, uses optical flow to align adjacent frames with the reference frame and then merges the adjacent frames with a weighting strategy to avoid ghosting; however, when there is large-scale motion in the image, this method usually introduces optical-flow artifacts into the final result. Mangiat and Gibson improved the above method using block-based motion estimation and an optimization step, but their method still shows artifacts in occluded regions under large-scale motion. Kalantari et al. proposed a patch-based optimization that synthesizes the missing exposures for each frame, but it takes a long time to generate an HDR frame and may produce ghosting artifacts or unstable, unnatural motion. Finally, some recent methods formulate HDR video generation as a maximum a posteriori estimation problem, but this approach is time consuming, taking about 2 hours to generate a single frame at a resolution of 1280 × 720; in addition, its results are noisy, with ghosting and discoloration appearing in complex cases.
Disclosure of Invention
The present invention combines an encoder-decoder structure with a U-Net structure (the U-Net architecture is a network designed specifically for image segmentation), further optimizes the middle of the network with a ResNet (a network composed of residual blocks; a residual block is essentially a group of several consecutive layers with a shortcut connection added from its beginning to its end), abstracts image features at multiple levels, and finally obtains an HDR image. The method is of great significance for generating high-quality HDR video from multi-exposure LDR video recorded by an ordinary camera.
The invention achieves this aim through the following technical solution: an HDR video synthesis method based on deep learning, comprising the following steps:
(1) selecting video-frame data sets from the LiU HDRv repository, using the Astronauts, bridge and bridge_2 sequences as the training and testing data sets; converting the images to LDR (low dynamic range) format using Luminance HDR 2.5.1;
(2) segmenting the data-set pictures into many small patches to increase the amount of training data; converting them to tfrecord and storing them in the save_train folder; extracting three consecutive LDR frames from the data set each time, applying cropping, normalization, flipping and deformation operations so that the images better match pictures taken in real life, and storing them into tfrecord;
(3) reading the tfrecord to obtain three consecutive LDR frames I1, I2, I3 and feeding them into a neural network for training; the neural network takes the HDR image (Ir) of the middle frame as the reference frame and computes the final loss function; through neural-network learning, an HDR frame with comprehensive information is obtained, thereby enabling the synthesis of HDR video.
Preferably, the HDR video synthesizing method based on deep learning further includes:
(4) evaluating the synthesis quality of the HDR video using the peak signal-to-noise ratio (PSNR) as the evaluation criterion; the PSNR ranges over [0, 100], and the higher the PSNR value, the better the quality of the generated picture, i.e. the stronger the neural network's ability to reconstruct the HDR picture.
Preferably, step (2) is specifically: converting each small patch from BGR to RGB-format LDR; randomly rotating the image by 90, 180 or 270 degrees and randomly mirror-flipping it; cropping the image to 256 × 256, packing 3 LDR images and the corresponding HDR image with a batch size of 20, and converting them into tfrecord for storage; tfrecord is a format for storing a series of binary records.
Preferably, the neural network design in step (3) is specifically:
(I) the coding layer adopts a network structure of three convolution levels; each level is a convolution layer plus a pooling layer, abstracting features at three levels; the convolution layer uses a kernel of height and width 5, a stride of 2, and VALID padding; so that each convolution halves the height and width of the tensor, mirror (reflection) padding is added on the top, bottom, left and right sides of the tensor, making the height and width of the resulting matrix exactly half of the original; the number of convolution kernels is 64 in the first level, 128 in the second level and 256 in the third level; the pooling layer adopts a linear rectification function (ReLU);
(II) the merging layer: after passing through their respective encoders, the three images become three tensors of the same format; the three tensors are fused along the third dimension and passed through one convolution level whose kernel parameters are the same as the encoder's but whose number of kernels is 512; the resulting tensor is then input into a residual network (ResNet); the residual network has nine blocks; each residual block consists of two consecutive convolution layers and a normalization layer, and each residual block keeps its input and output formats identical;
(III) the decoding layer: the decoder uses a four-level network structure corresponding to the merging layer and the coding layer; each level is first merged along the third dimension with the same-depth feature taken before the residual network, and then a deconvolution layer and a normalization layer are applied; the height and width of each level's output tensor are twice those of its input, and the number of channels is half that of the input, except for the last level, whose number of channels is 3.
Preferably, the loss function in step (3) is a non-negative real-valued function, and the loss function specifically adopted is L2 loss; the formula is as follows:
L2 = |f(x) - Y|²
L2' = 2·f'(x)·(f(x) - Y)
where Y is the true value and f(x) is the predicted value.
Preferably, the step (3) is specifically as follows:
(3.1) reading the tfrecord and feeding it into the neural network; the three encoders read three different inputs, transform them through three convolution layers and one normalization layer, and pass them to the merging layer; after further transformation by the ResNet, the data are input to the decoding layer; the tensor of the HDR image is finally obtained through a decoding layer composed of deconvolution and merging operations;
(3.2) using an Adam optimizer, computing the L2 loss between the generated image and the reference frame, and back-propagating to update the weights;
(3.3) tone mapping the output, using a global tone mapping technique to map the HDR image into an LDR image and storing it to a file in png format; after the final tensor is obtained, the HDR image is converted into an LDR image because it needs to be visualized; the numerical range of the output is converted to 0-255 using the tone mapping output; the tone mapping function is:
T(x_{i,j}) = log(1 + μ·x_{i,j}) / log(1 + μ)
where μ is equal to 5000 and x_{i,j} is the value of the original tensor; the logarithm is natural (base e).
Preferably, the peak signal-to-noise ratio in step (4) is an objective criterion for evaluating images: it is the logarithm, in dB, of (2^n - 1)^2 (the square of the maximum signal value, where n is the number of bits per sample) relative to the mean square error between the original image and the processed image; the specific calculation formula is as follows:
MSE = (1 / (H·W)) · Σ_{i=0}^{H-1} Σ_{j=0}^{W-1} [X(i, j) - Y(i, j)]²
PSNR = 10 · log_10(MAX_I² / MSE)
where MSE represents the mean square error between the current image X and the reference image Y, H and W are the height and width of the images, respectively, and MAX_I is the maximum value of the image pixel color; if each sample is represented by 8 bits, this value is 255.
The beneficial effects of the invention are: the method implements an HDR video synthesis network based on an encoder-decoder structure to generate high-quality HDR video, and is of great significance for generating high-quality HDR video from multi-exposure LDR video.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of the overall architecture of the network of the present invention;
FIG. 3 is a schematic diagram of an overview of the coding layer of the present invention;
FIG. 4 is a schematic diagram of the convolution hierarchy of the present invention;
FIG. 5 is a schematic diagram of the overall architecture of the merging layer of the present invention;
FIG. 6 is a detailed design diagram of the residual block of the present invention;
FIG. 7 is a block diagram of the decoding layer architecture of the present invention.
Detailed Description
The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto:
Example: as shown in fig. 1 and fig. 2, a method for synthesizing HDR video based on deep learning specifically includes the following steps:
(1) Select video-frame data sets from the LiU HDRv repository, using the Astronauts, bridge and bridge_2 sequences as the training and testing data sets; Luminance HDR 2.5.1 is used to convert the images to LDR format.
(2) Segment the data-set pictures into about thirty thousand small patches to increase the amount of training data, convert them to tfrecord, and store them in the save_train folder. Then extract three consecutive LDR frames from the data set each time, apply cropping, normalization, flipping, deformation and similar operations so that the images better match pictures taken in real life, and store them into tfrecord. This step increases the robustness of the procedure and prevents overfitting. Specifically: first convert the images from BGR to RGB-format LDR; then randomly rotate each image by 90, 180 or 270 degrees and randomly mirror-flip it to reduce the network's dependence on image orientation; then crop it to 256 × 256. Finally pack the 3 LDR images and the corresponding HDR image with a batch size of 20 and convert them into tfrecord for storage. Tfrecord is a format for storing a series of binary records; binary data occupies less disk space and takes less time to copy, which speeds up reading of the file information. Therefore tfrecord is used in this system to store the training data; since the original files are BGR images, they are first converted into RGB format.
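For illustration only, a minimal Python sketch of this preprocessing step is given below; the function names, the per-example packing (batching at read time rather than in the file), and the output path are assumptions and are not taken from the patent's code.

```python
# Illustrative sketch: augment three BGR LDR frames and the reference HDR frame
# (rotation, mirroring, 256x256 crop) and pack them into a tfrecord example.
import cv2
import numpy as np
import tensorflow as tf

def augment(ldr_frames, hdr_ref, rng=np.random):
    """ldr_frames: three HxWx3 BGR uint8 frames; hdr_ref: HxWx3 float32 HDR frame."""
    imgs = [cv2.cvtColor(f, cv2.COLOR_BGR2RGB) for f in ldr_frames] + [hdr_ref]
    k = rng.randint(4)                           # rotate by 0/90/180/270 degrees
    imgs = [np.rot90(im, k) for im in imgs]
    if rng.rand() < 0.5:                         # random mirror flip
        imgs = [np.fliplr(im) for im in imgs]
    y = rng.randint(imgs[0].shape[0] - 256 + 1)  # random 256x256 crop
    x = rng.randint(imgs[0].shape[1] - 256 + 1)
    imgs = [im[y:y + 256, x:x + 256] for im in imgs]
    return imgs[:3], imgs[3]

def to_example(ldrs, hdr):
    feat = {f"ldr_{i}": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[l.tobytes()]))
            for i, l in enumerate(ldrs)}
    feat["hdr"] = tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[hdr.astype(np.float32).tobytes()]))
    return tf.train.Example(features=tf.train.Features(feature=feat))

# with tf.io.TFRecordWriter("save_train/train.tfrecord") as writer:
#     writer.write(to_example(*augment(ldr_frames, hdr_ref)).SerializeToString())
```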
(3) Read the tfrecord to obtain three consecutive LDR frames I1, I2, I3 and feed them into the network for training. The neural network takes the HDR image (Ir) of the middle frame as the reference frame and computes the final loss function. After the neural network has been trained, an HDR frame with comprehensive information can be obtained. The loss function is the core part of the empirical risk function and an important part of the structural risk function; it measures the degree of inconsistency between the model's predicted value f(x) and the true value Y, i.e. the difference between the prediction and the ground truth. It is a non-negative real-valued function; the smaller the loss, the more robust the model and the closer the training results are to the real situation. The loss function used by the present invention is the L2 loss, given by:
L2 = |f(x) - Y|²
L2' = 2·f'(x)·(f(x) - Y)
the step (3) specifically comprises the following steps:
(3.1) Read the tfrecord and input it into the network; the three encoders read three different inputs, transform them through three convolution layers and one normalization layer, and pass them to the merging layer; after further transformation by the ResNet, the data are input to the decoding layer; the tensor of the HDR image is finally obtained after a decoding layer composed of deconvolution and merging operations.
(3.2) Use an Adam optimizer, compute the L2 loss between the generated image and the reference frame, and back-propagate to update the weights. Adam is used to optimize the model; it is a well-established optimizer that combines the strengths of Adagrad (good at handling sparse gradients) and RMSprop (good at handling non-stationary objectives), has low memory requirements, and shortens the time required for convergence. A sketch of this training step is given below.
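The sketch assumes a Keras model `hdr_net` that maps the three LDR frames to one HDR frame; the model name and the learning rate are illustrative assumptions, not values from the patent.

```python
# Hedged sketch of one training step: L2 loss against the reference HDR frame,
# gradients applied with Adam.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)   # learning rate is an assumption

@tf.function
def train_step(hdr_net, ldr1, ldr2, ldr3, hdr_ref):
    with tf.GradientTape() as tape:
        hdr_pred = hdr_net([ldr1, ldr2, ldr3], training=True)
        loss = tf.reduce_mean(tf.square(hdr_pred - hdr_ref))   # L2 loss
    grads = tape.gradient(loss, hdr_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, hdr_net.trainable_variables))
    return loss
```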
(3.3) Tone-map the output: map the HDR image to an LDR image using a global tone mapping technique and store it into a file in png format. After the final tensor is obtained, the HDR image is converted to an LDR image since the result needs to be visualized; the tone mapping output converts the numerical range of the output to 0-255.
The tone mapping function is:
T(x_{i,j}) = log(1 + μ·x_{i,j}) / log(1 + μ)
where μ is equal to 5000 and x_{i,j} is the value of the original tensor; the logarithm is natural (base e).
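A sketch of how this mu-law style mapping could be applied and saved as a PNG follows; the normalization step before compression and the function and file names are assumptions.

```python
# Global tone mapping sketch: mu-law compression with mu = 5000 and natural log,
# scaled to 0-255 and written out with OpenCV.
import cv2
import numpy as np

def tonemap_mu_law(hdr, mu=5000.0):
    hdr = np.clip(hdr, 0.0, None) / max(float(hdr.max()), 1e-8)  # normalize to [0, 1] (assumed step)
    ldr = np.log(1.0 + mu * hdr) / np.log(1.0 + mu)              # mu-law compression
    return (ldr * 255.0).astype(np.uint8)                        # scale to 0-255

# cv2.imwrite("result.png", cv2.cvtColor(tonemap_mu_law(hdr_pred), cv2.COLOR_RGB2BGR))
```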
The neural network design specifically comprises the following steps:
(i) The coding layer adopts a network structure of three convolution levels, as shown in fig. 3. Each level is a convolution layer plus a pooling layer, abstracting features at three levels. The convolution layer uses a kernel of height and width 5, a stride of 2, and VALID padding. So that each convolution halves the height and width of the tensor, mirror (reflection) padding is added on the top, bottom, left and right sides of the tensor, making the height and width of the resulting matrix exactly half of the original. The number of convolution kernels is 64 in the first level, 128 in the second and 256 in the third. The pooling layer uses a linear rectification function (ReLU). The convolution level is shown in fig. 4.
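A hedged TensorFlow/Keras sketch of one such convolution level and the three-level encoder; the exact split of the reflection padding (1 pixel top/left, 2 pixels bottom/right) is an assumption chosen so that the 5x5, stride-2, VALID convolution halves the spatial size exactly.

```python
# Encoder sketch: reflection padding + 5x5/stride-2 VALID convolution + ReLU per level.
import tensorflow as tf
from tensorflow.keras import layers

def encoder_level(x, filters):
    x = tf.pad(x, [[0, 0], [1, 2], [1, 2], [0, 0]], mode="REFLECT")  # mirror padding
    x = layers.Conv2D(filters, kernel_size=5, strides=2, padding="valid")(x)
    return layers.ReLU()(x)

ldr_in = tf.keras.Input(shape=(256, 256, 3))
f1 = encoder_level(ldr_in, 64)     # 128x128x64
f2 = encoder_level(f1, 128)        # 64x64x128
f3 = encoder_level(f2, 256)        # 32x32x256
encoder = tf.keras.Model(ldr_in, [f1, f2, f3])
```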
(ii) The merging layer: after passing through their respective encoders, the three images become three tensors of the same format. The algorithm fuses the three tensors along the third dimension and passes them through one convolution level whose kernel parameters are identical to the encoder's but whose number of kernels is 512. The resulting tensor is then input into the residual network (ResNet). The residual network has nine blocks; each residual block is composed of two consecutive convolution layers and a normalization layer, and each residual block keeps its input and output formats identical. The overall architecture of the merging layer is shown in fig. 5, and the detailed design of the residual block is shown in fig. 6.
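A hedged sketch of the merging layer and one residual block; the 3x3 kernel size inside the residual blocks and the use of batch normalization are assumptions, since the text only specifies two convolution layers plus a normalization layer per block.

```python
# Merging-layer sketch: channel concatenation, one 512-kernel convolution level,
# then nine shape-preserving residual blocks.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=512):
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.Add()([x, y])                 # shortcut keeps input/output format identical

def merge_layer(e1, e2, e3):
    x = layers.Concatenate(axis=-1)([e1, e2, e3])               # fuse along the third dimension
    x = tf.pad(x, [[0, 0], [1, 2], [1, 2], [0, 0]], mode="REFLECT")
    x = layers.Conv2D(512, 5, strides=2, padding="valid")(x)    # encoder kernel parameters, 512 kernels
    x = layers.ReLU()(x)
    for _ in range(9):                                          # nine residual blocks
        x = residual_block(x, 512)
    return x
```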
(iii) The decoding layer uses a four-level network structure corresponding to the merging layer and the encoding layer. Each level is first merged along the third dimension with the same-depth feature taken before the residual network, and then a deconvolution layer and a normalization layer are applied. The height and width of each level's output tensor are twice those of its input, and the number of channels is half that of the input, except for the last level, whose number of channels is 3, so that the tensor format at the final output is identical to that at the input. The overall architecture of the decoding layer is shown in fig. 7.
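A hedged sketch of the four-level decoder; which pre-ResNet features serve as the skip connections at each level (and from which of the three encoders) is an assumption made for illustration.

```python
# Decoder sketch: per level, concatenate a same-resolution skip feature on the
# channel axis, then a stride-2 transposed convolution (with normalization and
# ReLU except at the last level, which outputs 3 channels).
import tensorflow as tf
from tensorflow.keras import layers

def decoder_level(x, skip, filters, last=False):
    if skip is not None:
        x = layers.Concatenate(axis=-1)([x, skip])      # merge on the third dimension
    x = layers.Conv2DTranspose(filters, kernel_size=5, strides=2, padding="same")(x)
    if not last:
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

def decoder(res_out, merged, f3, f2, f1):
    x = decoder_level(res_out, merged, 256)    # 16x16 -> 32x32
    x = decoder_level(x, f3, 128)              # 32x32 -> 64x64
    x = decoder_level(x, f2, 64)               # 64x64 -> 128x128
    return decoder_level(x, f1, 3, last=True)  # 128x128 -> 256x256x3
```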
(4) In this system, the peak signal-to-noise ratio (PSNR) is used as the evaluation criterion. The PSNR ranges over [0, 100], and the higher the PSNR value, the better the quality of the generated picture, i.e. the stronger the neural network's ability to reconstruct the HDR picture. Specifically, PSNR (Peak Signal-to-Noise Ratio) is an objective criterion for evaluating images. In general, after image compression the output image differs from the original image to some extent, and the PSNR value is commonly used to measure whether a processing procedure is satisfactory. It is the logarithm, in dB, of (2^n - 1)^2 (the square of the maximum signal value, where n is the number of bits per sample) relative to the mean square error between the original image and the processed image. The specific calculation formula is as follows:
MSE = (1 / (H·W)) · Σ_{i=0}^{H-1} Σ_{j=0}^{W-1} [X(i, j) - Y(i, j)]²
PSNR = 10 · log_10(MAX_I² / MSE)
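For reference, a direct NumPy sketch of these two formulas, assuming 8-bit images so that MAX_I = 255:

```python
# PSNR sketch: mean square error over the image followed by 10*log10(MAX_I^2 / MSE).
import numpy as np

def psnr(x, y, max_i=255.0):
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)  # MSE over H x W
    return 10.0 * np.log10(max_i ** 2 / mse)                           # PSNR in dB
```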
in summary, the present invention is significant for generating high-quality HDR video from multi-exposure LDR video.
While the invention has been described in connection with specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. An HDR video synthesis method based on deep learning, characterized by comprising the following steps:
(1) selecting video-frame data sets from the LiU HDRv repository, using the Astronauts, bridge and bridge_2 sequences as the training and testing data sets; converting the images to LDR (low dynamic range) format using Luminance HDR 2.5.1;
(2) segmenting the data-set pictures into many small patches to increase the amount of training data; converting them to tfrecord and storing them in the save_train folder; extracting three consecutive LDR frames from the data set each time, applying cropping, normalization, flipping and deformation operations so that the images better match pictures taken in real life, and storing them into tfrecord;
(3) reading the tfrecord to obtain three consecutive LDR frames I1, I2, I3 and feeding them into a neural network for training; the neural network takes the HDR image (Ir) of the middle frame as the reference frame and computes the final loss function; through neural-network learning, an HDR frame with comprehensive information is obtained, thereby enabling the synthesis of HDR video.
2. The deep-learning-based HDR video synthesis method of claim 1, characterized in that the method further comprises:
(4) evaluating the synthesis quality of the HDR video using the peak signal-to-noise ratio (PSNR) as the evaluation criterion; the PSNR ranges over [0, 100], and the higher the PSNR value, the better the quality of the generated picture, i.e. the stronger the neural network's ability to reconstruct the HDR picture.
3. The deep-learning-based HDR video synthesis method of claim 1, characterized in that step (2) is specifically: converting each small patch from BGR to RGB-format LDR; randomly rotating the image by 90, 180 or 270 degrees and randomly mirror-flipping it; cropping the image to 256 × 256, packing 3 LDR images and the corresponding HDR image with a batch size of 20, and converting them into tfrecord for storage; tfrecord is a format for storing a series of binary records.
4. The deep-learning-based HDR video synthesis method of claim 1, characterized in that the neural network design in step (3) is specifically:
(I) the coding layer adopts a network structure of three convolution levels; each level is a convolution layer plus a pooling layer, abstracting features at three levels; the convolution layer uses a kernel of height and width 5, a stride of 2, and VALID padding; so that each convolution halves the height and width of the tensor, mirror (reflection) padding is added on the top, bottom, left and right sides of the tensor, making the height and width of the resulting matrix exactly half of the original; the number of convolution kernels is 64 in the first level, 128 in the second level and 256 in the third level; the pooling layer adopts a linear rectification function (ReLU);
(II) the merging layer: after passing through their respective encoders, the three images become three tensors of the same format; the three tensors are fused along the third dimension and passed through one convolution level whose kernel parameters are the same as the encoder's but whose number of kernels is 512; the resulting tensor is then input into a residual network (ResNet); the residual network has nine blocks; each residual block consists of two consecutive convolution layers and a normalization layer, and each residual block keeps its input and output formats identical;
(III) the decoding layer: the decoder uses a four-level network structure corresponding to the merging layer and the coding layer; each level is first merged along the third dimension with the same-depth feature taken before the residual network, and then a deconvolution layer and a normalization layer are applied; the height and width of each level's output tensor are twice those of its input, and the number of channels is half that of the input, except for the last level, whose number of channels is 3.
5. The HDR video synthesis method based on deep learning as claimed in claim 4, wherein: the loss function in the step (3) is a non-negative real value function, and the specifically adopted loss function is L2 loss; the formula is as follows:
L2 = |f(x) - Y|²
L2' = 2·f'(x)·(f(x) - Y)
where Y is the true value and f(x) is the predicted value.
6. The deep-learning-based HDR video synthesis method of claim 5, characterized in that step (3) is specifically:
(3.1) reading the tfrecord and feeding it into the neural network; the three encoders read three different inputs, transform them through three convolution layers and one normalization layer, and pass them to the merging layer; after further transformation by the ResNet, the data are input to the decoding layer; the tensor of the HDR image is finally obtained through a decoding layer composed of deconvolution and merging operations;
(3.2) using an Adam optimizer, computing the L2 loss between the generated image and the reference frame, and back-propagating to update the weights;
(3.3) tone mapping the output, using a global tone mapping technique to map the HDR image into an LDR image and storing it to a file in png format; after the final tensor is obtained, the HDR image is converted into an LDR image because it needs to be visualized; the numerical range of the output is converted to 0-255 using the tone mapping output; the tone mapping function is:
T(x_{i,j}) = log(1 + μ·x_{i,j}) / log(1 + μ)
where μ is equal to 5000 and x_{i,j} is the value of the original tensor; the logarithm is natural (base e).
7. The deep-learning-based HDR video synthesis method of claim 2, characterized in that the peak signal-to-noise ratio in step (4) is an objective criterion for evaluating images: it is the logarithm, in dB, of (2^n - 1)^2 (the square of the maximum signal value, where n is the number of bits per sample) relative to the mean square error between the original image and the processed image; the specific calculation formula is as follows:
MSE = (1 / (H·W)) · Σ_{i=0}^{H-1} Σ_{j=0}^{W-1} [X(i, j) - Y(i, j)]²
PSNR = 10 · log_10(MAX_I² / MSE)
where MSE represents the mean square error between the current image X and the reference image Y, H and W are the height and width of the images, respectively, and MAX_I is the maximum value of the image pixel color; if each sample is represented by 8 bits, this value is 255.
CN202110252970.0A 2021-03-09 2021-03-09 HDR video synthesis method based on deep learning Pending CN113132655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110252970.0A CN113132655A (en) 2021-03-09 2021-03-09 HDR video synthesis method based on deep learning


Publications (1)

Publication Number Publication Date
CN113132655A true CN113132655A (en) 2021-07-16

Family

ID=76772801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110252970.0A Pending CN113132655A (en) 2021-03-09 2021-03-09 HDR video synthesis method based on deep learning

Country Status (1)

Country Link
CN (1) CN113132655A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2632162A1 (en) * 2012-02-27 2013-08-28 Thomson Licensing Method and device for encoding an HDR video image, method and device for decoding an HDR video image
CN111242883A (en) * 2020-01-10 2020-06-05 西安电子科技大学 Dynamic scene HDR reconstruction method based on deep learning
CN111709896A (en) * 2020-06-18 2020-09-25 三星电子(中国)研发中心 Method and equipment for mapping LDR video into HDR video
CN111835983A (en) * 2020-07-23 2020-10-27 福州大学 Multi-exposure-image high-dynamic-range imaging method and system based on generation countermeasure network


Similar Documents

Publication Publication Date Title
CN109064507B (en) Multi-motion-stream deep convolution network model method for video prediction
CN111192200A (en) Image super-resolution reconstruction method based on fusion attention mechanism residual error network
CN111028308B (en) Steganography and reading method for information in image
CN111709896B (en) Method and equipment for mapping LDR video into HDR video
CN110717868B (en) Video high dynamic range inverse tone mapping model construction and mapping method and device
CN112927202A (en) Method and system for detecting Deepfake video with combination of multiple time domains and multiple characteristics
CN110675321A (en) Super-resolution image reconstruction method based on progressive depth residual error network
CN112288632B (en) Single image super-resolution method and system based on simplified ESRGAN
CN111105376B (en) Single-exposure high-dynamic-range image generation method based on double-branch neural network
CN113096029A (en) High dynamic range image generation method based on multi-branch codec neural network
CN110225260B (en) Three-dimensional high dynamic range imaging method based on generation countermeasure network
CN112381716B (en) Image enhancement method based on generation type countermeasure network
CN114170286B (en) Monocular depth estimation method based on unsupervised deep learning
JP2011522496A (en) Image coding method by texture synthesis
CN115170915A (en) Infrared and visible light image fusion method based on end-to-end attention network
CN116757955A (en) Multi-fusion comparison network based on full-dimensional dynamic convolution
CN115984117A (en) Variational self-coding image super-resolution method and system based on channel attention
CN115222592A (en) Underwater image enhancement method based on super-resolution network and U-Net network and training method of network model
CN113132655A (en) HDR video synthesis method based on deep learning
CN115880158A (en) Blind image super-resolution reconstruction method and system based on variational self-coding
CN115018733A (en) High dynamic range imaging and ghost image removing method based on generation countermeasure network
CN114820354A (en) Traditional image compression and enhancement method based on reversible tone mapping network
CN115587934A (en) Image super-resolution reconstruction and defogging method and system based on loss classification and double-branch network
CN115841523A (en) Double-branch HDR video reconstruction algorithm based on Raw domain
CN114663315A (en) Image bit enhancement method and device for generating countermeasure network based on semantic fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210716)