Disclosure of Invention
Object of the invention: an object is to provide an image night scene processing method based on a convolutional neural network, so as to solve the above problems in the prior art. A further object is to provide a computing module operable to carry out the above method, and a storage medium readable by the computing module.
The technical solution is as follows: an image night scene processing method based on a convolutional neural network comprises the following steps:
step 1, collecting a plurality of groups of RAW format data samples;
step 2, designing a super night scene network model;
step 3, training the super night scene network model in the step 2;
step 4, outputting a result.
In a further embodiment, step 1 further comprises:
step 1-1, shooting different scenes through a plurality of preset image acquisition devices with different models to obtain a plurality of RAW format data samples, taking the RAW format data samples acquired by the image acquisition devices with different models in the same scene as a group of parent samples, dividing the parent samples into different child samples according to the model of the image acquisition device, and marking each sample;
step 1-2, after data sample acquisition is completed, aligning the images, and removing the non-overlapping parts of the images;
Step 1-3, after the image alignment operation of step 1-2, the data samples are further divided into a training set, a verification set and a test set.
In a further embodiment, aligning the images in step 1-2 further includes matching key points of the images and repeatedly iterating over random subsets on the basis of the key points;
wherein matching the key points of the images further proceeds as follows:
step 1-2a, searching all image positions in a preset scale space, and extracting key points, including corner points, edge points, bright points in dark areas and dark points in bright areas, through convolution operations; the scale space L(x, y, σ) is calculated as follows:
L(x, y, σ) = G(x, y, σ) · C(x, y)
wherein C(x, y) represents the midpoint coordinates of the key point, G(x, y, σ) represents a Gaussian kernel function, and σ is a scale space factor taking a constant value;
wherein the Gaussian kernel function is expressed as follows:
G(x, y, σ) = 1/(2πσ²) · exp(−(x² + y²)/(2σ²))
wherein each symbol is as defined above;
step 1-2b, collecting the gradient modulus values of the key points:
m(x, y) = √((L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))²)
step 1-2c, collecting the direction distribution of the key points:
θ(x, y) = arctan((L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)))
wherein each symbol is as defined above;
step 1-2d, calculating the neighborhood points k_i of the key point k:
wherein (x_k, y_k) represents the position of the key point, and the remaining symbols are as defined above.
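For illustration, the key point matching and random-subset iteration of steps 1-2a to 1-2d can be realized with a SIFT + RANSAC pipeline. The following is a minimal Python sketch using OpenCV, in which the function name, the Lowe ratio threshold of 0.75 and the reprojection threshold of 5.0 are illustrative assumptions rather than parameters stated in this disclosure:

```python
import cv2
import numpy as np

def align_pair(ref_img, mov_img, ratio=0.75):
    """Align mov_img onto ref_img with SIFT key points + RANSAC homography.

    `ratio` is the usual Lowe ratio-test threshold (an assumed value)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(ref_img, None)  # corners, edges, blob-like points
    kp2, des2 = sift.detectAndCompute(mov_img, None)

    # Match descriptors and keep only unambiguous matches (Lowe ratio test).
    matches = cv2.BFMatcher().knnMatch(des2, des1, k=2)
    good = [m for m, n in matches if m.distance < ratio * n.distance]

    src = np.float32([kp2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # RANSAC repeatedly iterates over random subsets of the matched key points.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = ref_img.shape[:2]
    return cv2.warpPerspective(mov_img, H, (w, h))
```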
In a further embodiment, the step 2 further includes:
Step 2-1, establishing an SNN super night scene network model, wherein the SNN super night scene network model comprises at least one Encoder network and at least one Decoder network, each network comprising multiple layers; the Encoder network performs multiple downsampling operations, each layer containing at least two 3x3 convolutions, each convolution followed by an activation function and a Switchable Normalization layer, with a 2x2 max-pooling operation of stride 2 finally added for downsampling; the whole Encoder structure is repeated three times;
each step in the Decoder network includes upsampling the feature map, applying a 3x3 convolution that halves the number of feature channels, concatenating the result with the corresponding feature map from the Encoder network, and then applying two 3x3 convolutions to the concatenated feature map, each convolution followed by an activation function and a Switchable Normalization layer; a 3x3 convolution layer is used in the last layer, and the processed image is finally output through pixel_shuffle;
step 2-2, providing a Residual DenseBlock and placing it on the skip-connection; the Residual DenseBlock is composed of 3 DenseBlocks, each DenseBlock containing 5 convolutions, each convolution followed by an activation function and a Switchable Normalization layer, and each layer simultaneously accepting the output feature maps of all previous convolution layers.
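A minimal PyTorch sketch of the Encoder-Decoder pattern of step 2-1 is given below; the channel counts, the 4-channel packed-RAW input, and the use of GroupNorm as a stand-in for the Switchable Normalization layer are assumptions made for illustration (the Residual DenseBlock on the skip-connection is omitted here and sketched later in the detailed description):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by an activation and a normalization
    # layer (GroupNorm stands in here for Switchable Normalization).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2), nn.GroupNorm(8, out_ch),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2), nn.GroupNorm(8, out_ch),
    )

class SNNSketch(nn.Module):
    def __init__(self, in_ch=4, base=32):  # 4 input channels for packed Bayer RAW (assumed)
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]
        self.enc = nn.ModuleList()
        prev = in_ch
        for c in chs:                        # Encoder: three downsampling steps
            self.enc.append(conv_block(prev, c))
            prev = c
        self.pool = nn.MaxPool2d(2)          # 2x2 max pooling, stride 2
        self.up = nn.Upsample(scale_factor=2, mode='nearest')
        self.dec = nn.ModuleList()
        self.halve = nn.ModuleList()
        for c in reversed(chs[:-1]):         # Decoder mirrors the Encoder
            self.halve.append(nn.Conv2d(c * 2, c, 3, padding=1))  # halves channel count
            self.dec.append(conv_block(c * 2, c))
        self.out = nn.Sequential(nn.Conv2d(base, 12, 3, padding=1),
                                 nn.PixelShuffle(2))  # 12 -> 3 channels, 2x upscale

    def forward(self, x):
        skips = []
        for i, blk in enumerate(self.enc):
            x = blk(x)
            if i < len(self.enc) - 1:
                skips.append(x)              # feature map reused on the skip-connection
                x = self.pool(x)
        for h, blk, skip in zip(self.halve, self.dec, reversed(skips)):
            x = h(self.up(x))                # upsample, then halve channels
            x = blk(torch.cat([x, skip], dim=1))
        return self.out(x)
```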
In a further embodiment, in step 3, the training sample set is divided into a low-resolution training set and high-resolution image blocks in the process of training the super night scene network model;
the low-resolution training set is obtained as follows: first, the high-resolution images are downsampled by a factor of N to obtain different low-resolution images; the obtained low-resolution images are then augmented, and each low-resolution image is sampled with overlap to obtain a group of overlapping low-resolution image blocks, which are taken as the low-resolution training set;
the high-resolution image blocks are obtained as follows:
the high-resolution images corresponding to the N-times downsampling operation are sampled with overlap, and the resulting group of corresponding overlapping high-resolution image blocks is taken as the high-resolution label images, N being a positive integer;
the augmentation of the obtained low-resolution images is performed by rotation transformations of 90, 180 and 270 degrees, so as to obtain low-resolution images at different angles;
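As a concrete illustration of the patch pipeline described above, the following sketch cuts each high-resolution image into overlapping low/high-resolution training pairs with rotation augmentation; the patch size, stride, and the nearest-neighbor downsampling stand-in are assumptions:

```python
import numpy as np

def make_pairs(hr, N=2, patch=48, stride=24):
    """Cut one HR image into overlapping LR/HR training pairs.

    stride < patch gives the overlap; N, patch and stride are illustrative
    assumptions, and the HR dimensions are assumed divisible by N."""
    lr = hr[::N, ::N]                       # naive N-times downsampling stand-in
    lr_blocks, hr_blocks = [], []
    for rot in range(4):                    # 0/90/180/270 degree augmentation
        lr_r, hr_r = np.rot90(lr, rot), np.rot90(hr, rot)
        h, w = lr_r.shape[:2]
        for y in range(0, h - patch + 1, stride):
            for x in range(0, w - patch + 1, stride):
                lr_blocks.append(lr_r[y:y + patch, x:x + patch])
                # The HR label block covers the same area, N times larger.
                hr_blocks.append(hr_r[y * N:(y + patch) * N, x * N:(x + patch) * N])
    return np.stack(lr_blocks), np.stack(hr_blocks)
```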
a training convolutional network is then constructed:
first, a low-resolution image LR is taken as input and shallow features are extracted through a convolution layer; deep features of the image are then learned through a plurality of stacked CACB modules; finally, the extracted shallow and deep features are fused, and upsampling is performed by sub-pixel convolution to obtain a high-resolution image;
the CACB module consists of four fusion convolution layers, and one quarter of the features of each fusion convolution layer is reserved for the final feature fusion; the structure of the fusion convolution layer in this module differs between a training stage and a deployment stage;
the loss function used in the training process is:
L_total = 0.5*L_1 + 0.05*L_SSIM + 0.1*L_VGG + L_adv
wherein L_1 is the mean absolute error, L_SSIM is the structural similarity loss, L_VGG is the perceptual loss, and L_adv represents the adversarial loss;
wherein F(·) is the feature map output by the 34th layer of a VGG19 network pre-trained on ImageNet, G(I)_{i,j,k} is the picture generated by the generator, C_{i,j,k} is the corresponding original picture with the bokeh effect, and D(·) is the output of the discriminator.
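The weighted loss above can be assembled as in the following sketch; F.l1_loss gives the mean absolute error, the SSIM term comes from the third-party pytorch_msssim package (an assumed dependency), and vgg_features and disc are assumed callables standing in for the pre-trained VGG19 feature extractor and the discriminator described here:

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # third-party SSIM implementation (assumed dependency)

def total_loss(pred, target, vgg_features, disc):
    """L_total = 0.5*L_1 + 0.05*L_SSIM + 0.1*L_VGG + L_adv (sketch).

    vgg_features: callable returning the pre-trained VGG19 feature map;
    disc: discriminator returning probabilities in (0, 1). Both assumed."""
    l_1 = F.l1_loss(pred, target)                                 # mean absolute error
    l_ssim = 1.0 - ssim(pred, target, data_range=1.0)             # structural similarity term
    l_vgg = F.l1_loss(vgg_features(pred), vgg_features(target))   # perceptual loss
    l_adv = -torch.log(disc(pred) + 1e-8).mean()                  # adversarial (generator) term
    return 0.5 * l_1 + 0.05 * l_ssim + 0.1 * l_vgg + l_adv
```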
In a further embodiment, step 4 further includes acquiring an image through the image acquisition sensor and outputting a night scene image after the image is finally refined by the super night scene network model trained in step 3; bokeh rendering is performed before the night scene image is output, wherein the bokeh rendering model can be constructed as follows:
wherein I_bokeh represents the finally obtained image, I_org represents the original image, ⊙ represents element-wise multiplication of matrices, B_i(·) is the i-th level blur function, and W_i represents the feature weight matrix values of the i-th layer data image; the i-th level blur function B_i(·) is obtained by iterating a shallow blur neural network i times, and is expressed as:
the loss function l adopts a combination of a reconstruction term and the structural similarity SSIM, and the model is optimized by back-propagating the error value; wherein l_1 is:
wherein I_bokeh represents the image with the bokeh effect generated by the model, and Î represents the original image actually having the bokeh effect; l_1 is computed between the generated image I_bokeh and the actual image Î;
the structural similarity is as follows:
SSIM(I_bokeh, Î) = [l(I_bokeh, Î)]^α · [c(I_bokeh, Î)]^β · [s(I_bokeh, Î)]^γ
wherein α, β and γ are preset constants, l(·,·) represents the brightness relationship between the generated image I_bokeh and the actual image Î, c(·,·) represents the contrast relationship between them, and s(·,·) represents the structural relationship between them.
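The formulas of the rendering model are not reproduced above, so the following sketch is only one plausible reading of the symbol definitions: the output is composed from I_org and per-level blurred versions B_i(I_org), weighted element-wise by maps W_i. The weighted-sum form, the layer sizes and the softmax normalization are all assumptions:

```python
import torch
import torch.nn as nn

class BokehSketch(nn.Module):
    """One plausible reading of the rendering model: I_bokeh is composed from
    I_org and per-level blurred versions B_i(I_org), weighted element-wise by
    predicted W_i maps. The summation form itself is an assumption; the text
    only defines the symbols."""
    def __init__(self, levels=3):
        super().__init__()
        # Shallow blur network; applying it i times yields B_i(.).
        self.blur = nn.Sequential(nn.Conv2d(3, 3, 5, padding=2), nn.ReLU(),
                                  nn.Conv2d(3, 3, 5, padding=2))
        # Predicts one weight map W_i per blur level (plus the sharp level).
        self.weights = nn.Conv2d(3, levels + 1, 3, padding=1)
        self.levels = levels

    def forward(self, i_org):
        w = torch.softmax(self.weights(i_org), dim=1)   # weights sum to 1 per pixel
        out = w[:, 0:1] * i_org                         # sharp component
        x = i_org
        for i in range(self.levels):
            x = self.blur(x)                            # B_{i+1}: blur applied again
            out = out + w[:, i + 1:i + 2] * x           # element-wise weighting
        return out
```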
An image night scene processing system based on a convolutional neural network comprises: a first module for collecting a plurality of groups of RAW format data samples; a second module for establishing a super night scene network model; a third module for training the super night scene network model; and a fourth module for performing bokeh rendering on the night scene image before output.
The first module further shoots different scenes through a plurality of preset image acquisition devices of different models to obtain a plurality of RAW format data samples; the RAW format data samples acquired by the image acquisition devices of different models in the same scene are taken as a group of parent samples, the parent samples are divided into different child samples according to the model of the image acquisition device, and each sample is marked;
after the data sample acquisition is completed, an alignment operation is carried out on the images, and the non-overlapping parts of the images are removed; the image alignment operation comprises matching key points of the images and repeatedly iterating over random subsets on this basis;
wherein matching the key points of the images further proceeds as follows:
searching all image positions in a preset scale space, and extracting key points, including corner points, edge points, bright points in dark areas and dark points in bright areas, through convolution operations; the scale space L(x, y, σ) is calculated as follows:
L(x, y, σ) = G(x, y, σ) · C(x, y)
wherein C(x, y) represents the midpoint coordinates of the key point, G(x, y, σ) represents a Gaussian kernel function, and σ is a scale space factor taking a constant value;
wherein the Gaussian kernel function is expressed as follows:
G(x, y, σ) = 1/(2πσ²) · exp(−(x² + y²)/(2σ²))
wherein each symbol is as defined above;
collecting the gradient modulus values of the key points:
m(x, y) = √((L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))²)
collecting the direction distribution of the key points:
θ(x, y) = arctan((L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)))
wherein each symbol is as defined above;
calculating the neighborhood points k_i of the key point k:
wherein (x_k, y_k) represents the position of the key point, and the remaining symbols are as defined above;
dividing the data sample into a training set, a verification set and a test set;
the second module is further configured to establish an SNN super night scene network model comprising at least one Encoder network and at least one Decoder network, each network including multiple layers; the Encoder network performs multiple downsampling operations, each layer containing at least two 3x3 convolutions, each convolution followed by an activation function and a Switchable Normalization layer, with a 2x2 max-pooling operation of stride 2 finally added for downsampling; the whole Encoder structure is repeated three times;
each step in the Decoder network includes upsampling the feature map, applying a 3x3 convolution that halves the number of feature channels, concatenating the result with the corresponding feature map from the Encoder network, and then applying two 3x3 convolutions to the concatenated feature map, each convolution followed by an activation function and a Switchable Normalization layer; a 3x3 convolution layer is used in the last layer, and the processed image is finally output through pixel_shuffle;
a Residual DenseBlock is provided and placed on the skip-connection; the Residual DenseBlock is composed of 3 DenseBlocks, each DenseBlock containing 5 convolutions, each convolution followed by an activation function and a Switchable Normalization layer, and each layer simultaneously accepting the output feature maps of all previous convolution layers;
the third module is further configured to divide the training sample set into a low-resolution training set and high-resolution image blocks;
the low-resolution training set is obtained as follows: first, the high-resolution images are downsampled by a factor of N to obtain different low-resolution images; the obtained low-resolution images are then augmented, and each low-resolution image is sampled with overlap to obtain a group of overlapping low-resolution image blocks, which are taken as the low-resolution training set;
the high-resolution image blocks are obtained as follows:
the high-resolution images corresponding to the N-times downsampling operation are sampled with overlap, and the resulting group of corresponding overlapping high-resolution image blocks is taken as the high-resolution label images, N being a positive integer;
the augmentation of the obtained low-resolution images is performed by rotation transformations of 90, 180 and 270 degrees, so as to obtain low-resolution images at different angles;
a training convolutional network is then constructed:
first, a low-resolution image LR is taken as input and shallow features are extracted through a convolution layer; deep features of the image are then learned through a plurality of stacked CACB modules; finally, the extracted shallow and deep features are fused, and upsampling is performed by sub-pixel convolution to obtain a high-resolution image;
the CACB module consists of four fusion convolution layers, and one quarter of the features of each fusion convolution layer is reserved for the final feature fusion; the structure of the fusion convolution layer in this module differs between a training stage and a deployment stage;
the loss function used in the training process is:
L_total = 0.5*L_1 + 0.05*L_SSIM + 0.1*L_VGG + L_adv
wherein L_1 is the mean absolute error, L_SSIM is the structural similarity loss, L_VGG is the perceptual loss, and L_adv represents the adversarial loss;
wherein F(·) is the feature map output by the 34th layer of a VGG19 network pre-trained on ImageNet, G(I)_{i,j,k} is the picture generated by the generator, C_{i,j,k} is the corresponding original picture with the bokeh effect, and D(·) is the output of the discriminator.
The fourth module is further configured to construct a bokeh rendering model:
wherein I_bokeh represents the finally obtained image, I_org represents the original image, ⊙ represents element-wise multiplication of matrices, B_i(·) is the i-th level blur function, and W_i represents the feature weight matrix values of the i-th layer data image; the i-th level blur function B_i(·) is obtained by iterating a shallow blur neural network i times, and is expressed as:
the loss function l adopts a combination of a reconstruction term and the structural similarity SSIM, and the model is optimized by back-propagating the error value; wherein l_1 is:
wherein I_bokeh represents the image with the bokeh effect generated by the model, and Î represents the original image actually having the bokeh effect; l_1 is computed between the generated image I_bokeh and the actual image Î;
the structural similarity is as follows:
SSIM(I_bokeh, Î) = [l(I_bokeh, Î)]^α · [c(I_bokeh, Î)]^β · [s(I_bokeh, Î)]^γ
wherein α, β and γ are preset constants, l(·,·) represents the brightness relationship between the generated image I_bokeh and the actual image Î, c(·,·) represents the contrast relationship between them, and s(·,·) represents the structural relationship between them.
A computing module comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program also being stored on a computer-readable storage medium; the computing module is configured to have the processor execute the computer program, thereby performing the following steps:
Step 1, data acquisition. Night scenes are shot with a number of low-end mobile phones and the RAW format data is taken out; at the same time, long-exposure night scene RGB pictures are shot with a DSLR camera, giving about 100,000 pairs in total. After the data set is collected, a SIFT key point matching algorithm and the RANSAC algorithm are used for the alignment operation, and the non-overlapping parts of the images are removed. After matching is completed, the data is divided into a training set, a verification set and a test set.
Step 1-1, shooting different scenes through a plurality of preset image acquisition devices with different models to obtain a plurality of RAW format data samples, taking the RAW format data samples acquired by the image acquisition devices with different models in the same scene as a group of parent samples, dividing the parent samples into different child samples according to the model of the image acquisition device, and marking each sample;
Step 1-2, after data sample acquisition is completed, an alignment operation is carried out on the images, and the non-overlapping parts of the images are removed; aligning the images in step 1-2 further comprises matching key points of the images and repeatedly iterating over random subsets on this basis;
wherein matching the key points of the images further proceeds as follows:
step 1-2a, searching all image positions in a preset scale space, and extracting key points, including corner points, edge points, bright points in dark areas and dark points in bright areas, through convolution operations; the scale space L(x, y, σ) is calculated as follows:
L(x, y, σ) = G(x, y, σ) · C(x, y)
wherein C(x, y) represents the midpoint coordinates of the key point, G(x, y, σ) represents a Gaussian kernel function, and σ is a scale space factor taking a constant value;
wherein the Gaussian kernel function is expressed as follows:
G(x, y, σ) = 1/(2πσ²) · exp(−(x² + y²)/(2σ²))
wherein each symbol is as defined above;
step 1-2b, collecting the gradient modulus values of the key points:
m(x, y) = √((L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))²)
step 1-2c, collecting the direction distribution of the key points:
θ(x, y) = arctan((L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)))
wherein each symbol is as defined above;
step 1-2d, calculating the neighborhood points k_i of the key point k:
wherein (x_k, y_k) represents the position of the key point, and the remaining symbols are as defined above.
Step 1-3, after the image alignment operation of step 1-2, the data samples are further divided into a training set, a verification set and a test set.
Step 2, model design. Referring to fig. 1, and in detail to fig. 2, we first propose the Super Night Network (SNN, super night scene network), whose main body is an Encoder-Decoder structure. It consists of an Encoder (left side) and a Decoder (right side). The Encoder is similar to a common classification network: there are multiple downsampling steps, and the feature maps change from large and narrow to small and wide. Each layer contains two 3x3 convolutions, each followed by an activation function (LeakyReLU) and a Switchable Normalization layer; finally a 2x2 max-pooling operation with stride 2 is added for downsampling. The number of feature channels doubles at each downsampling step. The entire Encoder repeats this step three times.
Each step in the Decoder involves upsampling the feature map, here using nearest-neighbor interpolation, followed by a 3x3 convolution that halves the number of feature channels; the result is concatenated with the corresponding (specially processed) feature map from the Encoder, followed by two 3x3 convolutions, again each followed by a LeakyReLU activation and a Switchable Normalization layer. At the last layer, only one 3x3 convolution layer is used, and the processed image is output through pixel_shuffle.
Here, to obtain more information from the RAW data, we use a skip-connection structure and place a Residual DenseBlock on the skip-connection. The Residual DenseBlock consists of 3 DenseBlocks, each with 5 convolution layers inside, each convolution followed by an activation function (LeakyReLU) and a Switchable Normalization layer, while each layer accepts the output feature maps of all preceding convolutions; this operation is a concatenation. In addition, to obtain more effective information, we add a ChannelAttention module after each DenseBlock. It consists of an average pooling layer, two 3x3 convolution layers, and nonlinear ReLU and Sigmoid transformations; the connection mode is shown in fig. 3.
Step 3, model training. Based on the above model and data set, we achieved rapid training of the model using distributed training, which took only 2.5 hours. In the process of training the super night scene network model, the training sample set is divided into a low-resolution training set and high-resolution image blocks;
the low-resolution training set is obtained as follows: first, the high-resolution images are downsampled by a factor of N to obtain different low-resolution images; the obtained low-resolution images are then augmented, and each low-resolution image is sampled with overlap to obtain a group of overlapping low-resolution image blocks, which are taken as the low-resolution training set;
the high-resolution image blocks are obtained as follows:
the high-resolution images corresponding to the N-times downsampling operation are sampled with overlap, and the resulting group of corresponding overlapping high-resolution image blocks is taken as the high-resolution label images, N being a positive integer;
the augmentation of the obtained low-resolution images is performed by rotation transformations of 90, 180 and 270 degrees, so as to obtain low-resolution images at different angles;
a training convolutional network is then constructed:
first, a low-resolution image LR is taken as input and shallow features are extracted through a convolution layer; deep features of the image are then learned through a plurality of stacked CACB modules; finally, the extracted shallow and deep features are fused, and upsampling is performed by sub-pixel convolution to obtain a high-resolution image;
the CACB module consists of four fusion convolution layers, and one quarter of the features of each fusion convolution layer is reserved for the final feature fusion; the structure of the fusion convolution layer in this module differs between a training stage and a deployment stage;
the loss function used in the training process is:
L_total = 0.5*L_1 + 0.05*L_SSIM + 0.1*L_VGG + L_adv
wherein L_1 is the mean absolute error, L_SSIM is the structural similarity loss, L_VGG is the perceptual loss, and L_adv represents the adversarial loss;
wherein F(·) is the feature map output by the 34th layer of a VGG19 network pre-trained on ImageNet, G(I)_{i,j,k} is the picture generated by the generator, C_{i,j,k} is the corresponding original picture with the bokeh effect, and D(·) is the output of the discriminator.
Step 4, outputting a result; before the result is output, the image is passed through a preset bokeh rendering model, which is constructed as follows:
wherein I_bokeh represents the finally obtained image, I_org represents the original image, ⊙ represents element-wise multiplication of matrices, B_i(·) is the i-th level blur function, and W_i represents the feature weight matrix values of the i-th layer data image; the i-th level blur function B_i(·) is obtained by iterating a shallow blur neural network i times, and is expressed as:
the loss function l adopts a combination of a reconstruction term and the structural similarity SSIM, and the model is optimized by back-propagating the error value; wherein l_1 is:
wherein I_bokeh represents the image with the bokeh effect generated by the model, and Î represents the original image actually having the bokeh effect; l_1 is computed between the generated image I_bokeh and the actual image Î;
the structural similarity is as follows:
SSIM(I_bokeh, Î) = [l(I_bokeh, Î)]^α · [c(I_bokeh, Î)]^β · [s(I_bokeh, Î)]^γ
wherein α, β and γ are preset constants, l(·,·) represents the brightness relationship between the generated image I_bokeh and the actual image Î, c(·,·) represents the contrast relationship between them, and s(·,·) represents the structural relationship between them.
A storage medium readable by the computing module, having stored thereon a computer program which, when executed by a processor, performs the following process:
Step 1, data acquisition. Night scenes are shot with a number of low-end mobile phones and the RAW format data is taken out; at the same time, long-exposure night scene RGB pictures are shot with a DSLR camera, giving about 100,000 pairs in total. After the data set is collected, a SIFT key point matching algorithm and the RANSAC algorithm are used for the alignment operation, and the non-overlapping parts of the images are removed. After matching is completed, the data is divided into a training set, a verification set and a test set.
Step 1-1, shooting different scenes through a plurality of preset image acquisition devices with different models to obtain a plurality of RAW format data samples, taking the RAW format data samples acquired by the image acquisition devices with different models in the same scene as a group of parent samples, dividing the parent samples into different child samples according to the model of the image acquisition device, and marking each sample;
Step 1-2, after data sample acquisition is completed, an alignment operation is carried out on the images, and the non-overlapping parts of the images are removed; aligning the images in step 1-2 further comprises matching key points of the images and repeatedly iterating over random subsets on this basis;
wherein matching the key points of the images further proceeds as follows:
step 1-2a, searching all image positions in a preset scale space, and extracting key points, including corner points, edge points, bright points in dark areas and dark points in bright areas, through convolution operations; the scale space L(x, y, σ) is calculated as follows:
L(x, y, σ) = G(x, y, σ) · C(x, y)
wherein C(x, y) represents the midpoint coordinates of the key point, G(x, y, σ) represents a Gaussian kernel function, and σ is a scale space factor taking a constant value;
wherein the Gaussian kernel function is expressed as follows:
G(x, y, σ) = 1/(2πσ²) · exp(−(x² + y²)/(2σ²))
wherein each symbol is as defined above;
step 1-2b, collecting the gradient modulus values of the key points:
m(x, y) = √((L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))²)
step 1-2c, collecting the direction distribution of the key points:
θ(x, y) = arctan((L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)))
wherein each symbol is as defined above;
step 1-2d, calculating the neighborhood points k_i of the key point k:
wherein (x_k, y_k) represents the position of the key point, and the remaining symbols are as defined above.
Step 1-3, after the image alignment operation of step 1-2, the data samples are further divided into a training set, a verification set and a test set.
Step 2, model design. Referring to fig. 1, and in detail to fig. 2, we first propose the Super Night Network (SNN, super night scene network), whose main body is an Encoder-Decoder structure. It consists of an Encoder (left side) and a Decoder (right side). The Encoder is similar to a common classification network: there are multiple downsampling steps, and the feature maps change from large and narrow to small and wide. Each layer contains two 3x3 convolutions, each followed by an activation function (LeakyReLU) and a Switchable Normalization layer; finally a 2x2 max-pooling operation with stride 2 is added for downsampling. The number of feature channels doubles at each downsampling step. The entire Encoder repeats this step three times.
Each step in the Decoder involves upsampling the feature map, here using nearest-neighbor interpolation, followed by a 3x3 convolution that halves the number of feature channels; the result is concatenated with the corresponding (specially processed) feature map from the Encoder, followed by two 3x3 convolutions, again each followed by a LeakyReLU activation and a Switchable Normalization layer. At the last layer, only one 3x3 convolution layer is used, and the processed image is output through pixel_shuffle.
Here, to obtain more information from the RAW data, we use a skip-connection structure and place a Residual DenseBlock on the skip-connection. The Residual DenseBlock consists of 3 DenseBlocks, each with 5 convolution layers inside, each convolution followed by an activation function (LeakyReLU) and a Switchable Normalization layer, while each layer accepts the output feature maps of all preceding convolutions; this operation is a concatenation. In addition, to obtain more effective information, we add a ChannelAttention module after each DenseBlock. It consists of an average pooling layer, two 3x3 convolution layers, and nonlinear ReLU and Sigmoid transformations; the connection mode is shown in fig. 3.
Step 3, model training. Based on the above model and data set, we achieved rapid training of the model using distributed training, which took only 2.5 hours. In the process of training the super night scene network model, the training sample set is divided into a low-resolution training set and high-resolution image blocks;
the low-resolution training set is obtained as follows: first, the high-resolution images are downsampled by a factor of N to obtain different low-resolution images; the obtained low-resolution images are then augmented, and each low-resolution image is sampled with overlap to obtain a group of overlapping low-resolution image blocks, which are taken as the low-resolution training set;
the high-resolution image blocks are obtained as follows:
the high-resolution images corresponding to the N-times downsampling operation are sampled with overlap, and the resulting group of corresponding overlapping high-resolution image blocks is taken as the high-resolution label images, N being a positive integer;
the augmentation of the obtained low-resolution images is performed by rotation transformations of 90, 180 and 270 degrees, so as to obtain low-resolution images at different angles;
a training convolutional network is then constructed:
first, a low-resolution image LR is taken as input and shallow features are extracted through a convolution layer; deep features of the image are then learned through a plurality of stacked CACB modules; finally, the extracted shallow and deep features are fused, and upsampling is performed by sub-pixel convolution to obtain a high-resolution image;
the CACB module consists of four fusion convolution layers, and one quarter of the features of each fusion convolution layer is reserved for the final feature fusion; the structure of the fusion convolution layer in this module differs between a training stage and a deployment stage;
the loss function used in the training process is:
L_total = 0.5*L_1 + 0.05*L_SSIM + 0.1*L_VGG + L_adv
wherein L_1 is the mean absolute error, L_SSIM is the structural similarity loss, L_VGG is the perceptual loss, and L_adv represents the adversarial loss;
wherein F(·) is the feature map output by the 34th layer of a VGG19 network pre-trained on ImageNet, G(I)_{i,j,k} is the picture generated by the generator, C_{i,j,k} is the corresponding original picture with the bokeh effect, and D(·) is the output of the discriminator.
Step 4, outputting a result.
The beneficial effects are that: the invention relates to an image night scene processing method based on a convolutional neural network, and further relates to a computing module capable of running the method and a storage medium readable by the computing module. By building and training the super night scene network model, a super night scene picture with excellent appearance can be obtained simply by taking RAW data from the camera CMOS sensor; this avoids the image shake and ghosting caused by the long exposure of the traditional night scene function, and further avoids the influence of image shake and ghosting when images are synthesized using AI techniques.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the invention may be practiced without one or more of these details. In other instances, well-known features have not been described in detail in order to avoid obscuring the invention.
The applicant believes that the current super night scene function works by shooting multiple photos at different ISO and exposure settings through long exposure and then synthesizing them; however, the exposure time is often several seconds or more, so the requirements on mobile phone hardware and software algorithms are very high. In addition, it is difficult for the super night scene mode to capture high-quality pictures during long exposure, because the human hand inevitably shakes slightly; if no algorithm is used for correction, the mobile phone will incorporate these shaken frames, resulting in problems in the picture. Some mobile phone manufacturers use AI techniques to remove the blurred photos, then automatically align the scenes through system identification, and finally synthesize them.
To solve the above problems, we propose using a convolutional neural network to realize the super night scene mode; with our algorithm, a super night scene picture with excellent look and feel can be obtained simply by taking RAW data from the camera CMOS sensor. The specific algorithm flow is as follows:
Step 1, data acquisition. Night scenes are shot with a number of low-end mobile phones and the RAW format data is taken out; at the same time, long-exposure night scene RGB pictures are shot with a DSLR camera, giving about 100,000 pairs in total. After the data set is collected, a SIFT key point matching algorithm and the RANSAC algorithm are used for the alignment operation, and the non-overlapping parts of the images are removed. After matching is completed, the data is divided into a training set, a verification set and a test set.
Step 1-1, shooting different scenes through a plurality of preset image acquisition devices with different models to obtain a plurality of RAW format data samples, taking the RAW format data samples acquired by the image acquisition devices with different models in the same scene as a group of parent samples, dividing the parent samples into different child samples according to the model of the image acquisition device, and marking each sample;
Step 1-2, after data sample acquisition is completed, an alignment operation is carried out on the images, and the non-overlapping parts of the images are removed; aligning the images in step 1-2 further comprises matching key points of the images and repeatedly iterating over random subsets on this basis;
wherein matching the key points of the images further proceeds as follows:
step 1-2a, searching all image positions in a preset scale space, and extracting key points, including corner points, edge points, bright points in dark areas and dark points in bright areas, through convolution operations; the scale space L(x, y, σ) is calculated as follows:
L(x, y, σ) = G(x, y, σ) · C(x, y)
wherein C(x, y) represents the midpoint coordinates of the key point, G(x, y, σ) represents a Gaussian kernel function, and σ is a scale space factor taking a constant value;
wherein the Gaussian kernel function is expressed as follows:
G(x, y, σ) = 1/(2πσ²) · exp(−(x² + y²)/(2σ²))
wherein each symbol is as defined above;
step 1-2b, collecting the gradient modulus values of the key points:
m(x, y) = √((L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))²)
step 1-2c, collecting the direction distribution of the key points:
θ(x, y) = arctan((L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)))
wherein each symbol is as defined above;
step 1-2d, calculating the neighborhood points k_i of the key point k:
wherein (x_k, y_k) represents the position of the key point, and the remaining symbols are as defined above.
Step 1-3, after the image alignment operation of step 1-2, the data samples are further divided into a training set, a verification set and a test set.
Step 2, model design. Referring to fig. 1, and in detail to fig. 2, we first propose the Super Night Network (SNN, super night scene network), shown in fig. 2, whose main body is an Encoder-Decoder structure. It consists of an Encoder (left side) and a Decoder (right side). The Encoder is similar to a common classification network: there are multiple downsampling steps, and the feature maps change from large and narrow to small and wide. Each layer contains two 3x3 convolutions, each followed by an activation function (LeakyReLU) and a Switchable Normalization layer; finally a 2x2 max-pooling operation with stride 2 is added for downsampling. The number of feature channels doubles at each downsampling step. The entire Encoder repeats this step three times.
Each step in the Decoder involves upsampling the feature map, here using nearest-neighbor interpolation, followed by a 3x3 convolution that halves the number of feature channels; the result is concatenated with the corresponding (specially processed) feature map from the Encoder, followed by two 3x3 convolutions, again each followed by a LeakyReLU activation and a Switchable Normalization layer. At the last layer, only one 3x3 convolution layer is used, and the processed image is output through pixel_shuffle.
Here, to obtain more information from the RAW data, we use a skip-connection structure and place a Residual DenseBlock on the skip-connection. The Residual DenseBlock consists of 3 DenseBlocks, each with 5 convolution layers inside, each convolution followed by an activation function (LeakyReLU) and a Switchable Normalization layer, while each layer accepts the output feature maps of all preceding convolutions; this operation is a concatenation. In addition, to obtain more effective information, we add a ChannelAttention module after each DenseBlock. It consists of an average pooling layer, two 3x3 convolution layers, and nonlinear ReLU and Sigmoid transformations; the connection mode is shown in fig. 3.
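A PyTorch sketch of the DenseBlock and ChannelAttention wiring described above follows; the channel counts, growth rate and reduction factor are assumptions, and GroupNorm again stands in for Switchable Normalization:

```python
import torch
import torch.nn as nn

class DenseBlockSketch(nn.Module):
    """5 convolutions; each layer takes the concatenation of all previous
    outputs (channel count and growth rate are assumptions)."""
    def __init__(self, ch=64, growth=32):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(5):
            out = ch if i == 4 else growth          # last conv maps back to ch
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch + i * growth, out, 3, padding=1),
                nn.LeakyReLU(0.2), nn.GroupNorm(8, out)))

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # dense connectivity
        return feats[-1] + x                              # residual over the block

class ChannelAttentionSketch(nn.Module):
    """Average pooling, two 3x3 convolutions, ReLU then Sigmoid gate."""
    def __init__(self, ch=64, red=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // red, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch // red, ch, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)   # re-weight channels by their global statistics
```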
Step 3, model training. Based on the above model and data set, we achieved rapid training of the model using distributed training, which took only 2.5 hours. In the process of training the super night scene network model, the training sample set is divided into a low-resolution training set and high-resolution image blocks;
the low-resolution training set is obtained as follows: first, the high-resolution images are downsampled by a factor of N to obtain different low-resolution images; the obtained low-resolution images are then augmented, and each low-resolution image is sampled with overlap to obtain a group of overlapping low-resolution image blocks, which are taken as the low-resolution training set;
the high-resolution image blocks are obtained as follows:
the high-resolution images corresponding to the N-times downsampling operation are sampled with overlap, and the resulting group of corresponding overlapping high-resolution image blocks is taken as the high-resolution label images, N being a positive integer;
the augmentation of the obtained low-resolution images is performed by rotation transformations of 90, 180 and 270 degrees, so as to obtain low-resolution images at different angles;
A training convolutional network is then constructed:
first, a low-resolution image LR is taken as input and shallow features are extracted through a convolution layer; deep features of the image are then learned through a plurality of stacked CACB modules; finally, the extracted shallow and deep features are fused, and upsampling is performed by sub-pixel convolution to obtain a high-resolution image;
the CACB module consists of four fusion convolution layers, and one quarter of the features of each fusion convolution layer is reserved for the final feature fusion; the structure of the fusion convolution layer in this module differs between a training stage and a deployment stage;
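The CACB description is brief, so the sketch below is only a guess at its shape: four fusion convolution layers, each reserving a quarter of its feature channels for the final fusion; the training-stage/deployment-stage re-parameterization of the fusion convolution is not modeled:

```python
import torch
import torch.nn as nn

class CACBSketch(nn.Module):
    """Four fusion convolution layers; one quarter of each layer's features is
    reserved and the four quarters are fused at the end. All channel
    bookkeeping here is an assumption based on the one-line description."""
    def __init__(self, ch=64):
        super().__init__()
        assert ch % 4 == 0
        self.convs = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2))
            for _ in range(4))
        self.fuse = nn.Conv2d(ch, ch, 1)            # final feature fusion

    def forward(self, x):
        quarter = x.size(1) // 4
        reserved = []
        for conv in self.convs:
            x = conv(x)
            reserved.append(x[:, :quarter])         # reserve a quarter of the features
        return self.fuse(torch.cat(reserved, dim=1))
```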
the loss function used in the training process is:
L_total = 0.5*L_1 + 0.05*L_SSIM + 0.1*L_VGG + L_adv
wherein L_1 is the mean absolute error, L_SSIM is the structural similarity loss, L_VGG is the perceptual loss, and L_adv represents the adversarial loss;
wherein F(·) is the feature map output by the 34th layer of a VGG19 network pre-trained on ImageNet, G(I)_{i,j,k} is the picture generated by the generator, C_{i,j,k} is the corresponding original picture with the bokeh effect, and D(·) is the output of the discriminator.
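A sketch of the perceptual term: features are taken from a frozen VGG19 pre-trained on ImageNet via torchvision; whether "the 34th layer" counts activations or convolutions is ambiguous here, so the slice index is an assumption:

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class VGGPerceptual(nn.Module):
    def __init__(self, layer=34):
        super().__init__()
        # vgg19().features is an nn.Sequential; slicing up to `layer`
        # approximates "the 34th layer output" (index convention assumed).
        self.features = vgg19(weights='IMAGENET1K_V1').features[:layer].eval()
        for p in self.features.parameters():
            p.requires_grad = False        # the VGG network stays frozen

    def forward(self, pred, target):
        return F.l1_loss(self.features(pred), self.features(target))
```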
Step 4, outputting a result.
Before outputting the result, the image is passed through a preset bokeh rendering model, which is constructed as follows:
wherein I_bokeh represents the finally obtained image, I_org represents the original image, ⊙ represents element-wise multiplication of matrices, B_i(·) is the i-th level blur function, and W_i represents the feature weight matrix values of the i-th layer data image; the i-th level blur function B_i(·) is obtained by iterating a shallow blur neural network i times, and is expressed as:
the loss function l adopts a combination of a reconstruction term and the structural similarity SSIM, and the model is optimized by back-propagating the error value; wherein l_1 is:
wherein I_bokeh represents the image with the bokeh effect generated by the model, and Î represents the original image actually having the bokeh effect; l_1 is computed between the generated image I_bokeh and the actual image Î;
the structural similarity is as follows:
SSIM(I_bokeh, Î) = [l(I_bokeh, Î)]^α · [c(I_bokeh, Î)]^β · [s(I_bokeh, Î)]^γ
wherein α, β and γ are preset constants, l(·,·) represents the brightness relationship between the generated image I_bokeh and the actual image Î, c(·,·) represents the contrast relationship between them, and s(·,·) represents the structural relationship between them.
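The brightness, contrast and structure relationships referenced above follow the standard SSIM construction; the sketch below computes the three terms globally per image for clarity (practical SSIM uses local windows), and the stabilizing constants C1, C2, C3 take their usual assumed values:

```python
import torch

def ssim_components(x, y, C1=0.01**2, C2=0.03**2):
    """Global (whole-image) luminance l, contrast c and structure s terms of
    SSIM between generated image x and actual image y; SSIM = l^a * c^b * s^g.
    Windowed computation and the constants are standard SSIM conventions."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    C3 = C2 / 2
    l = (2 * mx * my + C1) / (mx**2 + my**2 + C1)          # brightness relationship
    c = (2 * vx.sqrt() * vy.sqrt() + C2) / (vx + vy + C2)  # contrast relationship
    s = (cov + C3) / (vx.sqrt() * vy.sqrt() + C3)          # structural relationship
    return l, c, s
```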
In fig. 4, the left side is a night scene picture taken by a Redmi 8, and the right side is the night scene picture obtained by our network (both use the RAW data of the Redmi 8). It is obvious that our result is richer in detail than the Redmi 8 picture, with softer colors, better matching human visual perception.
In conclusion, the above algorithm flow can effectively meet the shooting requirements of low-end mobile phones in night scenes, at a lower cost. For consumers this means buying a low-end mobile phone at low cost while still obtaining the super night scene technology of high-end phones. In addition, the requirement of the super night scene algorithm on hardware is greatly reduced; if matched with an NPU chip such as the Airia, the cost is further reduced and the cost performance of the mobile phone is improved.
As described above, although the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.