Background
The camera sensor of a mobile phone has a small area and poor light-gathering capability, so it cannot capture enough natural light in dim conditions; as a result, pictures are noisy, under-exposed, and weak in resolving power. When the handheld long-exposure super night scene function is enabled, however, picture brightness is greatly improved and light-and-dark details stand out; even though the overall brightness rises sharply, highlight areas do not become over-exposed.
At present, the super night scene function appears on more and more mobile phones. Its principle is to capture, through long exposure, multiple pictures at different ISO values and exposures and then synthesize them.
However, the super night scene function is demanding: the exposure time is usually several seconds or more, which places very high requirements on the phone's hardware and software algorithms. Moreover, it is difficult to select high-quality frames during a long exposure, because the hand inevitably shakes slightly and uncontrollably; if no algorithm is used to screen the frames, shaken pictures will be merged into the result and the final picture will suffer.
Disclosure of Invention
The purpose of the invention is as follows: one object is to provide an image night scene processing method based on a convolutional neural network, so as to solve the above problems in the prior art. A further object is to provide a computing module that can execute the above method and a storage medium readable by the computing module.
The technical scheme is as follows: an image night scene processing method based on a convolutional neural network comprises the following steps:
step 1, collecting a plurality of groups of RAW format data samples;
step 2, designing a super night scene network model;
step 3, training the super night scene network model in the step 2;
step 4, outputting the result.
In a further embodiment, step 1 further comprises:
step 1-1, shooting different scenes with a plurality of preset image acquisition devices of different models to obtain a plurality of RAW format data samples; taking the RAW format data samples acquired by the different models of image acquisition device in the same scene as a group of mother samples; dividing each mother sample into different sub-samples according to the model of the image acquisition device; and labeling each sample;
step 1-2, after data sample collection is completed, aligning the images, and removing non-overlapping parts of the images;
step 1-3, after the image alignment operation of step 1-2, further dividing the data samples into a training set, a validation set and a test set.
In a further embodiment, aligning the images in step 1-2 further comprises matching key points of the images and, on that basis, iterating repeatedly over a plurality of random subsets;
wherein the key points of the images are matched as follows:
step 1-2a, searching all image positions over a preset scale space, and extracting key points, including corner points, edge points, bright points in dark areas, and dark points in bright areas, through convolution operations; the scale space L(x, y, σ) is computed as:
L(x,y,σ)=G(x,y,σ)·C(x,y)
in the formula, C(x, y) denotes the coordinates of the key point, G(x, y, σ) denotes the Gaussian kernel function, and σ is the scale-space factor, taken as a fixed value;
wherein the Gaussian kernel function is expressed as follows:
where each symbol has the same meaning as above;
step 1-2b, collecting the gradient magnitude of each key point:
step 1-2c, collecting the orientation distribution of each key point:
where each symbol has the same meaning as above;
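The formulas for these quantities are not reproduced in this text (they are presumably rendered as figures in the original filing); for reference, the standard SIFT definitions they would ordinarily correspond to are, as a reconstruction rather than the patent's own rendering:

```latex
% Standard SIFT forms (reconstruction): Gaussian kernel, gradient magnitude,
% and gradient orientation at a keypoint.
G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}}\, e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}}
m(x, y) = \sqrt{\bigl(L(x{+}1, y) - L(x{-}1, y)\bigr)^{2} + \bigl(L(x, y{+}1) - L(x, y{-}1)\bigr)^{2}}
\theta(x, y) = \arctan\frac{L(x, y{+}1) - L(x, y{-}1)}{L(x{+}1, y) - L(x{-}1, y)}
```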
step 1-2d, calculating a neighborhood point k_i of the key point k:
where (x_k, y_k) denotes the orientation of the key point, and the other symbols have the same meanings as above.
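The embodiment below names SIFT keypoint matching followed by RANSAC for the random-subset iteration. A minimal sketch of how steps 1-2a to 1-2d could be realized with OpenCV follows; the ratio-test threshold and reprojection tolerance are illustrative assumptions, not values from the patent:

```python
# Sketch of SIFT keypoint matching + RANSAC alignment (steps 1-2a..1-2d);
# parameter values are illustrative assumptions.
import cv2
import numpy as np

def align_pair(img_ref, img_mov):
    """Warp img_mov onto img_ref using SIFT keypoints and a RANSAC homography."""
    g_ref = cv2.cvtColor(img_ref, cv2.COLOR_BGR2GRAY)
    g_mov = cv2.cvtColor(img_mov, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(g_ref, None)  # corners, edges, blob-like points
    kp2, des2 = sift.detectAndCompute(g_mov, None)

    # Keep only unambiguous matches (Lowe's ratio test).
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    src = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)

    # RANSAC repeatedly fits a homography on random subsets of the matches,
    # rejecting outliers caused by hand shake between frames.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    h, w = g_ref.shape
    return cv2.warpPerspective(img_mov, H, (w, h))
```

The non-overlapping border left after warping can then be cropped away, as step 1-2 requires.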
In a further embodiment, step 2 further comprises:
step 2-1, establishing the SNN super night scene network model, which comprises at least one Encoder network and at least one Decoder network, each comprising a plurality of layers; the Encoder network performs several down-sampling steps; each layer has at least two 3×3 convolutions, each convolution followed by an activation function and a Switchable Normalization layer, and ends with a 2×2 max-pooling operation, i.e., down-sampling with a stride of 2; this is repeated three times across the Encoder network;
each step in the Decoder network comprises up-sampling the feature map, applying a 3×3 convolution that halves the number of feature channels, concatenating with the corresponding feature map from the Encoder network, and then passing the concatenated feature map through two 3×3 convolutions, each followed by an activation function and a Switchable Normalization layer; the last layer uses a 3×3 convolution layer, and the processed image is finally output through pixel_shuffle;
step 2-2, proposing a Residual Dense Block and placing it on a skip connection; the Residual Dense Block consists of 3 Dense Blocks, each containing 5 convolution layers; each convolution layer is followed by an activation function and a Switchable Normalization layer, and each layer receives the output feature maps of all preceding convolution layers.
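A condensed PyTorch sketch of this Encoder-Decoder is given below. The two-convolutions-per-level pattern, 2×2 max-pooling, channel doubling, three repeats, and pixel_shuffle output all follow the text; the channel widths and the 4-channel packed-RAW input are assumptions of the sketch, and nn.BatchNorm2d stands in for the Switchable Normalization layer the text names. The Residual Dense Block of step 2-2, which sits on the skip connections, is sketched separately in the detailed description further below.

```python
# Illustrative sketch of the SNN Encoder-Decoder of step 2 (not the exact
# patented network); BatchNorm2d is a stand-in for Switchable Normalization.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # two 3x3 convolutions, each followed by activation + normalization
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.LeakyReLU(0.2), nn.BatchNorm2d(cout),
        nn.Conv2d(cout, cout, 3, padding=1), nn.LeakyReLU(0.2), nn.BatchNorm2d(cout),
    )

class SNN(nn.Module):
    def __init__(self, cin=4, base=32):            # 4-channel packed Bayer RAW (assumed)
        super().__init__()
        widths = [base, base * 2, base * 4]        # channels double at each down-sampling
        self.pool = nn.MaxPool2d(2)                # 2x2 max-pooling, stride 2
        self.enc = nn.ModuleList([conv_block(c_in, c_out)
                                  for c_in, c_out in zip([cin] + widths[:-1], widths)])
        self.mid = conv_block(widths[-1], widths[-1] * 2)
        self.up = nn.ModuleList([nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                                               nn.Conv2d(c * 2, c, 3, padding=1))  # halves channels
                                 for c in reversed(widths)])
        self.dec = nn.ModuleList([conv_block(c * 2, c) for c in reversed(widths)])
        self.head = nn.Conv2d(base, 3 * 4, 3, padding=1)   # final 3x3 convolution
        self.shuffle = nn.PixelShuffle(2)                   # pixel_shuffle output

    def forward(self, x):                          # H, W must be divisible by 8
        skips = []
        for enc in self.enc:
            x = enc(x)
            skips.append(x)        # a Residual Dense Block would process this skip
            x = self.pool(x)
        x = self.mid(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([x, skip], dim=1))
        return self.shuffle(self.head(x))          # RGB at 2x the packed-RAW resolution
```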
In a further embodiment, during the training of the super night scene network model in step 3, the training sample set is divided into a low-resolution training set and high-resolution image blocks;
the low-resolution training set is obtained as follows: first, the high-resolution images are down-sampled by a factor of N to obtain low-resolution images; the low-resolution images are then augmented, and each one is sampled with overlapping windows to obtain a group of overlapping low-resolution image blocks, which is used as the low-resolution training set;
the high-resolution image blocks are obtained as follows:
the high-resolution images corresponding to the N-times down-sampling operation are sampled with the same overlapping windows, and the resulting group of correspondingly overlapping high-resolution image blocks is taken as the high-resolution label images; N is a positive integer;
the augmentation applied to the low-resolution images consists of rotations by 90°, 180° and 270°, yielding low-resolution images at different angles;
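A sketch of this training-pair preparation, assuming NumPy arrays, is given below; the patch size, stride, and scale are illustrative values, and the down-sampling filter (plain striding here) is not fixed by the patent. Both the low-resolution image and its high-resolution counterpart are rotated together so the label blocks stay aligned.

```python
# Sketch of low-/high-resolution pair preparation with rotation augmentation
# and overlapping block sampling; all numeric parameters are illustrative.
import numpy as np

def extract_patches(img, size=64, stride=48):
    """Overlapping sampling: a stride smaller than the block size makes
    neighbouring blocks overlap."""
    H, W = img.shape[:2]
    return [img[y:y + size, x:x + size]
            for y in range(0, H - size + 1, stride)
            for x in range(0, W - size + 1, stride)]

def make_pairs(hr, scale=4, size=64, stride=48):
    lr = hr[::scale, ::scale]                      # N-times down-sampling (N = scale)
    pairs = []
    for k in range(4):                             # 0/90/180/270-degree rotations
        lr_rot, hr_rot = np.rot90(lr, k), np.rot90(hr, k)
        lr_blocks = extract_patches(lr_rot, size, stride)
        hr_blocks = extract_patches(hr_rot, size * scale, stride * scale)  # label blocks
        pairs.extend(zip(lr_blocks, hr_blocks))
    return pairs
```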
a training convolutional network is then constructed:
firstly, a low-resolution image LR is taken as input and shallow features are extracted by a convolution layer; deep features of the image are then learned through several stacked CACBs; finally, the extracted shallow and deep features are fused, and a high-resolution image is obtained by up-sampling with sub-pixel convolution;
wherein the CACB module consists of four fused convolution layers, and one quarter of the features of each fused convolution layer is reserved for the final feature fusion; the structure of the fused convolution layer in this module differs between a training phase and a deployment phase;
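The sub-pixel-convolution up-sampling at the end of this network might look as follows in PyTorch: a convolution expands the channel count by r², and nn.PixelShuffle rearranges those channels into an r-times larger image. The channel counts here are illustrative assumptions.

```python
# Sub-pixel convolution up-sampling sketch; 64 input channels are assumed.
import torch
import torch.nn as nn

r = 4                                        # up-scaling factor
upsample = nn.Sequential(
    nn.Conv2d(64, 3 * r * r, 3, padding=1),  # 64 feature channels -> 3*r*r
    nn.PixelShuffle(r),                      # (B, 3*r*r, H, W) -> (B, 3, H*r, W*r)
)
x = torch.randn(1, 64, 16, 16)               # fused shallow + deep features
print(upsample(x).shape)                     # torch.Size([1, 3, 64, 64])
```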
the loss function used during the training process is:
L_total = 0.5·L_1 + 0.05·L_SSIM + 0.1·L_VGG + L_adv
where L_1 is the mean absolute error, L_SSIM is the structural-similarity loss, L_VGG is the perceptual loss, and L_adv is the adversarial loss;
where F(·) is the feature map output by layer 34 of a VGG19 network pre-trained on ImageNet, G(I)_{i,j,k} is the picture generated by the generator, C_{i,j,k} is the corresponding original picture with the bokeh effect, and D(·) is the output of the discriminator.
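As an illustration, the composite loss might be assembled as follows in PyTorch. The 0.5/0.05/0.1/1 weights come from the formula above; the VGG19 slice, the SSIM handling, and the adversarial form are common choices assumed by this sketch, not necessarily the patent's exact ones.

```python
# Hedged sketch of L_total = 0.5*L1 + 0.05*L_SSIM + 0.1*L_VGG + L_adv.
import torch
import torch.nn as nn
import torchvision

class CompositeLoss(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features
        self.vgg = vgg[:35].eval()            # features up to (roughly) layer 34
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.l1 = nn.L1Loss()                 # mean absolute error

    def forward(self, fake, real, d_fake, ssim_value):
        # d_fake: discriminator output on fake images, assumed in (0, 1);
        # ssim_value: SSIM(fake, real) from any SSIM routine.
        l1 = self.l1(fake, real)
        l_ssim = 1.0 - ssim_value
        l_vgg = self.l1(self.vgg(fake), self.vgg(real))    # perceptual loss
        l_adv = -torch.log(d_fake + 1e-8).mean()           # generator GAN loss
        return 0.5 * l1 + 0.05 * l_ssim + 0.1 * l_vgg + l_adv
```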
In a further embodiment, step 4 further comprises acquiring an image with an image acquisition sensor, processing it with the super night scene network model trained in step 3, and finally outputting the corrected night scene image; bokeh rendering is performed before the night image is output, and the model that produces the bokeh-effect picture can be constructed as follows:
where I_bokeh denotes the finally obtained image, I_org denotes the original image, ⊙ denotes element-wise multiplication of matrices, B_i(·) is the i-th order blur function, and W_i denotes the feature-weight matrix of the i-th layer of the image; the i-th order blur function B_i(·) is obtained by i iterations of a shallow blurring neural network, which is expressed as:
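The rendering formula itself is omitted from this text; a form consistent with the symbol definitions above, offered only as a guess at the intended expression rather than the patent's own formula, would be:

```latex
% Speculative reconstruction from the stated symbols; f is the shallow
% blurring network, applied i times to form B_i.
I_{\mathrm{bokeh}} = I_{\mathrm{org}} + \sum_{i} W_{i} \odot B_{i}\!\left(I_{\mathrm{org}}\right),
\qquad
B_{i}(\cdot) = \underbrace{f\bigl(f\bigl(\cdots f(\cdot)\bigr)\bigr)}_{i\ \text{iterations}}
```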
the loss function l combines a reconstruction term with the structural similarity SSIM, and the model is optimized by back-propagating the error value; the reconstruction term l_1 is specifically:
where I_bokeh denotes the bokeh-effect image generated by the model and the reference is the original image with the actual bokeh effect; the structural similarity between the generated image I_bokeh and the actual image is computed as follows, where α, β and γ are preset constants weighting, respectively, the luminance relationship, the contrast relationship, and the structural relationship between the generated image I_bokeh and the actual image.
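The SSIM expression referred to above is omitted from the text; the standard decomposition it describes, with x the generated image and y the actual bokeh image, is:

```latex
% Standard SSIM decomposition; C1, C2, C3 are small stabilizing constants,
% mu are mean luminances, sigma standard deviations, sigma_xy the covariance.
\mathrm{SSIM}(x, y) = l(x, y)^{\alpha}\, c(x, y)^{\beta}\, s(x, y)^{\gamma}
l(x, y) = \frac{2\mu_{x}\mu_{y} + C_{1}}{\mu_{x}^{2} + \mu_{y}^{2} + C_{1}},\qquad
c(x, y) = \frac{2\sigma_{x}\sigma_{y} + C_{2}}{\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2}},\qquad
s(x, y) = \frac{\sigma_{xy} + C_{3}}{\sigma_{x}\sigma_{y} + C_{3}}
```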
An image night scene processing system based on a convolutional neural network comprises a first module for acquiring a plurality of groups of RAW format data samples; a second module for establishing a super night scene network model; a third module for training the super night scene network model; and a fourth module for performing bokeh rendering on the night scene image before output.
The first module further shoots different scenes with a plurality of preset image acquisition devices of different models to obtain a plurality of RAW format data samples, takes the RAW format data samples acquired by the different models of image acquisition device in the same scene as a group of mother samples, divides each mother sample into different sub-samples according to the model of the image acquisition device, and labels each sample;
after the data samples are acquired, the images are aligned and their non-overlapping parts removed; aligning the images comprises matching key points of the images and, on that basis, iterating repeatedly over a plurality of random subsets;
wherein the key points of the images are matched as follows:
searching all image positions over a preset scale space, and extracting key points, including corner points, edge points, bright points in dark areas, and dark points in bright areas, through convolution operations; the scale space L(x, y, σ) is computed as:
L(x,y,σ)=G(x,y,σ)·C(x,y)
in the formula, C(x, y) denotes the coordinates of the key point, G(x, y, σ) denotes the Gaussian kernel function, and σ is the scale-space factor, taken as a fixed value;
wherein the Gaussian kernel function is expressed as follows:
where each symbol has the same meaning as above;
collecting the gradient magnitude of each key point:
collecting the orientation distribution of each key point:
where each symbol has the same meaning as above;
calculating a neighborhood point k_i of the key point k:
where (x_k, y_k) denotes the orientation of the key point, and the other symbols have the same meanings as above;
dividing the data sample into a training set, a verification set and a test set;
the second module is further used for establishing the SNN super night scene network model, which comprises at least one Encoder network and at least one Decoder network, each comprising a plurality of layers; the Encoder network performs several down-sampling steps; each layer has at least two 3×3 convolutions, each convolution followed by an activation function and a Switchable Normalization layer, and ends with a 2×2 max-pooling operation, i.e., down-sampling with a stride of 2; this is repeated three times across the Encoder network;
each step in the Decoder network comprises up-sampling the feature map, applying a 3×3 convolution that halves the number of feature channels, concatenating with the corresponding feature map from the Encoder network, and then passing the concatenated feature map through two 3×3 convolutions, each followed by an activation function and a Switchable Normalization layer; the last layer uses a 3×3 convolution layer, and the processed image is finally output through pixel_shuffle;
a Residual Dense Block is proposed and placed on a skip connection; the Residual Dense Block consists of 3 Dense Blocks, each containing 5 convolution layers; each convolution layer is followed by an activation function and a Switchable Normalization layer, and each layer receives the output feature maps of all preceding convolution layers;
the third module is further used for dividing the training sample set into a low-resolution training set and high-resolution image blocks;
the low-resolution training set is obtained as follows: first, the high-resolution images are down-sampled by a factor of N to obtain low-resolution images; the low-resolution images are then augmented, and each one is sampled with overlapping windows to obtain a group of overlapping low-resolution image blocks, which is used as the low-resolution training set;
the high-resolution image blocks are obtained as follows:
the high-resolution images corresponding to the N-times down-sampling operation are sampled with the same overlapping windows, and the resulting group of correspondingly overlapping high-resolution image blocks is taken as the high-resolution label images; N is a positive integer;
the augmentation applied to the low-resolution images consists of rotations by 90°, 180° and 270°, yielding low-resolution images at different angles;
a training convolutional network is then constructed:
firstly, a low-resolution image LR is taken as input and shallow features are extracted by a convolution layer; deep features of the image are then learned through several stacked CACBs; finally, the extracted shallow and deep features are fused, and a high-resolution image is obtained by up-sampling with sub-pixel convolution;
wherein the CACB module consists of four fused convolution layers, and one quarter of the features of each fused convolution layer is reserved for the final feature fusion; the structure of the fused convolution layer in this module differs between a training phase and a deployment phase;
the loss function used during the training process is:
L_total = 0.5·L_1 + 0.05·L_SSIM + 0.1·L_VGG + L_adv
where L_1 is the mean absolute error, L_SSIM is the structural-similarity loss, L_VGG is the perceptual loss, and L_adv is the adversarial loss;
where F(·) is the feature map output by layer 34 of a VGG19 network pre-trained on ImageNet, G(I)_{i,j,k} is the picture generated by the generator, C_{i,j,k} is the corresponding original picture with the bokeh effect, and D(·) is the output of the discriminator.
The fourth module is further used for constructing the model that produces the bokeh-effect picture:
where I_bokeh denotes the finally obtained image, I_org denotes the original image, ⊙ denotes element-wise multiplication of matrices, B_i(·) is the i-th order blur function, and W_i denotes the feature-weight matrix of the i-th layer of the image; the i-th order blur function B_i(·) is obtained by i iterations of a shallow blurring neural network, which is expressed as:
the loss function l combines a reconstruction term with the structural similarity SSIM, and the model is optimized by back-propagating the error value; the reconstruction term l_1 is specifically:
where I_bokeh denotes the bokeh-effect image generated by the model and the reference is the original image with the actual bokeh effect; the structural similarity between the generated image I_bokeh and the actual image is computed as follows, where α, β and γ are preset constants weighting, respectively, the luminance relationship, the contrast relationship, and the structural relationship between the generated image I_bokeh and the actual image.
A computing module comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program being stored on a computer-readable storage medium; when the processor executes the computer program read from the storage medium, the computing module implements the following steps:
Step 1, data acquisition. Night scenes are shot with several low-end mobile phones and the RAW format data is extracted; at the same time, long-exposure night scene RGB pictures are shot with a DSLR, for about 100,000 pairs in total. After the data set is collected, the SIFT keypoint matching algorithm and the RANSAC algorithm are used for alignment, and the non-overlapping parts of the images are removed. After matching is completed, the data is divided into a training set, a validation set and a test set.
Step 1-1, shooting different scenes with a plurality of preset image acquisition devices of different models to obtain a plurality of RAW format data samples; taking the RAW format data samples acquired by the different models of image acquisition device in the same scene as a group of mother samples; dividing each mother sample into different sub-samples according to the model of the image acquisition device; and labeling each sample;
step 1-2, after data sample collection is completed, the images are aligned and their non-overlapping parts removed; aligning the images in step 1-2 further comprises matching key points of the images and, on that basis, iterating repeatedly over a plurality of random subsets;
wherein the key points of the images are matched as follows:
step 1-2a, searching all image positions over a preset scale space, and extracting key points, including corner points, edge points, bright points in dark areas, and dark points in bright areas, through convolution operations; the scale space L(x, y, σ) is computed as:
L(x,y,σ)=G(x,y,σ)·C(x,y)
in the formula, C(x, y) denotes the coordinates of the key point, G(x, y, σ) denotes the Gaussian kernel function, and σ is the scale-space factor, taken as a fixed value;
wherein the Gaussian kernel function is expressed as follows:
where each symbol has the same meaning as above;
step 1-2b, collecting the gradient magnitude of each key point:
step 1-2c, collecting the orientation distribution of each key point:
where each symbol has the same meaning as above;
step 1-2d, calculating a neighborhood point k_i of the key point k:
where (x_k, y_k) denotes the orientation of the key point, and the other symbols have the same meanings as above.
Step 1-3, after the image alignment operation of step 1-2, the data samples are further divided into a training set, a validation set and a test set.
Step 2, model design. Referring to fig. 1, with details in fig. 2: we first propose a Super Night Network (SNN) whose main body is an Encoder-Decoder structure, consisting of an Encoder (left side) and a Decoder (right side). As in a common classification network, the Encoder down-samples several times, so the feature maps shrink spatially while gaining channels. Each layer contains two 3×3 convolutions, each followed by an activation function (LeakyReLU) and a Switchable Normalization layer, and ends with a 2×2 max-pooling operation, i.e., down-sampling with a stride of 2. The number of feature-map channels doubles at each down-sampling step. This is repeated three times across the Encoder.
Each step in the Decoder consists of up-sampling the feature map (nearest-neighbor interpolation is used here), a 3×3 convolution that halves the number of feature channels, concatenation with the corresponding feature map (after special processing) from the Encoder, and two 3×3 convolutions on the concatenated feature map, each again followed by a LeakyReLU activation and a Switchable Normalization layer. The last layer uses only one 3×3 convolution layer, and the processed image is output through pixel_shuffle.
To obtain more information from the RAW data, a skip-connection structure is used, and a Residual Dense Block is proposed and placed on the skip connection. The Residual Dense Block consists of 3 Dense Blocks; each Dense Block contains 5 convolution layers, each followed by an activation function (LeakyReLU) and a Switchable Normalization layer, and each layer receives the output feature maps of all preceding convolution layers via serial concatenation. In addition, to extract more useful information, we add a channel attention module after each Dense Block. It consists of an average-pooling layer, two 3×3 convolution layers, and nonlinear ReLU and Sigmoid transformations, connected as shown in fig. 3.
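A matching sketch of the Dense Block, channel attention module, and Residual Dense Block just described, under the same assumptions as the earlier SNN sketch (PyTorch, BatchNorm2d standing in for Switchable Normalization, illustrative growth rate), might look as follows:

```python
# Illustrative Residual Dense Block with channel attention; growth rate,
# fusion convolution, and the residual placement are assumptions.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """5 conv layers; each receives the concatenation of all earlier outputs."""
    def __init__(self, c, growth=16):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c + i * growth, growth, 3, padding=1),
                          nn.LeakyReLU(0.2), nn.BatchNorm2d(growth))
            for i in range(5)])
        self.fuse = nn.Conv2d(c + 5 * growth, c, 1)  # back to c channels (assumed)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # serial concatenation
        return self.fuse(torch.cat(feats, dim=1))

class ChannelAttention(nn.Module):
    """Average pooling -> two 3x3 convs with ReLU/Sigmoid, as in the text."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # average-pooling layer
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.body(x)                      # reweight the channels

class ResidualDenseBlock(nn.Module):
    """3 Dense Blocks on the skip connection, each followed by attention."""
    def __init__(self, c):
        super().__init__()
        self.blocks = nn.Sequential(*[nn.Sequential(DenseBlock(c), ChannelAttention(c))
                                      for _ in range(3)])

    def forward(self, x):
        return x + self.blocks(x)                    # residual over the stack (assumed)
```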
Step 3, model training. Based on the model and data set above, distributed training is used to train the model quickly; the process takes only 2.5 hours. During the training of the super night scene network model in step 3, the training sample set is divided into a low-resolution training set and high-resolution image blocks;
the low-resolution training set is obtained as follows: first, the high-resolution images are down-sampled by a factor of N to obtain low-resolution images; the low-resolution images are then augmented, and each one is sampled with overlapping windows to obtain a group of overlapping low-resolution image blocks, which is used as the low-resolution training set;
the high-resolution image blocks are obtained as follows:
the high-resolution images corresponding to the N-times down-sampling operation are sampled with the same overlapping windows, and the resulting group of correspondingly overlapping high-resolution image blocks is taken as the high-resolution label images; N is a positive integer;
the augmentation applied to the low-resolution images consists of rotations by 90°, 180° and 270°, yielding low-resolution images at different angles;
a training convolutional network is then constructed:
firstly, a low-resolution image LR is taken as input and shallow features are extracted by a convolution layer; deep features of the image are then learned through several stacked CACBs; finally, the extracted shallow and deep features are fused, and a high-resolution image is obtained by up-sampling with sub-pixel convolution;
wherein the CACB module consists of four fused convolution layers, and one quarter of the features of each fused convolution layer is reserved for the final feature fusion; the structure of the fused convolution layer in this module differs between a training phase and a deployment phase;
the loss function used during the training process is:
L_total = 0.5·L_1 + 0.05·L_SSIM + 0.1·L_VGG + L_adv
where L_1 is the mean absolute error, L_SSIM is the structural-similarity loss, L_VGG is the perceptual loss, and L_adv is the adversarial loss;
where F(·) is the feature map output by layer 34 of a VGG19 network pre-trained on ImageNet, G(I)_{i,j,k} is the picture generated by the generator, C_{i,j,k} is the corresponding original picture with the bokeh effect, and D(·) is the output of the discriminator.
Step 4, outputting the result. Before the result is output, the image passes through a preset bokeh rendering model, constructed as follows:
where I_bokeh denotes the finally obtained image, I_org denotes the original image, ⊙ denotes element-wise multiplication of matrices, B_i(·) is the i-th order blur function, and W_i denotes the feature-weight matrix of the i-th layer of the image; the i-th order blur function B_i(·) is obtained by i iterations of a shallow blurring neural network, which is expressed as:
the loss function l combines a reconstruction term with the structural similarity SSIM, and the model is optimized by back-propagating the error value; the reconstruction term l_1 is specifically:
where I_bokeh denotes the bokeh-effect image generated by the model and the reference is the original image with the actual bokeh effect; the structural similarity between the generated image I_bokeh and the actual image is computed as follows, where α, β and γ are preset constants weighting, respectively, the luminance relationship, the contrast relationship, and the structural relationship between the generated image I_bokeh and the actual image.
A computing-module-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the following process:
Step 1, data acquisition. Night scenes are shot with several low-end mobile phones and the RAW format data is extracted; at the same time, long-exposure night scene RGB pictures are shot with a DSLR, for about 100,000 pairs in total. After the data set is collected, the SIFT keypoint matching algorithm and the RANSAC algorithm are used for alignment, and the non-overlapping parts of the images are removed. After matching is completed, the data is divided into a training set, a validation set and a test set.
Step 1-1, shooting different scenes with a plurality of preset image acquisition devices of different models to obtain a plurality of RAW format data samples; taking the RAW format data samples acquired by the different models of image acquisition device in the same scene as a group of mother samples; dividing each mother sample into different sub-samples according to the model of the image acquisition device; and labeling each sample;
step 1-2, after data sample collection is completed, the images are aligned and their non-overlapping parts removed; aligning the images in step 1-2 further comprises matching key points of the images and, on that basis, iterating repeatedly over a plurality of random subsets;
wherein the key points of the images are matched as follows:
step 1-2a, searching all image positions over a preset scale space, and extracting key points, including corner points, edge points, bright points in dark areas, and dark points in bright areas, through convolution operations; the scale space L(x, y, σ) is computed as:
L(x,y,σ)=G(x,y,σ)·C(x,y)
in the formula, C(x, y) denotes the coordinates of the key point, G(x, y, σ) denotes the Gaussian kernel function, and σ is the scale-space factor, taken as a fixed value;
wherein the Gaussian kernel function is expressed as follows:
where each symbol has the same meaning as above;
step 1-2b, collecting the gradient magnitude of each key point:
step 1-2c, collecting the orientation distribution of each key point:
where each symbol has the same meaning as above;
step 1-2d, calculating a neighborhood point k_i of the key point k:
where (x_k, y_k) denotes the orientation of the key point, and the other symbols have the same meanings as above.
Step 1-3, after the image alignment operation of step 1-2, the data samples are further divided into a training set, a validation set and a test set.
Step 2, model design. Referring to fig. 1, with details in fig. 2: we first propose a Super Night Network (SNN) whose main body is an Encoder-Decoder structure, consisting of an Encoder (left side) and a Decoder (right side). As in a common classification network, the Encoder down-samples several times, so the feature maps shrink spatially while gaining channels. Each layer contains two 3×3 convolutions, each followed by an activation function (LeakyReLU) and a Switchable Normalization layer, and ends with a 2×2 max-pooling operation, i.e., down-sampling with a stride of 2. The number of feature-map channels doubles at each down-sampling step. This is repeated three times across the Encoder.
Each step in the Decoder consists of up-sampling the feature map (nearest-neighbor interpolation is used here), a 3×3 convolution that halves the number of feature channels, concatenation with the corresponding feature map (after special processing) from the Encoder, and two 3×3 convolutions on the concatenated feature map, each again followed by a LeakyReLU activation and a Switchable Normalization layer. The last layer uses only one 3×3 convolution layer, and the processed image is output through pixel_shuffle.
To obtain more information from the RAW data, a skip-connection structure is used, and a Residual Dense Block is proposed and placed on the skip connection. The Residual Dense Block consists of 3 Dense Blocks; each Dense Block contains 5 convolution layers, each followed by an activation function (LeakyReLU) and a Switchable Normalization layer, and each layer receives the output feature maps of all preceding convolution layers via serial concatenation. In addition, to extract more useful information, we add a channel attention module after each Dense Block. It consists of an average-pooling layer, two 3×3 convolution layers, and nonlinear ReLU and Sigmoid transformations, connected as shown in fig. 3.
Step 3, model training. Based on the model and data set above, distributed training is used to train the model quickly; the process takes only 2.5 hours. During the training of the super night scene network model in step 3, the training sample set is divided into a low-resolution training set and high-resolution image blocks;
the low-resolution training set is obtained as follows: first, the high-resolution images are down-sampled by a factor of N to obtain low-resolution images; the low-resolution images are then augmented, and each one is sampled with overlapping windows to obtain a group of overlapping low-resolution image blocks, which is used as the low-resolution training set;
the high-resolution image blocks are obtained as follows:
the high-resolution images corresponding to the N-times down-sampling operation are sampled with the same overlapping windows, and the resulting group of correspondingly overlapping high-resolution image blocks is taken as the high-resolution label images; N is a positive integer;
the augmentation applied to the low-resolution images consists of rotations by 90°, 180° and 270°, yielding low-resolution images at different angles;
a training convolutional network is then constructed:
firstly, a low-resolution image LR is taken as input and shallow features are extracted by a convolution layer; deep features of the image are then learned through several stacked CACBs; finally, the extracted shallow and deep features are fused, and a high-resolution image is obtained by up-sampling with sub-pixel convolution;
wherein the CACB module consists of four fused convolution layers, and one quarter of the features of each fused convolution layer is reserved for the final feature fusion; the structure of the fused convolution layer in this module differs between a training phase and a deployment phase;
the loss function used during the training process is:
L_total = 0.5·L_1 + 0.05·L_SSIM + 0.1·L_VGG + L_adv
where L_1 is the mean absolute error, L_SSIM is the structural-similarity loss, L_VGG is the perceptual loss, and L_adv is the adversarial loss;
where F(·) is the feature map output by layer 34 of a VGG19 network pre-trained on ImageNet, G(I)_{i,j,k} is the picture generated by the generator, C_{i,j,k} is the corresponding original picture with the bokeh effect, and D(·) is the output of the discriminator.
Step 4, outputting the result.
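For orientation only, the four steps above might be wired together at inference time as in the following sketch; pack_raw, the snn model, and the bokeh model stand for the components described in this text, but every name, the RGGB Bayer layout, and the tensor conventions are assumptions of the sketch, not a published API.

```python
# Illustrative end-to-end inference wiring: RAW in, bokeh-rendered night
# image out. All names and layouts here are assumptions.
import numpy as np
import torch

def pack_raw(bayer: np.ndarray) -> np.ndarray:
    """Pack an H x W Bayer mosaic into 4 x H/2 x W/2 channels (RGGB assumed)."""
    return np.stack([bayer[0::2, 0::2], bayer[0::2, 1::2],
                     bayer[1::2, 0::2], bayer[1::2, 1::2]])

def run_pipeline(raw_bayer: np.ndarray, snn, bokeh) -> np.ndarray:
    x = torch.from_numpy(pack_raw(raw_bayer)).unsqueeze(0).float()
    with torch.no_grad():
        rgb = snn(x)        # trained super night scene model: RAW -> night image
        out = bokeh(rgb)    # bokeh rendering before output, per step 4
    return out.squeeze(0).permute(1, 2, 0).clamp(0, 1).numpy()
```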
Beneficial effects: the invention relates to an image night scene processing method based on a convolutional neural network, and further to a computing module capable of running the method and a storage medium readable by the computing module. By establishing and training a super night scene network model, a super night scene picture with excellent appearance can be obtained from nothing more than the RAW data taken from the camera CMOS; the image-shake and after-image problems caused by the long exposure of the traditional night scene function are avoided, as is the influence of shake and after-images when images are synthesized with AI techniques.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the invention.
The applicant believes that the current super night scene function works by taking, through long exposure, multiple pictures at different ISO values and exposures and then synthesizing them; however, the function is demanding, the exposure time is usually several seconds or more, and the requirements on phone hardware and software algorithms are therefore very high. Moreover, it is difficult to select high-quality pictures during a long exposure, because the hand shakes slightly and uncontrollably; if no algorithm is used to screen the frames, shaken pictures will be merged and the result degraded. Some mobile phone manufacturers use AI techniques to remove blurred photos, then automatically align the scene through system recognition, and finally synthesize the result.
To solve these problems, a convolutional neural network is used to implement the super night scene mode; the algorithm needs only the RAW data taken from the camera CMOS (complementary metal-oxide semiconductor) to produce a super night scene picture with excellent appearance. The specific algorithm flow is as follows:
step 1, data acquisition. And shooting night scenes by using a plurality of low-end mobile phones, taking out RAW format data, and simultaneously shooting long-exposure night scene RGB pictures by using single reflection, wherein the number of the long-exposure night scene RGB pictures is about 10 ten thousand pairs. After the data set is collected, an SIFT key point matching algorithm and an RANSAC algorithm are used for alignment operation, and non-overlapping parts of the SIFT key point matching algorithm and the RANSAC algorithm are removed. After matching is completed, the data is divided into a training set, a validation set and a test set.
Step 1-1, shooting different scenes through a plurality of preset image acquisition devices with different models to obtain a plurality of RAW format data samples, taking the RAW format data samples acquired by the image acquisition devices with different models in the same scene as a group of mother samples, distinguishing the mother samples into different sub-samples according to the models of the image acquisition devices, and marking each sample;
step 1-2, after data sample collection is completed, aligning the images, and removing non-overlapping parts of the images; the step 1-2 of aligning the images further comprises matching key points of the images, and repeatedly iterating for multiple times to obtain random subsets on the basis;
wherein the key points of the images are matched as follows:
step 1-2a, searching all image positions over a preset scale space, and extracting key points, including corner points, edge points, bright points in dark areas, and dark points in bright areas, through convolution operations; the scale space L(x, y, σ) is computed as:
L(x,y,σ)=G(x,y,σ)·C(x,y)
in the formula, C(x, y) denotes the coordinates of the key point, G(x, y, σ) denotes the Gaussian kernel function, and σ is the scale-space factor, taken as a fixed value;
wherein the Gaussian kernel function is expressed as follows:
where each symbol has the same meaning as above;
step 1-2b, collecting the gradient magnitude of each key point:
step 1-2c, collecting the orientation distribution of each key point:
where each symbol has the same meaning as above;
step 1-2d, calculating a neighborhood point k_i of the key point k:
where (x_k, y_k) denotes the orientation of the key point, and the other symbols have the same meanings as above.
Step 1-3, after the image alignment operation of step 1-2, the data samples are further divided into a training set, a validation set and a test set.
Step 2, model design. Referring to fig. 1 and, in detail, fig. 2: we first propose a Super Night Network (SNN) whose main body is the Encoder-Decoder structure shown in fig. 2, consisting of an Encoder (left side) and a Decoder (right side). As in a common classification network, the Encoder down-samples several times, so the feature maps shrink spatially while gaining channels. Each layer contains two 3×3 convolutions, each followed by an activation function (LeakyReLU) and a Switchable Normalization layer, and ends with a 2×2 max-pooling operation, i.e., down-sampling with a stride of 2. The number of feature-map channels doubles at each down-sampling step. This is repeated three times across the Encoder.
Each step in the Decoder consists of up-sampling the feature map (nearest-neighbor interpolation is used here), a 3×3 convolution that halves the number of feature channels, concatenation with the corresponding feature map (after special processing) from the Encoder, and two 3×3 convolutions on the concatenated feature map, each again followed by a LeakyReLU activation and a Switchable Normalization layer. The last layer uses only one 3×3 convolution layer, and the processed image is output through pixel_shuffle.
To obtain more information from the RAW data, a skip-connection structure is used, and a Residual Dense Block is proposed and placed on the skip connection. The Residual Dense Block consists of 3 Dense Blocks; each Dense Block contains 5 convolution layers, each followed by an activation function (LeakyReLU) and a Switchable Normalization layer, and each layer receives the output feature maps of all preceding convolution layers via serial concatenation. In addition, to extract more useful information, we add a channel attention module after each Dense Block. It consists of an average-pooling layer, two 3×3 convolution layers, and nonlinear ReLU and Sigmoid transformations, connected as shown in fig. 3.
Step 3, model training. Based on the model and data set above, distributed training is used to train the model quickly; the process takes only 2.5 hours. During the training of the super night scene network model in step 3, the training sample set is divided into a low-resolution training set and high-resolution image blocks;
the low-resolution training set is obtained as follows: first, the high-resolution images are down-sampled by a factor of N to obtain low-resolution images; the low-resolution images are then augmented, and each one is sampled with overlapping windows to obtain a group of overlapping low-resolution image blocks, which is used as the low-resolution training set;
the high-resolution image blocks are obtained as follows:
the high-resolution images corresponding to the N-times down-sampling operation are sampled with the same overlapping windows, and the resulting group of correspondingly overlapping high-resolution image blocks is taken as the high-resolution label images; N is a positive integer;
the augmentation applied to the low-resolution images consists of rotations by 90°, 180° and 270°, yielding low-resolution images at different angles;
a training convolutional network is then constructed:
firstly, a low-resolution image LR is taken as input and shallow features are extracted by a convolution layer; deep features of the image are then learned through several stacked CACBs; finally, the extracted shallow and deep features are fused, and a high-resolution image is obtained by up-sampling with sub-pixel convolution;
wherein the CACB module consists of four fused convolution layers, and one quarter of the features of each fused convolution layer is reserved for the final feature fusion; the structure of the fused convolution layer in this module differs between a training phase and a deployment phase;
the loss function used during the training process is:
L_total = 0.5·L_1 + 0.05·L_SSIM + 0.1·L_VGG + L_adv
where L_1 is the mean absolute error, L_SSIM is the structural-similarity loss, L_VGG is the perceptual loss, and L_adv is the adversarial loss;
where F(·) is the feature map output by layer 34 of a VGG19 network pre-trained on ImageNet, G(I)_{i,j,k} is the picture generated by the generator, C_{i,j,k} is the corresponding original picture with the bokeh effect, and D(·) is the output of the discriminator.
Step 4, outputting the result.
Before the result is output, the image is rendered through a preset bokeh model, constructed as follows:
where I_bokeh denotes the finally obtained image, I_org denotes the original image, ⊙ denotes element-wise multiplication of matrices, B_i(·) is the i-th order blur function, and W_i denotes the feature-weight matrix of the i-th layer of the image; the i-th order blur function B_i(·) is obtained by i iterations of a shallow blurring neural network, which is expressed as:
the loss function l combines a reconstruction term with the structural similarity SSIM, and the model is optimized by back-propagating the error value; the reconstruction term l_1 is specifically:
where I_bokeh denotes the bokeh-effect image generated by the model and the reference is the original image with the actual bokeh effect; the structural similarity between the generated image I_bokeh and the actual image is computed as follows, where α, β and γ are preset constants weighting, respectively, the luminance relationship, the contrast relationship, and the structural relationship between the generated image I_bokeh and the actual image.
In fig. 4, the left side is a night scene picture taken by a Redmi 8, and the right side is the night scene picture obtained by our network (both from the RAW data of the Redmi 8). Our result is richer in detail than the Redmi output, with soft colors that match human visual perception.
In conclusion, the above algorithm flow can effectively improve night-scene shooting on low-end mobile phones, and at a low cost: it is a super night scene technology that costs little for consumers who buy low-end phones, yet can also be used on high-end phones. In addition, the algorithm greatly reduces the hardware requirements of super night scene processing; paired with an Airia NPU chip, the cost is reduced further and the cost-performance ratio of the phone improves.
As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.