CN111563577B - Intrinsic image decomposition method based on Unet skip-layer frequency division and multi-scale discrimination - Google Patents
- Publication number: CN111563577B (application CN202010319106.3A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06N3/045: Combinations of networks
- G06N3/048: Activation functions
- G06N3/08: Learning methods
- G06T5/70: Denoising; Smoothing
- G06T2207/20081: Training; Learning
- G06T2207/20084: Artificial neural networks [ANN]
Abstract
The invention provides an intrinsic image decomposition method based on Unet skip-layer frequency division and multi-scale discrimination. A Unet-based generative adversarial network is constructed, consisting of a generator and a discriminator: the generator decomposes an image into a reflection map and an illumination map, and the discriminator judges whether a generated image is real, guiding the generator to produce images realistic enough to pass for real ones. The designed network effectively alleviates the problems caused by passing encoder features directly to the decoder. On the one hand, a frequency decomposition constraint added to the skip connections of the reflection-map Unet lets the network learn the importance of different features and obtain more suitable feature maps. On the other hand, adding frequency decomposition and channel compression to the skip connections of the illumination-map Unet not only yields more suitable feature maps but also resolves the problem of excess high-frequency components in the illumination map.
Description
Technical Field
The invention belongs to the field of image processing, and particularly relates to an intrinsic image decomposition method.
Background
Image recognition is now applied in many parts of daily life, such as face recognition, target tracking, and autonomous driving. During imaging, however, environmental factors such as illumination intensity, the incident angle of light, and shadow occlusion can degrade the image, making recognition harder and less accurate. One way to address this is to extract features that do not change with environmental factors, i.e., intrinsic images. An intrinsic image captures the inherent properties of an object independent of the environment, including color, texture, and material; these properties do not vary as the environment changes. If the intrinsic information of an object can be separated from the environment and the environment-dependent part of the image filtered out, a more accurate feature description of the object is obtained. Intrinsic image decomposition extracts these inherent features by decomposing an image into a reflection map carrying texture, color, and material, and an illumination map carrying shape and lighting information. Because the reflection map does not change with the environment, it can serve as input to other image understanding tasks, greatly reducing the difficulty of image analysis and making image understanding robust to illumination changes.
By algorithm type, intrinsic image decomposition methods fall into two classes: the first is based on Retinex theory, and the second on deep learning.
Intrinsic image decomposition based on Retinex theory decomposes an image according to its local gradient changes. Large gradient changes are assumed to be caused by different materials on the object's surface, i.e., differences in material reflectance produce large color changes and hence large gradients, while small gradient changes are attributed to illumination. The theory assumes that illumination varies slowly and uniformly, without abrupt changes. In reality, however, occlusion and shadows cause sudden illumination changes and thus large illumination gradients, so Retinex theory breaks down.
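The Retinex assumption described above can be sketched as a simple gradient-thresholding rule. The following toy example, a minimal NumPy sketch with an illustrative threshold value not taken from the patent, attributes large log-image gradients to reflectance and small ones to illumination:

```python
import numpy as np

def retinex_split(log_img, thresh=0.1):
    """Toy sketch of the classical Retinex assumption: large gradients of
    the log image are attributed to reflectance (material) changes, small
    gradients to slowly varying illumination. `thresh` is an illustrative
    threshold, not a value from the patent."""
    gx = np.diff(log_img, axis=1)
    gy = np.diff(log_img, axis=0)
    refl_gx = np.where(np.abs(gx) > thresh, gx, 0.0)   # reflectance gradients
    illum_gx = gx - refl_gx                            # illumination gradients
    refl_gy = np.where(np.abs(gy) > thresh, gy, 0.0)
    illum_gy = gy - refl_gy
    return (refl_gx, refl_gy), (illum_gx, illum_gy)
```

A shadow edge produces a large gradient that this rule misclassifies as reflectance, which is exactly the failure mode described above.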
Intrinsic image decomposition methods based on deep learning largely avoid these problems, but their network designs ignore the frequency properties of the intrinsic images. On the one hand, the feature maps that the encoder passes to the reflection-map decoder through skip connections are not combined well, and some high-frequency features have an outsized influence on the reflection map. On the other hand, the illumination map obtained after decomposition should contain few high-frequency components, yet the encoder transmits them directly through the skip connections, leaving considerable high-frequency noise in the illumination map.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an intrinsic image decomposition method based on Unet skip-layer frequency division and multi-scale discrimination. A Unet-based generative adversarial network is constructed, consisting of a generator and a discriminator: the generator decomposes an image into a reflection map and an illumination map, and the discriminator judges whether a generated image is real, guiding the generator to produce images realistic enough to pass for real ones. The designed network effectively alleviates the problems caused by passing encoder features directly to the decoder. On the one hand, a frequency decomposition constraint added to the skip connections of the reflection-map Unet lets the network learn the importance of different features and obtain more suitable feature maps. On the other hand, adding frequency decomposition and channel compression to the skip connections of the illumination-map Unet not only yields more suitable feature maps but also resolves the problem of excess high-frequency components in the illumination map.
In order to achieve this purpose, the invention provides an intrinsic image decomposition method based on Unet skip-layer frequency division and multi-scale discrimination, comprising the following steps:
Step 1: Construct a training image sample library
Randomly select B images from a test image data set; from each image randomly sample M patches of size N×N and horizontally flip them to obtain M new patches, giving 2·M patches per image; the 2·M·B patches obtained from the B images form the training image sample library;
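A minimal sketch of this sample-library construction, assuming NumPy arrays for images (function and parameter names are illustrative, not from the patent):

```python
import numpy as np

def build_sample_library(images, m=10, n=256, seed=0):
    """Sketch of step 1: from each image, sample m random n-by-n patches
    and add their horizontal flips, yielding 2*m patches per image.
    (Function and parameter names are illustrative, not from the patent.)"""
    rng = np.random.default_rng(seed)
    library = []
    for img in images:  # img: (H, W, C) array with H, W >= n
        h, w = img.shape[:2]
        for _ in range(m):
            top = rng.integers(0, h - n + 1)
            left = rng.integers(0, w - n + 1)
            patch = img[top:top + n, left:left + n]
            library.append(patch)
            library.append(patch[:, ::-1])  # horizontal flip
    return library
```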
Step 2: Construct the generators
Step 2-1: Construct the reflection map generator
In the Unet network, a frequency decomposition submodule is added to each skip layer. The input of the submodule is the feature map output by the Unet encoder; its output is the new, frequency-decomposed feature map, which is fed to the Unet decoder. The resulting Unet network is the constructed reflection map generator: an image is input and its reflection map is output;
The frequency decomposition submodule performs the following process:
Define the feature map of the i-th layer of the Unet encoder as $F_i \in \mathbb{R}^{c\times h\times w}$, where $c$ is the number of channels, $h$ the height, and $w$ the width of the feature map. The feature map is first subjected to global max pooling:
$$F_i^{gp} = \mathrm{GlobalMaxPool}(F_i)\in\mathbb{R}^{c\times 1\times 1} \tag{1}$$
where $F_i^{gp}$ denotes the feature map obtained after the global max pooling operation;
the result of formula (1) is then passed through a fully connected layer $FC_1$, which compresses the number of channels of the feature map:
$$F_i^{fc_1} = FC_1(F_i^{gp})\in\mathbb{R}^{(c/r)\times 1\times 1} \tag{2}$$
where $r$ is the channel reduction ratio; the result of formula (2) then passes through a ReLU activation layer:
$$F_i^{relu} = \mathrm{ReLU}(F_i^{fc_1}) \tag{3}$$
where $F_i^{relu}$ is the feature map obtained after the ReLU activation layer;
the result of formula (3) is passed through a fully connected layer $FC_2$, which restores the original channel count of the feature map:
$$F_i^{fc_2} = FC_2(F_i^{relu})\in\mathbb{R}^{c\times 1\times 1} \tag{4}$$
then the result of formula (4) is passed through a sigmoid activation layer to obtain the normalized weight parameters:
$$W_i = \mathrm{sigmoid}(F_i^{fc_2}) \tag{5}$$
finally, the normalized weight parameters $W_i$ are multiplied channel-wise with the i-th layer feature map $F_i$ to obtain the new feature map after frequency decomposition:
$$\tilde F_i = W_i \otimes F_i \tag{6}$$
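The pipeline of formulas (1) through (6) can be sketched as a small NumPy function; the weight matrices stand in for the learned fully connected layers, and the symbol names and reduction ratio are illustrative assumptions:

```python
import numpy as np

def frequency_decompose(F, w1, w2):
    """Channel-attention-style frequency decomposition on an encoder
    feature map F of shape (c, h, w), following formulas (1)-(6).
    w1: (c//r, c) weights of FC1 (channel compression),
    w2: (c, c//r) weights of FC2 (channel restoration).
    Weight matrices and the reduction ratio r are illustrative."""
    c, h, w = F.shape
    gp = F.reshape(c, -1).max(axis=1)       # (1) global max pooling -> (c,)
    fc1 = w1 @ gp                           # (2) FC1 compresses channels
    relu = np.maximum(fc1, 0.0)             # (3) ReLU activation
    fc2 = w2 @ relu                         # (4) FC2 restores channels
    W = 1.0 / (1.0 + np.exp(-fc2))          # (5) sigmoid -> per-channel weights
    return W[:, None, None] * F             # (6) reweight the feature map
```

Because the sigmoid weights lie in (0, 1), each channel of the output is a damped copy of the input channel, which is how the module suppresses less important (e.g. overly high-frequency) features.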
step 2-2: structured illumination pattern generator
In the Unet network, a frequency decomposition submodule and a channel compression submodule are added to each skip layer. The input of the frequency decomposition submodule is the feature map output by the Unet encoder; it performs frequency decomposition according to formulas (1) to (6) of step 2-1 and passes the new feature map to the channel compression submodule. The channel compression submodule of each skip layer sets a different compression ratio according to the layer's position in the Unet encoder, performs channel compression, and outputs the final feature map to the Unet decoder. The resulting Unet network is the constructed illumination map generator: an image is input and its illumination map is output;
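A minimal sketch of such a channel compression submodule, modeling the stride-1 convolution as a 1x1 channel-mixing matrix (the random kernel stands in for learned weights; names are illustrative):

```python
import numpy as np

def channel_compress(F, ratio):
    """Sketch of the channel compression submodule: a stride-1, 1x1
    convolution that keeps the spatial size and shrinks the channel
    count by `ratio` (the per-skip-layer compression ratio)."""
    c, h, w = F.shape
    c_out = max(1, c // ratio)
    kernel = np.random.default_rng(0).standard_normal((c_out, c))  # 1x1 conv weights
    # a 1x1 convolution is just a matrix multiply over the channel axis
    out = np.tensordot(kernel, F, axes=([1], [0]))   # -> (c_out, h, w)
    return out
```

Fewer output channels at skip layers that carry mainly high-frequency features gives the decoder less high-frequency content to reintroduce into the illumination map.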
Step 3: Construct the discriminator
The discriminator consists of a four-layer convolutional neural network. When the reflection map generator or the illumination map generator is trained, its output reflection map or illumination map is fed into the discriminator, which compares the input map against the label image and outputs the probability that the two are consistent;
the reflection map generator and the illumination map generator are each combined with the discriminator for training;
Step 4: Define the loss functions
Step 4-1: define the generator loss function as:
$$L_G = L_{GAN\text{-}G} + L_{mse} + L_{cos} + L_{bf} + L_{feat} \tag{7}$$
where $L_{GAN\text{-}G}$ denotes the adversarial loss of the generator, $L_{mse}$ the mean square error loss, $L_{cos}$ the cosine loss, $L_{bf}$ the cross-bilateral filtering loss, and $L_{feat}$ the feature loss;
the adversarial loss $L_{GAN\text{-}G}$ is computed as:
$$L_{GAN\text{-}G} = \sum_{i=1}^{x} W_i\,\mathrm{MSE}\bigl(fake\_output_i,\ ones\bigr) \tag{8}$$
where $W_i$ denotes the normalized weight of the $i$-th network layer, $i$ the layer index, $fake\_output_i$ the probability map that the output image is fake, $ones$ an all-ones map, and $x$ the number of network layers;
the mean square error loss $L_{mse}$ is computed as:
$$L_{mse} = \sum_{j} \lambda_j\,\mathrm{MSE}\bigl(fake\_image_j,\ true\_image_j\bigr) \tag{9}$$
where $fake\_image_j$ denotes the image generated from the $j$-th-from-last feature map of the decoder, $true\_image_j$ the image label obtained by scaling the input image by a factor of $j$, and $\lambda_j$ the per-scale weight;
the cosine loss $L_{cos}$ is computed as:
$$L_{cos} = \sum_{k=1}^{y}\bigl(1 - \cos(fake\_region_k,\ true\_region_k)\bigr) \tag{10}$$
where $fake\_region_k$ denotes the $k$-th block region of the generated image, $true\_region_k$ the $k$-th block region of the label image, and $y$ the number of image regions;
the cross-bilateral filtering loss $L_{bf}$ constrains each decomposed map in the set $\{A, S\}$ (the reflection map and the illumination map) against its cross-bilaterally filtered version bf, with the filter output given by:
$$J_p = \frac{1}{W_p}\sum_{q\in N(p)} G_{\sigma_s}\bigl(\|p-q\|\bigr)\, G_{\sigma_r}\bigl(|C_p - C_q|\bigr)\, C_q \tag{11}$$
where $C$ denotes the label image, $J_p$ the output of the bilateral filter at pixel $p$, $C_p$ the value of the $p$-th pixel of the label image, $W_p$ the normalization weight, $q$ the index of a neighboring pixel of $p$, $N(p)$ the set of neighboring pixel positions of the $p$-th pixel, $G_{\sigma_s}$ the spatial Gaussian kernel, $G_{\sigma_r}$ the range Gaussian kernel, and $C_q$ the value of neighboring pixel $q$;
the feature loss $L_{feat}$ is computed as:
$$L_{feat} = \sum_{l}\frac{1}{F_l H_l W_l}\,\bigl\|\phi_l(G(x)) - \phi_l(y)\bigr\|_2^2 \tag{12}$$
where $l$ denotes the $l$-th layer of the network, $F_l$ the number of channels of the $l$-th layer feature map, $H_l$ its height, $W_l$ its width, $\phi_l(\cdot)$ the feature activation of the $l$-th layer, $G(x)$ the generator output image, and $y$ the label image;
Step 4-2: the discriminator loss function is defined as:
$$L_D = \sum_{i=1}^{x} W_i\Bigl(\mathrm{MSE}\bigl(real\_output_i,\ ones\bigr) + \mathrm{MSE}\bigl(fake\_output_i,\ zeros\bigr)\Bigr) \tag{13}$$
where $zeros$ denotes an all-zeros map (probability 0) and $real\_output_i$ the probability map that the input image is real;
Step 5: Network training
Using the training image sample library constructed in step 1, train the combination of the reflection map generator and the discriminator and the combination of the illumination map generator and the discriminator separately; update the network parameters with the Adam optimization method and stop training when the loss function values defined in step 4 reach their minimum, obtaining the final reflection map generator and illumination map generator;
Step 6: Input the original image to be processed into the reflection map generator or illumination map generator obtained in step 5; the output image is the reflection map or illumination map decomposed from the original image.
The invention has the following beneficial effects: by adopting the intrinsic image decomposition method based on Unet skip-layer frequency division and multi-scale discrimination, it resolves both the problem that native skip connections treat every feature of the feature map equally and the problem of high-frequency noise being fed into the illumination map decoder.
Drawings
FIG. 1 is a flow chart of the intrinsic image decomposition method of the present invention.
Fig. 2 is a schematic diagram of the frequency decomposition submodule structure of the present invention.
FIG. 3 is a schematic diagram of a reflection map generator network architecture in accordance with the present invention.
Fig. 4 is a schematic diagram of the network architecture of the illumination map generator of the present invention.
Fig. 5 is a schematic diagram of the network structure of the discriminator of the present invention.
FIG. 6 is an illustration of the results of the method of the present invention, wherein FIG. 6(a) is the original image, FIG. 6(b) is the reflectance map, and FIG. 6(c) is the illumination map.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
As shown in fig. 1, the present invention provides an intrinsic image decomposition method based on Unet skip-layer frequency division and multi-scale discrimination, following the steps set forth above.
Example:
(1) Construct the training image sample library
The MPI image data set is used. The commonly used scenes of the MPI data set fall into 9 major categories, each major category containing two subcategories of 50 images each. Two split modes are used when constructing the training image sample library: image-split and scene-split.
In image-split mode, half of the images are taken from each of the 18 subcategories of the data set. Each image is 1024x436; 10 patches of size 256x256 are randomly sampled from each image and then horizontally flipped, so each image yields 20 patches. The training set totals 9000 (18x25x20) patches of size 256x256, and the test set uses 450 (18x25) full images of size 1024x436.
In scene-split mode, one subcategory from each major category is used for training and the other for testing; the two defective subcategories "bandage_1" and "shaman_3" are removed. Patches are acquired in the same way as in image-split mode, giving 9000 (9x50x20) training patches of size 256x256 and 350 (7x50) test images of size 1024x436.
(2) As shown in fig. 2 to 5, the reflection map generator and the illumination map generator are constructed by the method of step 2; the Unet used is a four-layer network. The encoder of the reflection map generator's Unet uses a convolution layer, a batch normalization layer and a LeakyReLU activation layer as its downsampling block; the convolution stride is 2, so the feature map size is halved after each convolution. The output of each activation layer in the encoder is fed through the skip connection into a frequency decomposition submodule, and the encoder channel progression is [3, 32, 64, 128, 256]. The output of the frequency decomposition submodule is fed to the decoder, whose convolution stride is 1.
The channel compression submodule of the illumination map generator is implemented with a convolution layer of stride 1, which leaves the feature map size unchanged and compresses the number of channels in different proportions. Since the illumination map contains few high-frequency components, skip layers carrying mainly high-frequency features use a large compression ratio, while those carrying the more abundant low-frequency features use a small one.
(3) The discriminator is a four-layer convolutional neural network whose convolutions have stride 2, so the feature map size is halved after each layer. The channel transitions of the four layers are 3 to 64, 64 to 128, 128 to 256 and 256 to 512, and the output of each convolution layer is compressed into a single-channel feature probability map. When the discriminator judges an image real, all single-channel probability maps should approach 1; when it judges it fake, they should approach 0.
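The shape bookkeeping of this four-layer discriminator can be checked with a short sketch; kernel size 4 and padding 1 are illustrative assumptions, since the embodiment states only the stride and channel counts:

```python
def conv_out(size, kernel=4, stride=2, pad=1):
    """Output spatial size of one stride-2 convolution; kernel size 4 and
    padding 1 are illustrative assumptions (the embodiment states only
    the stride and the channel counts)."""
    return (size + 2 * pad - kernel) // stride + 1

channels = [3, 64, 128, 256, 512]
size = 256  # a 256x256 training patch
shapes = []
for c_in, c_out in zip(channels, channels[1:]):
    size = conv_out(size)
    shapes.append((c_out, size, size))
# shapes: [(64, 128, 128), (128, 64, 64), (256, 32, 32), (512, 16, 16)]
```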
(4) The generator loss function is computed according to step 4; in the adversarial loss, the weights of the first and last network layers are set to 4 and those of the two middle layers to 1.
When computing the mean square error, the feature maps of the last three decoder layers are used to generate full-, half- and quarter-resolution images, constraining three different scales with weights 1, 0.8 and 0.6 respectively.
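A minimal sketch of this multi-scale constraint; strided subsampling stands in for the label scaling, and the function name is illustrative:

```python
import numpy as np

def multiscale_mse(outputs, label, weights=(1.0, 0.8, 0.6)):
    """Sketch of the multi-scale MSE term: the full-, half- and
    quarter-resolution decoder outputs are each compared against a
    correspondingly downscaled label, with weights 1, 0.8 and 0.6.
    Strided subsampling is an illustrative stand-in for proper
    image scaling."""
    loss = 0.0
    for out, wgt in zip(outputs, weights):
        step = label.shape[0] // out.shape[0]
        small = label[::step, ::step]          # naive downscale of the label
        loss += wgt * np.mean((out - small) ** 2)
    return loss
```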
To better preserve edge features when computing the cosine loss and keep the edges of the generated image consistent with the label image, the input image is divided into 4 blocks, and the cosine similarity between each generated block and the corresponding label block is enforced.
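The block-wise cosine constraint can be sketched as follows; the exact aggregation over blocks is an assumption, since the text states only that each block's cosine similarity with its label block is enforced:

```python
import numpy as np

def cosine_block_loss(fake, true, blocks=2):
    """Sketch of the cosine loss: split both images into blocks x blocks
    regions (2x2 = 4 blocks in the embodiment) and penalize
    1 - cosine similarity per region pair, averaged over regions."""
    h, w = fake.shape[:2]
    bh, bw = h // blocks, w // blocks
    loss = 0.0
    for i in range(blocks):
        for j in range(blocks):
            f = fake[i*bh:(i+1)*bh, j*bw:(j+1)*bw].ravel()
            t = true[i*bh:(i+1)*bh, j*bw:(j+1)*bw].ravel()
            cos = f @ t / (np.linalg.norm(f) * np.linalg.norm(t) + 1e-8)
            loss += 1.0 - cos
    return loss / (blocks * blocks)
```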
When calculating the discriminator loss function, the weights of the first layer and the last layer are 4, and the weights of the two middle layers are 1.
(5) Training uses samples from the training image sample library. The reflection map and the illumination map use separate generators and discriminators of identical network structure, trained independently. The networks are optimized with the Adam method; the generator and discriminator require separate Adam optimizers, with parameters beta = (0.5, 0.999), learning rate 0.0005, weight_decay 0.0001 and batch size 20. The generator and discriminator are trained alternately (TTUR), with 5 discriminator updates per generator update.
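The 5-to-1 alternating update schedule can be sketched as a plain Python helper (names are illustrative):

```python
def training_schedule(total_iters, d_steps_per_g=5):
    """Sketch of the alternating schedule: per iteration the discriminator
    is updated d_steps_per_g times for every generator update (the 5:1
    ratio stated in the embodiment). Returns the update sequence."""
    seq = []
    for _ in range(total_iters):
        seq.extend(["D"] * d_steps_per_g)  # discriminator updates
        seq.append("G")                    # one generator update
    return seq
```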
(6) As shown in fig. 6, the original image to be processed is input into the reflection map generator or the illumination map generator, and the output image is the reflection map or the illumination map obtained by decomposing the original image.
To quantitatively evaluate the performance of the method of the invention, tests were performed on the MPI intrinsic image dataset and compared with the algorithm of Fan et al., "Revisiting Deep Intrinsic Image Decompositions," Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, IEEE, 2018. The comparison results are shown in table 1 (bold indicates the best value for each index).
TABLE 1 Performance indices of several intrinsic image decomposition methods
As can be seen from Table 1, the method of the invention achieves the best performance on MSE, LMSE and DSSIM, improving considerably on the existing method in every index, which fully demonstrates the effectiveness and practicability of the method of the invention.
Claims (1)
1. An intrinsic image decomposition method based on Unet skip layer frequency division and multi-scale discrimination is characterized by comprising the following steps:
step 1: constructing a training image sample library
Randomly extract B images from a test image data set; from each image, randomly sample M small images of size N×N and horizontally flip each of them to obtain another M small images, giving 2×M small images per image; all 2×M×B small images obtained from the B extracted images form the training image sample library;
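A minimal sketch of this sample-library construction, assuming images are NumPy arrays in H×W×3 layout:

```python
# Build the training sample library: M random N x N crops per image,
# plus the horizontal flip of each crop, giving 2*M*B patches in total.
import numpy as np

def build_sample_library(images, M, N, rng=None):
    """images: list of H x W x 3 arrays; returns (2*M*len(images), N, N, 3)."""
    rng = rng or np.random.default_rng(0)
    patches = []
    for img in images:
        H, W = img.shape[:2]
        for _ in range(M):
            y = rng.integers(0, H - N + 1)   # random top-left corner
            x = rng.integers(0, W - N + 1)
            crop = img[y:y + N, x:x + N]
            patches.append(crop)
            patches.append(crop[:, ::-1])    # horizontal flip of the crop
    return np.stack(patches)

lib = build_sample_library([np.zeros((32, 48, 3))] * 5, M=4, N=16)
print(lib.shape)  # (40, 16, 16, 3): 2 * 4 * 5 patches
```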
step 2: structure generator
Step 2-1: structural reflection diagram generator
In the Unet network, a frequency decomposition submodule is added to each skip layer; the input of the frequency decomposition submodule is a feature map output by the Unet encoder, its output is a new feature map after frequency decomposition, and this new feature map is fed to the decoder of the Unet network; the resulting Unet network is the constructed reflection map generator, which takes an image as input and outputs the reflection map of that image;
The frequency decomposition submodule performs the following frequency decomposition process:
Define the feature map of the i-th layer of the Unet encoder as $X_i \in \mathbb{R}^{c \times h \times w}$, where c is the number of channels, h the height and w the width of the feature map, and apply global max pooling to it:
$$X_i^{g} = \mathrm{GMP}(X_i) \qquad (1)$$
In formula (1), $X_i^{g} \in \mathbb{R}^{c}$ is the feature vector obtained after the global max pooling operation GMP;
The result of formula (1) is then passed through a fully connected layer $FC_1$, which compresses the number of channels of the feature map:
$$X_i^{fc_1} = FC_1(X_i^{g}) \qquad (2)$$
The result of formula (2) then passes through a ReLU activation function layer:
$$X_i^{relu} = \mathrm{ReLU}(X_i^{fc_1}) \qquad (3)$$
In formula (3), $X_i^{relu}$ is the feature vector obtained after the ReLU activation function layer;
The result of formula (3) is passed through a fully connected layer $FC_2$, which restores the initial number of channels of the feature map:
$$X_i^{fc_2} = FC_2(X_i^{relu}) \qquad (4)$$
The result of formula (4) is then processed by a sigmoid activation function layer to obtain normalized weight parameters:
$$w_i = \mathrm{sigmoid}(X_i^{fc_2}) \qquad (5)$$
Finally, the normalized weight parameters $w_i$ are multiplied channel-wise with the i-th layer feature map $X_i$ to obtain the new feature map after frequency decomposition:
$$\hat{X}_i = w_i \odot X_i \qquad (6)$$
step 2-2: structured illumination pattern generator
In the Unet network, a frequency decomposition submodule and a channel compression submodule are added to each skip layer; the input of the frequency decomposition submodule is a feature map output by the Unet encoder; it performs frequency decomposition on the feature map according to formulas (1) to (6) of step 2-1 and outputs a new feature map to the channel compression submodule; the channel compression submodule of each skip layer sets a different compression ratio according to the layer's position in the Unet encoder, performs channel compression and outputs the final feature map to the decoder of the Unet network; the resulting Unet network is the constructed illumination map generator, which takes an image as input and outputs the illumination map of that image;
and step 3: structure discriminator
The discriminator consists of a four-layer convolutional neural network; when the reflection map generator or the illumination map generator is trained, the reflection map or illumination map it outputs is input into the discriminator, which compares the input map with the label image and outputs the probability that the map is consistent with the label image;
The reflection map generator and the illumination map generator are each combined with a discriminator for training;
and 4, step 4: defining a loss function
Step 4-1: define the generator loss function as:
$$L_G = L_{GAN\text{-}G} + L_{mse} + L_{cos} + L_{bf} + L_{feat} \qquad (7)$$
where $L_{GAN\text{-}G}$ is the intrinsic (adversarial) loss function, $L_{mse}$ the mean square error function, $L_{cos}$ the cosine loss function, $L_{bf}$ the cross bilateral filtering loss function, and $L_{feat}$ the feature loss function;
The intrinsic loss function $L_{GAN\text{-}G}$ is calculated as follows:
$$L_{GAN\text{-}G} = \sum_{i=1}^{x} W_i \left\| fake\_output_i - ones \right\|^2 \qquad (8)$$
where $W_i$ is the normalization weight parameter of the i-th layer, i is the layer index, $fake\_output_i$ is the probability map that the output image is false, $ones$ is a map of all 1s, and x is the number of network layers;
mean square error function LmseThe calculation formula of (a) is as follows:
in the formula, fake _ imageiOutput, true _ image, representing the characteristic map of the i-last layer of the decoderjAn image tag representing scaling an input image by j times;
cosine loss function LcosThe calculation formula of (a) is as follows:
in the formula, fake _ regionkThe kth block region, true _ region, representing the generated imagekA k-th block area representing a label image, y representing the number of image areas;
cross bilateral filter loss function LbfThe calculation formula of (a) is as follows:
wherein bf represents double sideband filtering, C represents label image, { A, S } represents reflection map and illumination map set, JpRepresenting the output of the bilateral filter, CpValue, N, representing the p-th pixel of the label imagepDenotes the total number of p pixels and neighboring pixels, WpDenotes the normalized weight, q denotes the sequence number of the neighboring pixel of p, n (p) denotes the set of neighboring pixel positions of the p-th pixel,representing a spatial gaussian kernel, p represents the serial number of the p-th pixel,denotes the range Gaussian kernel, CqRepresents the value of the neighboring pixel q;
characteristic loss function LfeatThe calculation formula of (a) is as follows:
where l denotes the l-th layer of the network, FlNumber of channels, H, representing the characteristic diagram of the l-th layerlDenotes the height of the ith layer profile, W denotes the width of the ith layer profile,a feature activation value representing the ith layer of the image,a representation generator output image;
step 4-2: the discriminator loss function is defined as follows:
where zeros represents a probability of 0,representing the probability that the output image is true;
and 5: network training
Using the training image sample library constructed in step 1, train the combination of the reflection map generator and its discriminator and the combination of the illumination map generator and its discriminator respectively; update the network parameters with the Adam optimization method, and stop training when the loss function values defined in step 4 reach their minimum, obtaining the final reflection map generator and illumination map generator;
step 6: and inputting the original image to be processed into the step 5 to obtain a reflection map generator or an illumination map generator, wherein the output image is a reflection map or an illumination map obtained by decomposing the original image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010319106.3A CN111563577B (en) | 2020-04-21 | 2020-04-21 | Unet-based intrinsic image decomposition method for skip layer frequency division and multi-scale identification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111563577A CN111563577A (en) | 2020-08-21 |
CN111563577B true CN111563577B (en) | 2022-03-11 |
Family
ID=72071688
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010319106.3A Active CN111563577B (en) | 2020-04-21 | 2020-04-21 | Unet-based intrinsic image decomposition method for skip layer frequency division and multi-scale identification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111563577B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112150432A (en) * | 2020-09-22 | 2020-12-29 | 电子科技大学 | Optical excitation infrared nondestructive testing method based on generation countermeasure network |
CN113034353B (en) * | 2021-04-09 | 2024-07-12 | 西安建筑科技大学 | Intrinsic image decomposition method and system based on cross convolution neural network |
CN113573047B (en) * | 2021-07-16 | 2022-07-01 | 北京理工大学 | Video quality evaluation method based on eigen-map decomposition and motion estimation |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101794389A (en) * | 2009-12-30 | 2010-08-04 | 中国科学院计算技术研究所 | Illumination pretreatment method of facial image |
CN104598933A (en) * | 2014-11-13 | 2015-05-06 | 上海交通大学 | Multi-feature fusion based image copying detection method |
CN105956995A (en) * | 2016-04-19 | 2016-09-21 | 浙江大学 | Face appearance editing method based on real-time video proper decomposition |
CN106156503A (en) * | 2016-07-05 | 2016-11-23 | 中国矿业大学 | A kind of multi-scale entropy characterizing method of anchor system internal flaw distribution |
CN106157264A (en) * | 2016-06-30 | 2016-11-23 | 北京大学 | Large area image uneven illumination bearing calibration based on empirical mode decomposition |
CN108171741A (en) * | 2017-12-22 | 2018-06-15 | 河南科技大学 | A kind of image texture decomposition method based on adaptive multidirectional empirical mode decomposition |
CN108805188A (en) * | 2018-05-29 | 2018-11-13 | 徐州工程学院 | A kind of feature based recalibration generates the image classification method of confrontation network |
CN109249546A (en) * | 2017-07-13 | 2019-01-22 | 长春工业大学 | A kind of vibration rotary cutting apparatus and its Identification of Chatter method in place |
CN110018517A (en) * | 2019-05-07 | 2019-07-16 | 西安石油大学 | A kind of multiple dimensioned ground micro-seismic inverse time interference localization method |
CN110148083A (en) * | 2019-05-17 | 2019-08-20 | 东南大学 | Image interfusion method based on fast B EMD and deep learning |
CN110503614A (en) * | 2019-08-20 | 2019-11-26 | 东北大学 | A kind of Magnetic Resonance Image Denoising based on sparse dictionary study |
CN110675381A (en) * | 2019-09-24 | 2020-01-10 | 西北工业大学 | Intrinsic image decomposition method based on serial structure network |
CN110728633A (en) * | 2019-09-06 | 2020-01-24 | 上海交通大学 | Multi-exposure high-dynamic-range inverse tone mapping model construction method and device |
Non-Patent Citations (8)
Title |
---|
Intrinsic Image Decomposition for Image Enhancement;V.S. ASWATHY等;《2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI)》;20181203;20-23 * |
Intrinsic Image Decomposition Using Multi-Scale Measurements and Sparsity;Shouhong Ding等;《COMPUTER GRAPHICS》;20171231;第36卷(第6期);251-261 * |
Intrinsic Image Decomposition: A Comprehensive Review;Yupeng Ma等;《Image and Graphics》;20171230;626-638 * |
Intrinsic Image Transformation via Scale Space Decomposition;Lechao Cheng等;《2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition》;20181217;656-665 * |
Analysis and Research on Key Technologies of Facial Expression Image Recognition; Lu Yang; China Doctoral Dissertations Full-text Database, Information Science and Technology; 20200215; vol. 2020, no. 2; I138-75 *
Research on Image Fusion Algorithms Based on Multi-scale Analysis; Yin Xiang; China Master's Theses Full-text Database, Information Science and Technology; 20190115; vol. 2019, no. 1; I138-3382 *
Research on Exemplar-based and Deep-learning-based Image Inpainting Methods; Qiang Zhenping; China Doctoral Dissertations Full-text Database, Information Science and Technology; 20200115; vol. 2020, no. 1; I138-121 *
Research on Face Image Classification Based on Sparse Representation and Discriminant Analysis Algorithms; Liu Zi; China Doctoral Dissertations Full-text Database, Information Science and Technology; 20180715; vol. 2018, no. 7; I138-41 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||