CN114612727A - Essential image decomposition method research based on bilateral feature pyramid network and multi-scale identification

Essential image decomposition method research based on bilateral feature pyramid network and multi-scale identification

Info

Publication number
CN114612727A
CN114612727A (application CN202210290919.3A)
Authority
CN
China
Prior art keywords
network
image
generator
map
reflection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210290919.3A
Other languages
Chinese (zh)
Inventor
蒋晓悦
王众鹏
冯晓毅
夏召强
韩逸飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202210290919.3A priority Critical patent/CN114612727A/en
Publication of CN114612727A publication Critical patent/CN114612727A/en
Priority to CN202310136169.9A priority patent/CN116188791A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

For the intrinsic image decomposition task, the invention provides a reconstruction method with parallel, locally frequency-selective paths that accurately reconstructs the reflection map and the illumination map. Intrinsic image decomposition is an under-constrained problem. Intrinsic image reconstruction based on encoder-decoder networks offers an effective solution, but its results still fall short, so the information in each frequency band must be selected more precisely to obtain a more accurate decomposition. The proposed network uses two parallel generative adversarial networks as backbones to reconstruct the reflection map and the illumination map separately. For the generator, the invention proposes a strategy of locally frequency-divided feature fusion, which selects and retains high-frequency reflection features and low-frequency illumination features respectively. A multi-scale adaptive combination module is added to the discriminator to adaptively weigh the contribution of multi-scale features, strengthening the discrimination and in turn improving the generation. The invention further constructs several loss functions to constrain the generated results and facilitate network training. The proposed algorithm performs well on several data sets: on the MPI-Sintel data set the reconstruction mean square error of its best result is 13.26% lower than that of the other compared methods, and on the ShapeNet data set the reconstruction mean square error of its best result is 26.09% lower.

Description

Essential image decomposition method research based on bilateral feature pyramid network and multi-scale identification
Technical field:
The invention belongs to the field of image processing and particularly relates to an intrinsic image decomposition method.
Background art:
Many environmental factors change during the imaging of an object and affect the appearance of the scene to different degrees, including changes in illumination intensity, illumination incidence angle, shadow occlusion and pose. These changes are ultimately reflected in the image: the same target object photographed under different environmental conditions yields different images, which brings difficulties to subsequent image understanding tasks.
To solve this problem, features that do not change with environmental factors, i.e., intrinsic feature images, must be extracted from the image. Intrinsic features are inherent properties of the object itself and are not affected by environmental factors; for an object they include colour, texture and material, none of which change with the environment. If the intrinsic information such as colour, texture and material can be separated from the environmental information, and the image components affected by the environment filtered out, a more accurate description of the object's characteristics is obtained, which benefits the performance of other image processing tasks.
Intrinsic image decomposition is one of the low-level tasks of computer vision. It decomposes an image into two parts: a reflection map carrying colour, texture and material, and an illumination map carrying shape and lighting information. Because the reflection map does not change with environmental factors, the decomposed reflection map can serve as the input to other image processing tasks, greatly reducing the difficulty of image analysis and giving the processing the robustness of illumination invariance. The reflectance and illumination maps of an image are shown in Fig. 1. Meanwhile, with continued research into deep learning, intrinsic image analysis algorithms based on convolutional neural networks have made great progress in speed and accuracy, laying a solid theoretical foundation for improving the robustness of high-level image processing tasks such as autonomous driving and for accelerating their industrial application.
Solutions to intrinsic image decomposition fall mainly into two categories: optimization methods based on explicit constraints, and deep learning methods based on implicit constraints.
Optimization methods based on explicit constraints mostly adopt prior constraints on the intrinsic images and solve the optimization problem within the constrained domain. Their performance depends on the reasonableness of the prior constraints and on the convergence of the convex optimization function: reasonable priors and a good optimization function can prevent the model from converging to a local optimum. Explicit-constraint optimization methods do not require labels, but their applicability is limited: in practice the illumination conditions are very complicated, with highlights, occlusion, specular reflection and so on, and the proposed prior constraints cannot handle all these situations. Implicit constraints obtained from label-based learning generalize to these situations better than explicit constraints and are the current mainstream research direction.
Object of the invention:
To overcome the shortcomings of the prior art, the invention explores and proposes a brand-new intrinsic image algorithm based on a generative adversarial network. On the generator side, taking the U-Net algorithm as a prototype, the invention creatively adds a modified bilateral feature pyramid module to the skip layers of the U-Net, so that the encoder features are enhanced and selected before being sent to the decoder, improving the intrinsic image decomposition. A multi-scale adaptive combination module is added to the discriminator to predict at several feature scales, strengthening the discrimination and further improving the generation. On the one hand, a frequency decomposition constraint is added to the skip connections of the reflection map U-Net, so that the network can learn the importance of different features and obtain more suitable features. On the other hand, frequency decomposition and frequency compression are added to the skip connections of the illumination map, which not only yields a more appropriate feature map but also addresses the problem of high-frequency components in the illumination map.
Summary of the invention:
To achieve the above object, the invention provides an intrinsic image decomposition method based on a bilateral feature pyramid and multi-scale identification, whose network structure is shown in Fig. 2. The method comprises the following steps:
step 1: construction of training image sample library
(1) Constructing a training image sample library
Randomly extract a certain number of images from the image data set, randomly sample M small images of size N×N from each image, and flip them horizontally or vertically to obtain M new small images, so that each image yields 2×M small images; performing this operation on all the extracted images gives the small images that form the training image sample library.
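The following is a minimal Python sketch of this patch sampling, assuming the PIL library; the file handling and function names are illustrative and not part of the original disclosure.

```python
# Sketch of the Step-1 sample library: M random N x N crops per image plus flipped copies.
import random
from PIL import Image

def sample_patches(image_path, m=10, n=256):
    """Cut m random n x n patches from one image and add a flipped copy of each (2*m total)."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    patches = []
    for _ in range(m):
        x = random.randint(0, w - n)
        y = random.randint(0, h - n)
        patch = img.crop((x, y, x + n, y + n))
        patches.append(patch)
        # horizontal or vertical flip, chosen at random, doubles the patch count
        flip = Image.FLIP_LEFT_RIGHT if random.random() < 0.5 else Image.FLIP_TOP_BOTTOM
        patches.append(patch.transpose(flip))
    return patches

def build_sample_library(image_paths, m=10, n=256):
    library = []
    for p in image_paths:
        library.extend(sample_patches(p, m, n))
    return library
```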
Step 2: construct generators
(1) Construct the reflection map generator
The generator network takes the U-Net structure as its template and introduces a bilateral feature pyramid network into the skip connection layers from the U-Net encoder to the decoder, so as to strengthen the encoder's ability to decompose the effective features of the original image and suppress the generation of invalid features. The network adopts a symmetric structure and the codec has 5 layers. Each encoder layer consists of a downsampling layer and a convolutional layer, and the channels of the corresponding convolutional layers are 16, 32, 64, 128 and 256 in turn. Each decoder layer consists of an upsampling layer and a convolutional layer. Except for the last convolutional layer of the encoder and the last convolutional layer of the decoder, every convolutional layer adds a batch normalization layer to its convolution operation to speed up network training. Following common practice, the first layers of the generator network use Leaky-ReLU activations and the final output layer uses a Tanh activation. An image input to the reflection map generator is then output as the reflection map of that image.
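A minimal PyTorch sketch of such a U-Net backbone is given below. The kernel sizes, the upsampling mode and the `skip_module` hook for the bilateral feature pyramid are assumptions; only the layer count, channel widths, activation functions and batch-normalization placement follow the description above.

```python
import torch
import torch.nn as nn

def down_block(c_in, c_out, use_bn=True):
    # stride-2 convolution halves the feature map, as described for the encoder
    layers = [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1)]
    if use_bn:
        layers.append(nn.BatchNorm2d(c_out))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return nn.Sequential(*layers)

def up_block(c_in, c_out, use_bn=True, last=False):
    layers = [nn.Upsample(scale_factor=2, mode="nearest"),
              nn.Conv2d(c_in, c_out, 3, padding=1)]
    if use_bn:
        layers.append(nn.BatchNorm2d(c_out))
    layers.append(nn.Tanh() if last else nn.LeakyReLU(0.2, inplace=True))
    return nn.Sequential(*layers)

class UNetGenerator(nn.Module):
    """U-Net backbone; `skip_module` stands in for the BiFPN on the skip connections."""
    def __init__(self, skip_module=None, chans=(16, 32, 64, 128, 256)):
        super().__init__()
        chans = tuple(chans)
        enc_in = (3,) + chans[:-1]
        self.encoder = nn.ModuleList([
            down_block(c_in, c_out, use_bn=(k < len(chans) - 1))
            for k, (c_in, c_out) in enumerate(zip(enc_in, chans))])
        dec_out = chans[-2::-1] + (3,)                       # 128, 64, 32, 16, 3
        dec_in = (chans[-1],) + tuple(2 * c for c in chans[-2:0:-1]) + (2 * chans[0],)
        self.decoder = nn.ModuleList([
            up_block(c_in, c_out, use_bn=(k < len(chans) - 1), last=(k == len(chans) - 1))
            for k, (c_in, c_out) in enumerate(zip(dec_in, dec_out))])
        self.skip_module = skip_module                        # e.g. a stack of BiFPN blocks

    def forward(self, x):
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        if self.skip_module is not None:
            skips = self.skip_module(skips)                   # enhance/select encoder features
        x = self.decoder[0](skips[-1])
        for dec, s in zip(self.decoder[1:], reversed(skips[:-1])):
            x = dec(torch.cat([x, s], dim=1))
        return x
```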
The BIFPN-A module is used for generating the reflection map, and the BIFPN-B module is used for generating the illumination map. Unlike an ordinary skip connection layer, the input of a BIFPN module is not a single channel but 5 skip connection channels, whose frequency increases gradually from top to bottom. Both BIFPN modules are formed by stacking 3 BIFPN blocks, and a BIFPN block has a reflection-map variant and an illumination-map variant. The block structure of the BIFPN-A module is shown in Fig. 3. It keeps the operation of computing intermediate features from low frequency to high frequency and then combining the intermediate features from high frequency to low frequency to compute the output features. Because the high-frequency features can guide the synthesis of the low-frequency features, the output contains richer high-frequency features. The block of Fig. 3 divides the 5 paths into a high-frequency path and a medium-and-low-frequency path: the high-frequency path comprises the lower two layers, and the medium-and-low-frequency path comprises the upper three layers. When the intermediate features are computed, the two paths are computed separately, which isolates the high-frequency features from the medium-and-low-frequency features to a certain extent instead of simply fusing them together directly.
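The following PyTorch sketch illustrates one BIFPN-A-style block. The exact wiring of Fig. 3 is not reproduced; the softmax-weighted fusion, the assumption that all 5 inputs have already been unified to 64 channels, and the ordering (index 0 = lowest frequency, index 4 = highest frequency) are simplifications introduced here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseConv(nn.Module):
    """Softmax-weighted fusion of feature maps followed by a 3x3 conv (BiFPN-style)."""
    def __init__(self, n_inputs, channels=64):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.conv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels),
                                  nn.LeakyReLU(0.2, inplace=True))

    def forward(self, feats):
        w = torch.softmax(self.w, dim=0)
        fused = sum(wi * f for wi, f in zip(w, feats))
        return self.conv(fused)

def match(src, ref):
    """Resample src to the spatial size of ref (up or down as needed)."""
    return F.interpolate(src, size=ref.shape[-2:], mode="nearest")

class BiFPNBlockA(nn.Module):
    """Sketch of a BIFPN-A block: features ordered from lowest frequency (index 0,
    coarsest) to highest frequency (index 4, finest).  Intermediate features are built
    low -> high inside two separate groups (indices 0-2 and 3-4), then outputs are
    built high -> low, joining the groups, so high-frequency detail guides the
    reflection branch."""
    def __init__(self, channels=64, levels=5, split=3):
        super().__init__()
        self.split = split
        self.mid = nn.ModuleDict({str(i): FuseConv(2, channels)
                                  for i in range(levels) if i not in (0, split)})
        self.out = nn.ModuleList([FuseConv(2, channels) for _ in range(levels)])

    def forward(self, feats):
        # intermediate pass, low frequency -> high frequency, computed per group
        mid = []
        for i, f in enumerate(feats):
            if i == 0 or i == self.split:      # first level of each group: nothing to fuse yet
                mid.append(f)
            else:
                mid.append(self.mid[str(i)]([f, match(mid[-1], f)]))
        # output pass, high frequency -> low frequency; the two groups are joined here
        outs = [None] * len(feats)
        outs[-1] = self.out[-1]([feats[-1], mid[-1]])
        for i in range(len(feats) - 2, -1, -1):
            outs[i] = self.out[i]([mid[i], match(outs[i + 1], mid[i])])
        return outs
```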
(2) Construct the illumination map generator
The block structure of BIFPN-B is shown in Fig. 4. Compared with Fig. 3, the upsampling and downsampling operations are simply swapped: the intermediate features are computed from high frequency to low frequency, and the output features are then computed from low frequency to high frequency by combining the intermediate features. The 5 features of different scales input to the two BIFPN networks are first fed together into the BIFPN block; these features of different scales are first unified to 64 channels, after which the network assigns the input features different weights and fuses the features of different scales to obtain the final fused features. Compared with BIFPN-A, BIFPN-B adds a final channel compression consisting of a 1×1 convolutional layer and a Leaky ReLU activation layer. This 1×1 convolution reduces the number of channels of the input features to one eighth of the original number, so that the illumination map branch removes high-frequency features as far as possible without losing important high-frequency information.
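A short sketch of the channel compression appended to the BIFPN-B branch is given below; the default channel count and negative slope are illustrative assumptions.

```python
import torch.nn as nn

class ChannelCompression(nn.Module):
    """1x1 convolution + LeakyReLU that reduces the channel count to one eighth,
    appended to the BIFPN-B (illumination) branch to suppress high-frequency detail."""
    def __init__(self, channels=64, ratio=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, max(channels // ratio, 1), kernel_size=1),
            nn.LeakyReLU(0.2, inplace=True))

    def forward(self, x):
        return self.body(x)
```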
Step 3: Construct the discriminator
The discriminator consists of four layers of convolutional neural networks. The network structure is shown in fig. 5. When the reflection map generator or the illumination map generator is trained, the reflection map or the illumination map output by the reflection map generator or the illumination map generator is input into the discriminator, the discriminator compares the input reflection map or the illumination map with the label image, and the probability that the reflection map or the illumination map is consistent with the label image is output.
The reflection map generator is used in combination with a discriminator to train the reflection map generator. The illumination pattern generator is used in combination with a discriminator to train the illumination pattern generator.
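The sketch below shows one way such a four-layer discriminator could look in PyTorch, with each layer also projected to a single-channel probability map (the per-layer channel counts follow the embodiment in Step 3 below); kernel sizes and the sigmoid heads are assumptions, not the exact layout of Fig. 5.

```python
import torch
import torch.nn as nn

class MultiScaleDiscriminator(nn.Module):
    """Four stride-2 convolution layers (3->64->128->256->512); the output of every
    layer is also projected to a single-channel probability map so that several
    scales can be weighed by the adaptive combination."""
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 512]
        self.blocks = nn.ModuleList()
        self.heads = nn.ModuleList()
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            self.blocks.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True)))
            # per-scale single-channel probability map
            self.heads.append(nn.Sequential(nn.Conv2d(c_out, 1, 3, padding=1),
                                            nn.Sigmoid()))

    def forward(self, x):
        probs = []
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            probs.append(head(x))
        return probs  # list of probability maps, one per scale
```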
And 4, step 4: defining a loss function
(1) The generator loss is defined as shown in equation (1):
L_G = L_GAN-G + L_mse + L_cos + L_bf + L_feat  (1)
where L_GAN-G denotes the adversarial (GAN) loss, L_mse the mean square error loss, L_cos the cosine loss, L_bf the cross bilateral filtering loss, and L_feat the feature (perceptual) loss.
The adversarial loss L_GAN-G is given by equation (2):
L_GAN-G = Σ_i W_i · ‖ fake_output_i − ones ‖²  (2)
where W_i is the normalized weight of the i-th layer, i is the network layer index, fake_output_i is the probability map that the generated image is judged fake at that layer, and ones is an all-ones map (probability 1).
The mean square error loss L_mse is given by equation (3):
L_mse = Σ_i λ_i · ‖ fake_image_i − true_image_i ‖²  (3)
where fake_image_i is the image generated from the feature map of the i-th-from-last decoder layer, true_image_i is the label image rescaled to the corresponding resolution, and λ_i is the weight of the i-th scale.
The cosine loss L_cos is given by equation (4):
L_cos = Σ_i ( 1 − cos( fake_region_i , true_region_i ) )  (4)
where fake_region_i is the i-th block region of the generated image and true_region_i is the corresponding block region of the label image.
The cross bilateral filtering loss L_bf is given by equations (5) to (7):
L_bf = Σ_{I ∈ {A, S}} ‖ bf(I, C) − I ‖  (5)
J_p = (1 / W_p) · Σ_{q ∈ N(p)} G_σs(‖ p − q ‖) · G_σr(| C_p − C_q |) · I_q  (6)
W_p = Σ_{q ∈ N(p)} G_σs(‖ p − q ‖) · G_σr(| C_p − C_q |)  (7)
where L_bf denotes the cross bilateral filtering loss, bf the cross bilateral filter, C the label image, and {A, S} the reflection map and the illumination map respectively; J_p is the output of the bilateral filter at pixel p, I_q is the value of the filtered map at the neighbouring pixel q, C_p is the value of the p-th pixel of the label image and C_q the value of its neighbouring pixel q, N(p) is the set of neighbouring pixel positions of the p-th pixel, W_p is the normalizing weight, G_σs is the spatial Gaussian kernel, and G_σr is the range Gaussian kernel.
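A naive Python sketch of a cross bilateral filter in the sense of equations (6)-(7) is given below; the window radius, sigma values and the L1 reading of equation (5) in `bilateral_loss` are assumptions made for illustration.

```python
import math
import torch
import torch.nn.functional as F

def cross_bilateral_filter(image, guide, radius=2, sigma_s=2.0, sigma_r=0.1):
    """Naive cross bilateral filter: smooth `image` with spatial weights and with
    range weights taken from `guide` (the label image C).  Tensors are (B, C, H, W),
    values assumed in [0, 1].  Illustrative only."""
    b, c, h, w = image.shape
    pad_img = F.pad(image, [radius] * 4, mode="replicate")
    pad_gd = F.pad(guide, [radius] * 4, mode="replicate")
    num = torch.zeros_like(image)
    den = torch.zeros_like(image)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted_img = pad_img[:, :, radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            shifted_gd = pad_gd[:, :, radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            w_s = math.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))   # spatial Gaussian
            w_r = torch.exp(-((guide - shifted_gd) ** 2) / (2 * sigma_r ** 2))  # range Gaussian
            weight = w_s * w_r
            num = num + weight * shifted_img
            den = den + weight
    return num / den

def bilateral_loss(pred, label):
    """L_bf as an L1 distance between the prediction and its cross-bilateral-filtered
    version guided by the label image (one plausible reading of Eq. (5))."""
    return torch.mean(torch.abs(pred - cross_bilateral_filter(pred, label)))
```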
The feature loss L_feat is given by equation (8):
L_feat = Σ_l ( 1 / (F_l · H_l · W_l) ) · ‖ φ_l(fake_image) − φ_l(true_image) ‖²  (8)
where l denotes the l-th layer of the VGG network, F_l is the number of channels of the l-th layer feature map, H_l its height, W_l its width, and φ_l(·) is the feature activation of the l-th layer.
(2) The discriminator loss is defined as shown in equation (9):
L_D = (1/n) · Σ_i ‖ y_i − f(x_i) ‖₁  (9)
where n is the number of samples, y_i denotes the ground-truth (label) image, f(x_i) the corresponding generated image, and the distance is the L1 loss.
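The sketch below assembles the generator-side terms of equation (1) and a per-scale discriminator objective in Python. Since the exact forms of equations (2), (3), (4) and (9) are reproduced as images in the original filing, the concrete loss formulas here are assumptions consistent only with the variable descriptions and the weights given in the embodiment; `feat_extractor` and `bf_loss_fn` are caller-supplied hooks.

```python
import torch
import torch.nn.functional as F

def adversarial_g_loss(fake_probs, layer_weights=(4, 1, 1, 4)):
    """Assumed L_GAN-G: push every per-scale probability map of the fake image
    towards 1, with layer weights (4, 1, 1, 4) as in the embodiment."""
    w = torch.tensor(layer_weights, dtype=torch.float32)
    w = w / w.sum()
    return sum(wi * F.mse_loss(p, torch.ones_like(p)) for wi, p in zip(w, fake_probs))

def multiscale_mse(decoder_outputs, label, scale_weights=(1.0, 0.8, 0.6)):
    """Assumed L_mse: full-, half- and quarter-resolution outputs against a rescaled label."""
    loss = 0.0
    for w, out in zip(scale_weights, decoder_outputs):
        target = F.interpolate(label, size=out.shape[-2:], mode="bilinear", align_corners=False)
        loss = loss + w * F.mse_loss(out, target)
    return loss

def block_cosine_loss(pred, label, blocks=2):
    """Assumed L_cos: split the image into blocks x blocks regions and keep each
    region's cosine similarity with the corresponding label region close to 1."""
    loss = 0.0
    for pr, lr in zip(pred.chunk(blocks, dim=2), label.chunk(blocks, dim=2)):
        for p, l in zip(pr.chunk(blocks, dim=3), lr.chunk(blocks, dim=3)):
            loss = loss + (1 - F.cosine_similarity(p.flatten(1), l.flatten(1)).mean())
    return loss / (blocks * blocks)

def generator_loss(fake_probs, decoder_outputs, pred, label, feat_extractor, bf_loss_fn=None):
    """L_G = L_GAN-G + L_mse + L_cos + L_bf + L_feat (Eq. (1)); `feat_extractor`
    returns a list of feature maps (e.g. from a frozen VGG) for L_feat, and
    `bf_loss_fn` is the cross-bilateral-filter loss sketched after Eq. (7)."""
    l_feat = sum(F.mse_loss(f, g) for f, g in zip(feat_extractor(pred), feat_extractor(label)))
    l_bf = bf_loss_fn(pred, label) if bf_loss_fn is not None else 0.0
    return (adversarial_g_loss(fake_probs) + multiscale_mse(decoder_outputs, label)
            + block_cosine_loss(pred, label) + l_bf + l_feat)

def discriminator_loss(real_probs, fake_probs, layer_weights=(4, 1, 1, 4)):
    """Assumed per-scale discriminator objective: real maps towards 1, fake maps towards 0."""
    w = torch.tensor(layer_weights, dtype=torch.float32)
    w = w / w.sum()
    return sum(wi * (F.mse_loss(r, torch.ones_like(r)) + F.mse_loss(f, torch.zeros_like(f)))
               for wi, r, f in zip(w, real_probs, fake_probs))
```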
And 4, step 4: network training
Respectively training the combination of the reflection map generator and the discriminator and the combination of the illumination map generator and the discriminator in the step 2 by using the training image sample library constructed in the step 1, updating network parameters by adopting an Adam optimization method, and stopping training when the loss function value defined in the step 3 is minimum to obtain a final reflection map generator and an illumination map generator;
and 5: and (3) respectively inputting the original image to be processed into the reflection map generator or the illumination map generator obtained in the step (5), wherein the output image is the reflection map or the illumination map obtained by decomposing the original image.
Advantageous effects:
The invention adopts an intrinsic image decomposition method based on a bilateral feature pyramid and multi-scale identification. Addressing the lack of communication and guidance between features of different frequencies in existing methods, it innovatively introduces a bilateral feature pyramid network, so that the reflection map and the illumination map each obtain feature information that benefits the reconstruction of the other.
Description of the drawings:
FIG. 1 is a schematic view of a reflection chart and a light chart
FIG. 2 is a schematic diagram of an eigen-decomposition network based on bilateral feature pyramid and multi-scale identification
FIG. 3 is a schematic diagram of the BIFPN-A network structure
FIG. 4 is a schematic diagram of BIFPN-B network structure
FIG. 5 is a schematic diagram of a network structure of a discriminator
FIG. 6 is a schematic diagram of an MPI-Sintel dataset
FIG. 7 is a diagram of a ShapeNet eigen image
FIG. 8 is a schematic diagram showing comparison of test results of different module combinations
FIG. 9 is a schematic diagram of a first BIFPN module network structure
FIG. 10 is a schematic diagram of a second BIFPN module network structure
FIG. 11 is a diagram illustrating the decomposition and comparison of intrinsic images of the present invention with other methods under image segmentation
FIG. 12 is a schematic diagram illustrating the decomposition and comparison of intrinsic images according to the present invention and other methods under scene segmentation
FIG. 13 is a schematic diagram illustrating comparison of local area effects under scene segmentation
FIG. 14 is a graph illustrating comparison of effects of partial test data of ShapeNet data set
Detailed description of embodiments:
the present invention will be further described with reference to the following examples.
Example 1:
step 1: constructing a training image sample library
The invention uses the MPI image data set, which is based on complex scenes, and the ShapeNet image data set, which is based on single synthetic objects. The MPI data set contains 9 major categories of commonly used scenes, two subclasses under each major category, and 50 pictures in each subclass. Fig. 6 shows part of the MPI-Sintel data set. Two split modes are used when constructing the training image sample library: an image-split mode and a scene-split mode.
In the image-split mode, half of the images (25) are taken from each of the 18 subclasses of the data set; each image is 1024×436 in size, 10 small images of size 256×256 are randomly sampled from it, and these are then flipped, so that each image yields 20 small images. The training data set therefore contains 9000 (18×25×20) small images of size 256×256, and the test data set uses 450 (18×25) large images of size 1024×436.
In the scene-split mode, one subclass of each major category is taken for training and the other for testing; the two defective subclasses, "bandage_1" and "shaman_3", are removed, and small images are acquired in the same way as in the image-split mode, giving 9000 (9×50×20) small images of size 256×256 for training and 350 (7×50) large images of size 1024×436 for testing.
The ShapeNet data set is a large-scale data set of 3D shapes. It is a computer-synthesized data set in which each image provides a reflectance map, an illumination map, a surface normal, a depth map and the scene illumination conditions. The images of the ShapeNet data set are fully aligned, and for the intrinsic image decomposition task only the illumination and reflection maps are needed. ShapeNet contains more than 3 million images in more than 3000 categories, each category containing data for different objects, different viewing angles and different illuminations. Fig. 7 shows intrinsic images from the ShapeNet data set.
Step 2: Build the generator networks
Following Figs. 3 and 4, the reflection map generator and the illumination map generator are constructed by the method of Step 2. The encoder of the reflection map generator's U-Net uses a convolutional layer, a batch normalization layer and a LeakyReLU activation layer as its downsampling block; the convolution stride is 2, so the feature map size is halved after every convolution. The output of each activation layer in the encoder is passed through a skip connection into the frequency decomposition sub-module, and the encoder channel sequence is [3, 32, 64, 128, 256]. The output of the frequency decomposition sub-module is fed to the decoder, whose convolutional layers have stride 1.
The channel compression sub-module of the illumination map generator is implemented with a convolutional layer of stride 1, which leaves the feature map size unchanged and compresses the number of channels in different proportions: the high-frequency components of the illumination map are few, so a large compression ratio is used, while the low-frequency components are many and a small compression ratio is used.
And step 3: constructing a network of discriminators
The discriminator is a four-layer convolutional neural network, the convolutional layer has the step size of 2, the channel variation of each convolutional layer is reduced by half after passing through one convolutional layer, the channel variation of the four convolutional layers is respectively 3 to 64, 64 to 128, 128 to 256 and 256 to 512, and the output of each convolutional layer is compressed into a single-channel characteristic probability map. When the discriminator determines true, all the single-channel feature probability maps are close to 1, and when the discriminator determines false, the single-channel feature probability map is close to 0.
And 4, step 4: constructing a loss function
According to the equations (1) - (9), the invention calculates the generator loss function, and the weights of the first layer and the last layer of the generator network in the inherent loss are set to be 4, and the weights of the middle two layers are set to be 1.
When calculating the mean square error, the invention takes the characteristic diagram of the reciprocal 3 layers of the decoder to respectively generate complete, half and quarter original diagrams, and constrains three different scales, and the weights of the three scales are 1, 0.8 and 0.6 respectively.
In order to better maintain the edge characteristics when the cosine loss is calculated, the edges of the generated image and the label image are kept consistent, the input image is divided into 4 blocks, and the cosine similarity of each block and the corresponding label block is ensured to be consistent.
When calculating the discriminator loss function, the weights of the first layer and the last layer are 4, and the weights of the two middle layers are 1.
And 5: network training
And training by using samples of a training image sample library, respectively using different generators and discriminators for a reflection map and an illumination map, and separately training by using network models of the illumination map and the reflection map which are consistent. The network is optimized by adopting an Adam optimization method, different Adam optimizers are needed for a generator and a discriminator, optimizer parameters beta are set to be (0.5,0.999), the learning rate is 0.0005, weight _ decay is 0.0001, and the batch size is 20. The generator and the discriminator employ alternating training (TTUR), the number of trains of the discriminator being 5 to 1 compared to the number of trains of the generator.
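A minimal training-loop sketch with this optimizer configuration is given below; the loop structure, loss callables and data-loader interface are illustrative assumptions. The same routine would be run twice, once for the reflection pair and once for the illumination pair, with a DataLoader of batch size 20.

```python
import torch

def train_pair(generator, discriminator, loader, g_loss_fn, d_loss_fn,
               epochs=100, d_steps=5, device="cuda"):
    """Alternating adversarial training as in Step 5: separate Adam optimizers with
    betas (0.5, 0.999), lr 5e-4, weight decay 1e-4, and 5 discriminator updates per
    generator update.  `g_loss_fn(fake, label, d_out)` and `d_loss_fn(real_out, fake_out)`
    wrap the loss sketches given earlier."""
    g_opt = torch.optim.Adam(generator.parameters(), lr=5e-4,
                             betas=(0.5, 0.999), weight_decay=1e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=5e-4,
                             betas=(0.5, 0.999), weight_decay=1e-4)
    step = 0
    for _ in range(epochs):
        for inputs, labels in loader:          # labels: reflection or illumination ground truth
            inputs, labels = inputs.to(device), labels.to(device)
            # discriminator update
            with torch.no_grad():
                fake = generator(inputs)
            d_loss = d_loss_fn(discriminator(labels), discriminator(fake))
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()
            step += 1
            # generator update once every `d_steps` discriminator updates
            if step % d_steps == 0:
                fake = generator(inputs)
                g_loss = g_loss_fn(fake, labels, discriminator(fake))
                g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```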
Step 6: results and analysis of the experiments
In order to comprehensively evaluate the effect of the algorithm provided by the invention, the effect of the jump connection layer bilateral feature pyramid module is analyzed, and then comparison and evaluation are carried out on an MPI-Sintel essential image data set through two aspects of visualization effect and quantitative index.
(1) Generator and discriminator module evaluation
To gauge the effectiveness of the modules in the generator and discriminator, the invention evaluates them on the complex-scene MPI-Sintel data set. Taking the image-split mode as an example, a group of comparison experiments is designed to evaluate the skip-connection-layer bilateral feature pyramid module (BIFPN) of the generator and the adaptive combination module (AC) of the discriminator.
(a) Without BIFPN: removing the bilateral characteristic pyramid network of the generator, and directly transmitting the characteristics of the encoder end into the decoder end;
(b) without AC: removing the multi-scale self-adaptive combination module of the discriminator;
(c) a With All: a bilateral feature pyramid module and an adaptive combination module are used simultaneously.
Under the condition that other variables are unchanged, the method trains the three networks, and finally, the result indexes on the MPI-Sintel data set are shown in the table 1.
Table 1: Comparison of generator and discriminator modules (best results in bold)
Table 1 shows that the results using both the skip-connection-layer bilateral feature pyramid module (BIFPN) and the adaptive combination module (AC) are better on all three criteria than those missing either module. The structural similarity improves the most, so the predicted image is closer to the label; the mean square error shows that higher accuracy is kept over the whole pixel area, while the local mean square error improves less. Fig. 8 shows partial results of the comparison experiments for the three networks on the MPI-Sintel image-split data set. There are four columns in the figure: the first column is the original image, and the second, third and fourth columns correspond to the three experiments of Table 1. The four images in the second column show that the network without the bilateral feature pyramid module produces worse results than the fourth column: the reflection maps in the first and second rows of the second column still contain some shadows, because without the bilateral feature pyramid module the illumination information is not completely removed, and the illumination maps in the third and fourth rows still contain some texture information, showing that the high-frequency feature information is not completely removed. The four images in the third column show that the network without the multi-scale adaptive combination module also produces worse results than the fourth column; obviously, the four corner areas of the third and fourth images in the third column still retain some pixels of the original image. The results using the multi-scale adaptive combination module are better in both global information recovery and local detail recovery.
Both the indexes and the visualization results show that introducing the bilateral feature pyramid module into the skip connection layer and using the multi-scale adaptive combination module in the discriminator clearly benefit intrinsic image decomposition, markedly improving the validity and consistency of the images and reconstructing local details better. Having established the roles of the bilateral feature pyramid module and the multi-scale adaptive combination module, the internal structure of the bilateral feature pyramid module is explored further. To this end, the original bilateral feature pyramid module is modified in search of the best decomposition network: several different bilateral feature pyramid modules are proposed, trained, and their decomposition results evaluated.
The first network is shown in Fig. 9. It is consistent with the classical bilateral feature pyramid network and is likewise a five-layer network: the lowest-frequency feature is repeatedly upsampled and fused with the next-higher-frequency feature to obtain the intermediate features, and then, starting from the highest-frequency feature, repeated downsampling and fusion with the intermediate features yields the output feature information. The sub-network block is stacked in series to obtain the final network structure.
The second network is shown in Fig. 10. It is exactly the opposite of the first structure: the highest-frequency feature is repeatedly downsampled and fused with the next-lower-frequency feature to obtain the intermediate features, and then, starting from the lowest-frequency feature, repeated upsampling and fusion with the intermediate features yields the output feature information. The final network structure is likewise composed of 3 repeated sub-network blocks.
The third and fourth networks are the networks finally adopted by the invention, i.e., Figs. 3 and 4. These two networks are evolutions of the first two. The proportions of high- and low-frequency feature components in the reflection map and the illumination map differ: the high-frequency features dominate the reflection map, while the low-frequency features dominate the illumination map. Therefore, if the high-frequency and low-frequency features can be separated appropriately, the decomposition effect is improved to some extent. The third network thus modifies the first network by treating the two high-frequency paths as one group and the three low-frequency paths as another; the two groups generate their intermediate features independently, and the subsequent downsampling operation joins the two groups. Similarly, the fourth network modifies the second network: it is likewise divided into high-frequency and low-frequency channel groups that generate their intermediate features separately, and the output features are finally connected by upsampling.
Structure 1: the reflection map end uses network one + the illumination map end uses network one with channel compression;
Structure 2: the reflection map end uses network two + the illumination map end uses network two with channel compression;
Structure 3: the reflection map end uses network one + the illumination map end uses network two with channel compression;
Structure 4: the reflection map end uses network three + the illumination map end uses network four with channel compression.
With the other variables unchanged, the 4 network combinations are trained on the MPI-Sintel image-split data set, and the resulting indexes are shown in Table 2.
Table 2: Comparative experiment results of different skip connection modules
The table shows that structure 1 has a lower mean square error for the reflection map than structure 2, while structure 2 has a lower mean square error for the illumination map than structure 1. The reflection map needs a large amount of high-frequency feature information, and structure 1 has exactly the path that downsamples the high-frequency features towards the low-frequency features, so the reflection map end is guided by the high-frequency components when reconstructing the low-frequency components and generates a better reflection map. In the same way, the result of structure 2 is just the opposite: it has the path that upsamples the low-frequency features towards the high-frequency features, so the illumination map end is guided by the low-frequency components when reconstructing the high-frequency components, the generated illumination map contains enough low-frequency feature information, and a better illumination map is obtained. Would combining the reflection-map-end structure of structure 1 with the illumination-map-end structure of structure 2 then not be better than structures 1 and 2? Based on this hypothesis, structure 3 was designed and tested, and the experimental results agree with the guess: the three indexes of structure 3 are all better than the first two combinations. Although structure 3 produces better results than structures 1 and 2, its network structure is overly complex, so further simplifying the network while preserving the effect became the subject of the next study. Note that the essence of the intrinsic image decomposition problem is to distinguish the high-frequency features of the image from the low-frequency features; directly connecting all the scale channels, as in the first three networks, does not agree well with this. Therefore, the skip connection modules of the reflection map end and the illumination map end are modified so that all channels are simply divided into a high-frequency group and a medium-and-low-frequency group. As can be seen from Figs. 3 and 4, the high-frequency group is the lower two channels and the medium-and-low-frequency group is the upper three channels. The intermediate features are computed separately within the two groups, and the two groups are connected only when the output features of each stacked block are computed. Experiments prove that this simplifies the network structure, preserves the decomposition effect, and agrees with the theory.
(2) Loss function quantitative analysis
In order to evaluate the influence of different loss functions on the final result, the invention still uses the MPI-Sintel data set to train and test, adopts completely same parameters, fixes random seeds, removes the selected loss function, keeps other loss functions unchanged, and designs a plurality of groups of comparison experiments for evaluation.
(a) Without VGG: removing the VGG perceptual loss;
(b) Without Multi-Scale: removing the multi-scale loss;
(c) Without Bf: removing the cross bilateral filtering loss;
(d) without Cos: removing local cosine loss;
the invention performs experiments on the data sets based on image segmentation and scene segmentation, and the specific quantitative indexes are shown in tables 3 and 4.
Table 3: Comparison of loss function quantization indexes under the image split
Table 4: Comparison of loss function quantization indexes under the scene split
The results in the tables show that different loss functions contribute differently to the intrinsic image decomposition. In addition, the image-split and scene-split data sets also influence the decomposition results. The VGG perceptual loss contributes the most to the final result, because it acts in feature space and can strongly constrain features of different scales. The Cos loss contributes the least on the image-split data set but is second only to the VGG perceptual loss on the scene-split data set, showing that Cos plays a greater role in generalized scenes. The multi-scale loss contributes under both the image split and the scene split. Finally, the cross bilateral filtering loss is also helpful under the image split, but has a negative influence on the structural similarity index of the illumination map under the scene split, possibly because the generalization of its adaptive constraint is weak.
(3) Analysis of the MPI-Sintel experimental results under the image split
The proposed network is compared with recent work. To ensure fairness, the same data set and test protocol as Fan et al. are used. The quantization indexes of the previous methods are compared in Table 5. In terms of quantization indexes, the proposed method is better than the other methods on the mean square error, but worse on the local mean square error and on the structural similarity of the illumination map.
Table 5: Comparison of quantization indexes of each method under the image split
In terms of visualization results, Fig. 11 shows the intrinsic image decompositions of the invention and of the other methods on the image-split data set. The Barron method, based on hand-designed features, generates an overly smooth illumination map in which the high-frequency features are almost lost, and the shadows in its reflection map are not completely removed. The Chen method, also based on manual features, shows a large deviation in colour; its reflection map is likewise overly smooth, and much of the high-frequency detail is resolved into the illumination map. The methods in rows 4 and 5 of the figure are based on deep learning, and they are all better than the methods based on manual features. The images decomposed by the MSCR method contain many blurred pixel blocks, their smoothness and consistency are poor, and local details are restored badly. Fan's method produces a reflection map that excels in both smoothness and consistency but leaves room for improvement in the restoration of some local details. The proposed method is better than Fan's in image smoothness, texture consistency and detail recovery: for example, the hair of the person in the first column and the texture features behind the person in the third column are closer to the label, where Fan's method does less well. Although the invention is superior to the other methods in quantization indexes and visualization results, there is still some distance from the label image; for example, in the sixth row of the second column, the hair details in the illumination map generated by the invention still differ from the label, which is consistent with the results in the table and shows that the illumination map end does not completely remove high-frequency information such as texture features and still needs improvement.
(4) Analysis of the MPI-Sintel experimental results under the scene split
The MPI-Sintel decomposition task under the scene split is a very difficult challenge, because the scenes in the test set and the training set are completely different, which tests the generalization capability of the proposed method. To ensure fairness, the same data set and test protocol as Fan et al. are again used. The final quantitative indexes of the experiment are shown in Table 6; the invention performs better in all respects than the previous methods.
Table 6: Comparison of quantization indexes of each method under the scene split
In terms of visualization results, Fig. 12 shows the intrinsic image decompositions of the other deep learning methods and of the invention on the scene-split data set. The invention is much better in overall detail and colour recovery than the previous approaches. Fig. 13 shows the three pixel regions of the test images with the most distinct differences. The first column shows that the figures recovered by the MSCR method are very blurred and the details very unclear, while Fan's method has problems with colour recovery and the skin and clothing are not smooth enough; the invention recovers the details of skin and clothing well. The canvas in the second column shows that MSCR is still blurry, with acceptable colour but completely lost detail; Fan's method has better detail but poor colour recovery, with other grey features mixed into the canvas, whereas the canvas recovered by the invention has normal colour and very clear details around it. The wall in the third column is similar to the previous two: the invention recovers the edges better and the colour well. Combining the visualization effect and the quantitative indexes, the invention is superior to all the compared algorithms.
(5) ShapeNet data set experimental result analysis
To verify the universality of the proposed method on different data sets, the invention is also trained and tested on the ShapeNet data set. A subset of ShapeNet containing 100k images is selected, the ratio of training set to test set is 9:1, and the method and parameters are exactly the same as for the MPI-Sintel data set. The specific quantitative indexes are shown in Table 7, and the visual results on ShapeNet are shown in Fig. 14. The proposed method obtains the best results on the ShapeNet data set with a large margin of improvement, and the visual results are very close to the labels, proving that the method performs well not only on complex scenes but also on a data set based on single objects.
Table 7: Comparison of quantization indexes of each method on the ShapeNet data set

Claims (1)

1. An intrinsic image decomposition method based on a bilateral feature pyramid and multi-scale identification, wherein the network structure used by the method is mainly divided into 2 parts: a generator network and a discriminator network.
(1) Generator network
The generator network of the invention comprises a reflection map generator and an illumination map generator, which differ only in the structure of their skip connection channels. The backbone of each generator takes the U-Net structure as a template; the codec has 5 layers, and the channels of the corresponding convolutional layers are 16, 32, 64, 128 and 256 in turn. For the skip connection channels, a vertically symmetric structure is adopted: the reflection map generator computes the intermediate features from low frequency to high frequency and then computes the output features from high frequency to low frequency by combining the intermediate features, while the illumination map generator computes the intermediate features from high frequency to low frequency and then computes the output features from low frequency to high frequency by combining the intermediate features. The loss function of the generator network is shown in equation (1).
L_G = L_GAN-G + L_mse + L_cos + L_bf + L_feat  (1)
where L_GAN-G denotes the adversarial (GAN) loss, L_mse the mean square error loss, L_cos the cosine loss, L_bf the cross bilateral filtering loss, and L_feat the feature (perceptual) loss.
The adversarial loss L_GAN-G is given by equation (2):
L_GAN-G = Σ_i W_i · ‖ fake_output_i − ones ‖²  (2)
where W_i is the normalized weight of the i-th layer, i is the network layer index, fake_output_i is the probability map that the generated image is judged fake at that layer, and ones is an all-ones map (probability 1).
The mean square error loss L_mse is given by equation (3):
L_mse = Σ_i λ_i · ‖ fake_image_i − true_image_i ‖²  (3)
where fake_image_i is the image generated from the feature map of the i-th-from-last decoder layer, true_image_i is the label image rescaled to the corresponding resolution, and λ_i is the weight of the i-th scale.
The cosine loss L_cos is given by equation (4):
L_cos = Σ_i ( 1 − cos( fake_region_i , true_region_i ) )  (4)
where fake_region_i is the i-th block region of the generated image and true_region_i is the corresponding block region of the label image.
The cross bilateral filtering loss L_bf is given by equations (5) to (7):
L_bf = Σ_{I ∈ {A, S}} ‖ bf(I, C) − I ‖  (5)
J_p = (1 / W_p) · Σ_{q ∈ N(p)} G_σs(‖ p − q ‖) · G_σr(| C_p − C_q |) · I_q  (6)
W_p = Σ_{q ∈ N(p)} G_σs(‖ p − q ‖) · G_σr(| C_p − C_q |)  (7)
where L_bf denotes the cross bilateral filtering loss, bf the cross bilateral filter, C the label image, and {A, S} the reflection map and the illumination map respectively; J_p is the output of the bilateral filter at pixel p, I_q is the value of the filtered map at the neighbouring pixel q, C_p is the value of the p-th pixel of the label image and C_q the value of its neighbouring pixel q, N(p) is the set of neighbouring pixel positions of the p-th pixel, W_p is the normalizing weight, G_σs is the spatial Gaussian kernel, and G_σr is the range Gaussian kernel.
Lfeatthe calculation formula (c) is shown in formula (8):
Figure FDA0003559983160000019
where l denotes the l-th layer of the VGG network, FlNumber of channels, H, representing layer I characteristic diagramlDenotes the height of the ith layer profile, W denotes the width of the ith layer profile,
Figure FDA00035599831600000110
representing a characteristic activation value of the l-th layer;
(2) discriminator network
The discriminator consists of four layers of convolutional neural networks; when the reflection map generator or the illumination map generator is trained, the reflection map or the illumination map output by the reflection map generator or the illumination map generator is input into the discriminator, the discriminator compares the input reflection map or the illumination map with the label image, and the probability that the reflection map or the illumination map is consistent with the label image is output;
The reflection map generator is trained in combination with a discriminator, and the illumination map generator is trained in combination with a discriminator. The generator networks are trained with the MIT and MPI data sets; the network parameters are updated with the SGD optimization method, and training stops when the loss function of equation (1) reaches its minimum, yielding the final trained networks. The trained networks can perform intrinsic decomposition on an input image to obtain the appropriate reflection map and illumination map.
The discriminator loss is defined as shown in equation (9):
L_D = (1/n) · Σ_i ‖ y_i − f(x_i) ‖₁  (9)
where n is the number of samples, y_i denotes the ground-truth (label) image, f(x_i) the corresponding generated image, and the distance is the L1 loss.
CN202210290919.3A 2022-03-23 2022-03-23 Essential image decomposition method research based on bilateral feature pyramid network and multi-scale identification Pending CN114612727A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210290919.3A CN114612727A (en) 2022-03-23 2022-03-23 Essential image decomposition method research based on bilateral feature pyramid network and multi-scale identification
CN202310136169.9A CN116188791A (en) 2022-03-23 2023-02-20 Intrinsic image decomposition method based on bilateral feature pyramid network and multi-scale identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210290919.3A CN114612727A (en) 2022-03-23 2022-03-23 Essential image decomposition method research based on bilateral feature pyramid network and multi-scale identification

Publications (1)

Publication Number Publication Date
CN114612727A true CN114612727A (en) 2022-06-10

Family

ID=81865489

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210290919.3A Pending CN114612727A (en) 2022-03-23 2022-03-23 Essential image decomposition method research based on bilateral feature pyramid network and multi-scale identification
CN202310136169.9A Pending CN116188791A (en) 2022-03-23 2023-02-20 Intrinsic image decomposition method based on bilateral feature pyramid network and multi-scale identification

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202310136169.9A Pending CN116188791A (en) 2022-03-23 2023-02-20 Intrinsic image decomposition method based on bilateral feature pyramid network and multi-scale identification

Country Status (1)

Country Link
CN (2) CN114612727A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152168A (en) * 2023-10-31 2023-12-01 山东科技大学 Medical image segmentation method based on frequency band decomposition and deep learning
CN117152168B (en) * 2023-10-31 2024-02-09 山东科技大学 Medical image segmentation method based on frequency band decomposition and deep learning

Also Published As

Publication number Publication date
CN116188791A (en) 2023-05-30


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20220610

WD01 Invention patent application deemed withdrawn after publication