CN108764250B - Method for extracting essential image by using convolutional neural network - Google Patents
Method for extracting essential image by using convolutional neural network
- Publication number
- CN108764250B (application CN201810407424.8A)
- Authority
- CN
- China
- Prior art keywords
- image
- network
- layer
- branch
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention provides a method for extracting intrinsic images with a convolutional neural network. First, a two-stream, image-to-image convolutional network with a parallel structure is constructed; the network is then trained on a purpose-built data set and its parameters are optimized, so that multi-layer features invariant to the environment are extracted and the intrinsic images (a reflectance map and an illumination map) are reconstructed directly. Because the two-stream convolutional network, built on deep-learning theory, has strong feature-extraction capability, the reflectance map and the illumination map can be separated directly from the original image. The model is a fully convolutional, image-to-image network with two branches, one generating the illumination map and the other the reflectance map; by combining higher-layer convolution results with the outputs of the deconvolution operations, the structure reduces the reconstruction errors of both maps and improves the network's feature-reconstruction capability.
Description
Technical Field
The invention belongs to the technical field of image processing, and in particular relates to a method for extracting intrinsic images using a convolutional neural network.
Background
Understanding and analyzing images is one of the most fundamental tasks in image processing. Because image formation is affected by many factors, including the characteristics of the target object, the shooting environment, and the acquisition equipment, interference such as shadows, color discontinuities, and changes in object pose must be fully accounted for during processing. These variable factors pose great challenges, and the performance of existing image-analysis algorithms degrades markedly in complex environments. Improving the robustness of image analysis under such conditions has therefore become a major research focus in recent years. In fact, if the intrinsic features of a scene could be recovered from the observed image, these problems could be addressed directly. Intrinsic features are properties inherent to the target object and independent of its surroundings, namely the object's reflectance characteristics (color, texture, and similar information) and its shape. Although these two intrinsic characteristics do not change with the environment, the acquired observation image of the object does. If the intrinsic characteristics of an object, such as its inherent shape, color, and texture, can be recovered directly from the observation image, the influence of environmental change is removed, the image can be understood more accurately, and a more reliable basis is provided for robust image analysis in complex environments.
Existing algorithms can be divided into three categories according to how they extract the intrinsic features of a target. The first category is implicit intrinsic-feature analysis: a pattern-recognition algorithm learns the multiple appearances of the target object (its appearance under different lighting conditions and in different poses). During learning, such algorithms do not explicitly model the internal connections among these appearances but perform pattern analysis directly on the observations, attempting to obtain the distribution function of the target in feature space. A serious problem for this category is the generalization of the learned description: the distribution of the training samples strongly determines the distribution function that is finally learned. If the training samples show the target only under a single illumination or pose, the result is hard to extend to images of the target under different illumination or in new poses, so with imperfect samples these algorithms generalize poorly to complex environments. The second category is explicit intrinsic-feature analysis, which infers the internal properties of the target from its appearance in different states. In contrast to the implicit approach, these algorithms analyze the reflectance characteristics and shape of the object directly, based on the physics of image formation and prior knowledge about reflectance and shape, so the image of the object in a new state can be computed directly from its inherent characteristics; the results are therefore more accurate and generalize better. However, such algorithms usually estimate the intrinsic features through constraints on structure, texture, color, and the like: following the Retinex framework, the signal-separation problem is converted into an energy-optimization problem and solved at a single scale. The accuracy of the result thus depends heavily on the optimizer; because the convexity of the objective function cannot be guaranteed, the optimization may become trapped in a local minimum, or the initialization must already lie close to the optimal solution. These issues limit the performance of this category. The third category extracts intrinsic images with deep neural networks, training a convolutional neural network to predict the intrinsic images directly from an RGB image. However, the network structures of existing algorithms of this kind are simple, and their training sets are artificial images synthesized with computer-graphics software, so the extracted intrinsic images are not very sharp, especially when the methods are applied to natural images.
Disclosure of Invention
To overcome the insufficient feature-extraction capability of existing implicit and explicit intrinsic-feature analysis algorithms, and the fact that existing deep-learning approaches target mainly artificial images, the invention provides a method for extracting intrinsic images with a convolutional neural network. First, a two-stream, image-to-image convolutional network with a parallel structure is constructed; the network is then trained and its parameters optimized, so that multi-layer features invariant to the environment are extracted and the intrinsic images (a reflectance map and an illumination map) are reconstructed directly. The multi-stream structure separates the two tasks, letting different streams extract different features, while each stream also acts as a constraint on the other, which improves the accuracy of the algorithm.
A method for extracting intrinsic images using a convolutional neural network, characterized by comprising the following steps:
step 1: and constructing a double-current convolutional neural network structure model with a parallel structure, wherein the network structure model is divided into a public branch and two special branches.
The common branch consists of 5 convolutional layers, each followed by a pooling layer. All convolution kernels are 3 × 3 and each layer outputs a feature image: the first convolutional layer outputs 64 channels, the second 128, the third 256, and the fourth and fifth 512 each. The pooling layers use 2 × 2 average pooling.
The two dedicated branches have the same structure, each comprising 3 deconvolution layers with 4 × 4 kernels and 256 output channels; one branch reconstructs the illumination map and the other the reflectance map.
The feature image output by the third convolutional layer of the common branch, together with the output of a dedicated branch's second deconvolution layer, forms the input of that branch's third deconvolution layer; likewise, the feature image output by the fourth convolutional layer of the common branch, together with the output of the branch's first deconvolution layer, forms the input of its second deconvolution layer.
Step 2: and constructing a training data set, intercepting an image with the size of 1280 multiplied by 1280 from the middle part of each image of the BOLD data set, and dividing the intercepted image into five equal parts on a row and a column respectively, so that 25 images with the size of 256 multiplied by 256 are obtained from each image in the original data set, 53720 groups of images are randomly extracted to form a test set, and the rest images form the training set.
Step 3: train the two-stream convolutional neural network constructed in step 1 with the training set obtained in step 2. The weights of all layers are first initialized randomly, and the network is then trained with supervised error back-propagation to obtain the trained network. The base learning rate is 10^-13 with a fixed learning-rate policy, the batch size is 5, and the loss function is SoftmaxWithLoss; the convergence condition is that the loss values of two successive iterations differ by no more than ±5%.
Step 4: process the test set obtained in step 2 with the trained network to obtain the extracted intrinsic images, namely an illumination map and a reflectance map.
The method was also tested on the public MIT Intrinsic Images data set, and the results show that it remains effective there.
The invention has the following beneficial effects. Because intrinsic-image extraction follows a deep-learning approach, the strong feature-extraction capability of the constructed neural network allows the reflectance map and the illumination map to be separated directly from the original image. The proposed two-stream convolutional neural network is a fully convolutional, image-to-image model with two branches, one generating the illumination map and the other the reflectance map. In addition, the network combines higher-layer convolution results with the outputs of the deconvolution operations, which enhances the detail of the deconvolved feature maps, reduces the reconstruction errors of the illumination and reflectance maps to a certain extent, and improves the network's feature-reconstruction capability.
Drawings
FIG. 1 is a flow chart of the method for extracting intrinsic images using a convolutional neural network according to the present invention
FIG. 2 is a diagram of the two-stream convolutional neural network structure constructed by the present invention
FIG. 3 shows example images from the data set constructed by the present invention
Detailed Description
The present invention is further described below with reference to the drawings and an embodiment; the invention includes, but is not limited to, this embodiment.
The invention provides a method for extracting intrinsic images using a convolutional neural network; as shown in FIG. 1, the main procedure is as follows:
1. Construction of the two-stream convolutional neural network model with a parallel structure
Reconstructing the intrinsic images amounts to assigning different weights to the features extracted from the image and combining features of the same type, so that the illumination map and the reflectance map are rebuilt from the original image. In other words, all required features are present in the same original image. The feature-extraction part can therefore be shared, while the reconstruction of the two different types of intrinsic image must be done separately. Accordingly, the network constructed by the invention is divided into a common branch and two dedicated branches. After the convolution operations of the common branch, the spatial size of the feature map output by each layer shrinks progressively. To keep the output image the same size as the input image, three deconvolution layers are placed on each dedicated branch, gradually restoring the feature map to its original spatial size. Inspired by the residual-network structure, the invention found experimentally that combining the later layers of the common branch with the later layers of the dedicated branches lets the network parameters reach a better optimum. For these reasons, the invention constructs the two-stream convolutional neural network with a parallel structure shown in FIG. 2. The model is divided into a common branch and two dedicated branches.
The common branch consists of 5 convolutional layers, each followed by a pooling layer. All convolution kernels are 3 × 3 and each layer outputs a feature image: the first convolutional layer outputs 64 channels, the second 128, the third 256, and the fourth and fifth 512 each; the pooling layers use 2 × 2 average pooling. The two dedicated branches have the same structure, each comprising 3 deconvolution layers with 4 × 4 kernels and 256 output channels; one branch reconstructs the illumination map and the other the reflectance map. The feature image output by the third convolutional layer of the common branch, together with the output of a dedicated branch's second deconvolution layer, forms the input of that branch's third deconvolution layer; likewise, the feature image output by the fourth convolutional layer of the common branch, together with the output of the branch's first deconvolution layer, forms the input of its second deconvolution layer.
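For concreteness, the following is a minimal sketch of this architecture in PyTorch. This is an assumption for illustration: the patent's own implementation used Caffe, the class and module names are invented here, and the deconvolution strides and paddings, which the patent does not specify, are chosen so that the stated skip connections line up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamIntrinsicNet(nn.Module):
    """Sketch of the described two-stream network: a shared 5-layer
    convolutional encoder and two identical 3-layer deconvolutional
    branches, with skip connections from conv3/conv4 into each branch."""

    def __init__(self, out_channels=3):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 512]  # channel widths stated in the patent
        self.convs = nn.ModuleList(
            nn.Conv2d(chans[i], chans[i + 1], 3, padding=1) for i in range(5)
        )
        self.pool = nn.AvgPool2d(2)  # 2 x 2 average pooling after every conv layer

        def branch():
            # 3 deconv layers per branch, 4 x 4 kernels, 256 output channels each;
            # stride 2 is an assumption (the patent gives kernel sizes only).
            return nn.ModuleList([
                nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1),
                nn.ConvTranspose2d(256 + 512, 256, 4, stride=2, padding=1),  # + conv4 skip
                nn.ConvTranspose2d(256 + 256, 256, 4, stride=2, padding=1),  # + conv3 skip
            ])

        self.illumination_branch = branch()
        self.reflectance_branch = branch()
        self.heads = nn.ModuleList(nn.Conv2d(256, out_channels, 1) for _ in range(2))

    def _decode(self, branch, feats):
        x = branch[0](feats[-1])  # first deconv on the encoder output
        for deconv, skip in zip(branch[1:], (feats[3], feats[2])):
            skip = F.interpolate(skip, size=x.shape[-2:])  # align sizes (assumption)
            x = deconv(torch.cat([x, skip], dim=1))  # fuse conv and deconv results
        return x

    def forward(self, img):
        feats, x = [], img
        for conv in self.convs:
            x = self.pool(F.relu(conv(x)))
            feats.append(x)
        maps = [self._decode(b, feats)
                for b in (self.illumination_branch, self.reflectance_branch)]
        # The patent restores the original spatial size; with the assumed stride-2
        # deconvolutions a final upsampling step is needed to get back to it.
        illum, refl = [F.interpolate(h(m), size=img.shape[-2:], mode='bilinear',
                                     align_corners=False)
                       for h, m in zip(self.heads, maps)]
        return illum, refl
```

With a 1 × 3 × 256 × 256 input, the encoder features shrink to 8 × 8, the two skip connections meet the deconvolution outputs at 16 × 16 and 32 × 32 as described, and the forward pass returns one illumination map and one reflectance map at the input resolution.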
2. Data set construction
The network structure proposed by the invention is relatively complex, and many network parameters must be trained. To let the network reach its best performance, the invention builds a data set for studying intrinsic-image extraction on the basis of the BOLD data set created by Jiang et al. (Jiang X Y, Schofield A J, Wyatt J L. Correlation-Based Intrinsic Image Extraction from a Single Image [C]. European Conference on Computer Vision, 2010: 58-71). The data set contains 268,600 groups of pictures, each group consisting of an original picture, an illumination map, and a reflectance map. 53,720 groups are drawn at random as the test set for evaluating the intrinsic-image extraction algorithm; the remaining 214,880 groups form the training set for the deep neural network. The BOLD database contains a large number of high-resolution image sets, all photographed under carefully tuned lighting conditions and mainly covering complex patterns, faces, and outdoor scenes; FIG. 3 shows example images from the data set. Jiang et al. built the database to provide a test platform for image-processing algorithms, chiefly intrinsic-image extraction, de-lighting, and light-source estimation. To this end they provide a map of the lighting conditions and a map of the object surface, that is, an illumination map and a reflectance map, both standard RGB color-space pictures with linear luminance characteristics. Weighing the number of pictures, their quality, and the complexity of the scenes, the invention finally selects the picture groups whose subject is a complex pattern. The original pictures are 1280 pixels wide and 1350 pixels high, which is too large for an ordinary computer and easily causes overfitting, both unfavorable for training a deep network. The selected image category has one very clear characteristic: the key information is concentrated in the middle of the image. The invention therefore places a 1280 × 1280 frame at the center of each image, crops it, and divides the crop into five equal parts along each of its rows and columns, so that one original image yields 25 smaller images of size 256 × 256. Cutting the original images in this way preserves their key information and makes maximal use of the data, while also providing several conveniences for this work: the data volume of each picture group is moderate, so the demands on computer performance are low; the image size is reasonable, which simplifies the design of the convolutional network; and the cut images contain positive and negative samples at the same time, which helps avoid overfitting to some extent.
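A minimal sketch of this crop-and-tile step is given below; the use of Pillow and the file-naming scheme are illustrative assumptions, not details from the patent.

```python
from pathlib import Path
from PIL import Image

def tile_bold_image(path: str, out_dir: str) -> None:
    """Center-crop a BOLD picture to 1280 x 1280, then split the crop into
    a 5 x 5 grid of 256 x 256 tiles, as described above."""
    img = Image.open(path)
    w, h = img.size  # BOLD originals: 1280 wide, 1350 high
    left, top = (w - 1280) // 2, (h - 1280) // 2
    crop = img.crop((left, top, left + 1280, top + 1280))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for row in range(5):
        for col in range(5):
            tile = crop.crop((col * 256, row * 256,
                              (col + 1) * 256, (row + 1) * 256))
            tile.save(out / f"{Path(path).stem}_r{row}c{col}.png")
```

The same function would be applied to an original picture, its illumination map, and its reflectance map so that the 25 tiles of each group stay aligned.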
3. Network training
In this embodiment, the constructed two-stream convolutional network is trained under the Caffe framework with the training set of the BOLD-based data set. Compared with other frameworks, Caffe is simple to install, supports all major operating systems, and has good interface support for Python and Matlab. Because the constructed network structure is relatively complex, the amount of data to learn is large, and many iterations are needed, and also to prevent the optimal solution from being skipped when learning proceeds too fast, repeated experiments led to a base learning rate of 10^-13, with the learning-rate policy set to "fixed", i.e., a constant learning rate. Considering computer performance, and again to avoid overly fast convergence, the batch size of the network is set to 5 and the loss function to SoftmaxWithLoss.
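Read as an equivalent PyTorch setup, the stated solver settings would look roughly as follows; this mapping is an assumption for illustration (the patent's actual configuration was a Caffe solver), and `TwoStreamIntrinsicNet` refers to the architecture sketch above.

```python
import torch

# Assumed PyTorch counterpart of the Caffe settings stated above:
# base_lr = 1e-13, lr_policy = "fixed", batch size 5, SoftmaxWithLoss.
model = TwoStreamIntrinsicNet()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-13)  # fixed rate: no scheduler
batch_size = 5
criterion = torch.nn.CrossEntropyLoss()  # PyTorch analogue of Caffe's SoftmaxWithLoss
```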
The loss function measures the difference between the network output and the ground-truth label; as the number of iterations grows, the network loss decreases, i.e., the estimate approaches the label. In a single dimension the loss function can be written as

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=0}^{255} 1\{y_i = j\}\,\log\frac{e^{\theta_j^{\top}x_i}}{\sum_{l=0}^{255} e^{\theta_l^{\top}x_i}}$$

where {(x_1, y_1), ..., (x_m, y_m)} denotes m groups of labeled training data, x denotes the input, y denotes the corresponding label with y ∈ [0, 255], 1{F} is the indicator function (1 when F is true and 0 when F is false), and θ denotes the parameters of the convolutional neural network. During training, the error between the estimated image and the ground-truth label is back-propagated through the network to optimize these parameters, so that the error gradually decreases. For an RGB image, the total loss is the sum of this loss over the R, G, and B dimensions of the image.
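Under one reading of this formulation (an assumption: the output layer yields 256 class scores per color channel per pixel, matching SoftmaxWithLoss over intensity classes), the summed RGB loss could be computed as follows:

```python
import torch
import torch.nn.functional as F

def rgb_softmax_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Softmax loss summed over the R, G, B dimensions, following the
    formula above. `logits` has shape (N, 3, 256, H, W): 256 class scores
    per channel per pixel (an assumed layout). `target` holds integer
    intensities in [0, 255] with shape (N, 3, H, W)."""
    loss = logits.new_zeros(())
    for c in range(3):  # sum the single-dimension losses over R, G, B
        loss = loss + F.cross_entropy(logits[:, c], target[:, c].long())
    return loss
```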
After 210,000 iterations, successive loss values fluctuate within ±5% of each other and the network has converged: the parameters are close to optimal for the current structure, the network's ability to extract intrinsic images is close to its best, and the trained network is obtained. Although the two dedicated branches look identical in structure, the parameters they learn during training differ because they are supervised with different ground-truth labels. When the network is later used to extract intrinsic images, the two branches therefore operate on the data differently, so each branch extracts its own type of intrinsic image.
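The stopping rule stated above can be read as the following check (a direct, assumed interpretation of the ±5% criterion):

```python
def has_converged(prev_loss: float, curr_loss: float, tol: float = 0.05) -> bool:
    """True when the loss values of two successive iterations differ by no
    more than +/-5%, the convergence condition stated above."""
    return abs(curr_loss - prev_loss) <= tol * abs(prev_loss)
```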
4. Extracting intrinsic images with the trained network
The trained network is used to process the test set of the data set built in section 2: each RGB picture in the test set is converted into a three-dimensional array and fed to the network as input, and after the network's multi-layer operations the extracted intrinsic images, namely an illumination map and a reflectance map, are obtained. The method was also tested on the public MIT Intrinsic Images data set, and the results show that it remains effective there.
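A minimal inference sketch, assuming the PyTorch model from the architecture sketch above; the tensor layout and the [0, 1] normalization are illustrative choices, not details from the patent:

```python
import numpy as np
import torch
from PIL import Image

def extract_intrinsics(model, image_path: str):
    """Run a trained network on one RGB test picture: convert it to a
    3-D array, forward it through the network, and return the predicted
    illumination and reflectance maps as images."""
    rgb = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32) / 255.0
    x = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)  # HWC -> NCHW
    model.eval()
    with torch.no_grad():
        illum, refl = model(x)

    def to_image(t):
        arr = t.squeeze(0).permute(1, 2, 0).clamp(0, 1).numpy()
        return Image.fromarray((arr * 255).astype(np.uint8))

    return to_image(illum), to_image(refl)
```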
Claims (1)
1. A method for extracting intrinsic images using a convolutional neural network, characterized by comprising the following steps:
Step 1: construct a two-stream convolutional neural network model with a parallel structure, the network model being divided into a common branch and two dedicated branches;
wherein the common branch consists of 5 convolutional layers, each followed by a pooling layer; all convolution kernels are 3 × 3 and each layer outputs a feature image: the first convolutional layer outputs 64 channels, the second 128, the third 256, and the fourth and fifth 512 each; the pooling layers use 2 × 2 average pooling;
the two dedicated branches have the same structure, each comprising 3 deconvolution layers with 4 × 4 kernels and 256 output channels; one branch reconstructs the illumination map and the other the reflectance map;
the feature image output by the third convolutional layer of the common branch, together with the output of a dedicated branch's second deconvolution layer, forms the input of that branch's third deconvolution layer; the feature image output by the fourth convolutional layer of the common branch, together with the output of the branch's first deconvolution layer, forms the input of its second deconvolution layer;
Step 2: construct the training data set: crop a 1280 × 1280 region from the center of each image of the BOLD data set and divide it into five equal parts along each of its rows and columns, so that each image in the original data set yields 25 images of size 256 × 256; draw 53,720 groups of images at random to form the test set, the remaining images forming the training set;
Step 3: train the two-stream convolutional neural network constructed in step 1 with the training set obtained in step 2: first initialize the weights of all layers randomly, then train the network with supervised error back-propagation to obtain the trained network; the base learning rate is 10^-13 with a fixed learning-rate policy, the batch size is 5, the loss function is SoftmaxWithLoss, and the convergence condition is that the loss values of two successive iterations differ by no more than ±5%;
Step 4: process the test set obtained in step 2 with the trained network to obtain the extracted intrinsic images, namely an illumination map and a reflectance map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810407424.8A CN108764250B (en) | 2018-05-02 | 2018-05-02 | Method for extracting essential image by using convolutional neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810407424.8A CN108764250B (en) | 2018-05-02 | 2018-05-02 | Method for extracting essential image by using convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108764250A CN108764250A (en) | 2018-11-06 |
CN108764250B (en) | 2021-09-17
Family
ID=64008978
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810407424.8A Active CN108764250B (en) | 2018-05-02 | 2018-05-02 | Method for extracting essential image by using convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108764250B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109658330B (en) * | 2018-12-10 | 2023-12-26 | Guangzhou Jiubang Digital Technology Co., Ltd. | Color development adjusting method and device |
CN109919869B (en) * | 2019-02-28 | 2021-06-04 | Tencent Technology (Shenzhen) Co., Ltd. | Image enhancement method and device and storage medium |
CN110659023B (en) * | 2019-09-11 | 2020-10-23 | Tencent Technology (Shenzhen) Co., Ltd. | Method for generating programming content and related device |
CN111179196B (en) * | 2019-12-28 | 2023-04-18 | Hangzhou Dianzi University | Multi-resolution depth network image highlight removing method based on divide-and-conquer |
CN111325221B (en) * | 2020-02-25 | 2023-06-23 | Qingdao Marine Science and Technology Center | Image feature extraction method based on image depth information |
CN111489321B (en) * | 2020-03-09 | 2020-11-03 | Huaiyin Institute of Technology | Depth network image enhancement method and system based on derivative graph and Retinex |
CN113034353B (en) * | 2021-04-09 | 2024-07-12 | Xi'an University of Architecture and Technology | Intrinsic image decomposition method and system based on cross convolution neural network |
CN114742922A (en) * | 2022-04-07 | 2022-07-12 | Suzhou University of Science and Technology | Self-adaptive image engine color optimization method, system and storage medium |
- 2018-05-02: Application CN201810407424.8A filed in China; granted as CN108764250B (active)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104756152A (en) * | 2012-10-26 | 2015-07-01 | SK Telecom Co., Ltd. | Image correction device for accelerating image correction and method for same |
JP2018018422A (en) * | 2016-07-29 | 2018-02-01 | Denso IT Laboratory, Inc. | Prediction device, prediction method and prediction program |
CN106778583A (en) * | 2016-12-07 | 2017-05-31 | Beijing Institute of Technology | Vehicle attribute recognition method and device based on convolutional neural networks |
CN107145857A (en) * | 2017-04-29 | 2017-09-08 | Shenzhen SenseNets Technology Co., Ltd. | Face attribute recognition method, device and method for establishing model |
CN107330451A (en) * | 2017-06-16 | 2017-11-07 | Xi'an Jiaotong-Liverpool University | Clothes attribute retrieval method based on deep convolutional neural networks |
CN107403197A (en) * | 2017-07-31 | 2017-11-28 | Wuhan University | Crack identification method based on deep learning |
CN107633272A (en) * | 2017-10-09 | 2018-01-26 | Donghua University | DCNN textural defect recognition method based on compressed sensing under small sample |
Non-Patent Citations (3)
Title |
---|
Correlation-Based Intrinsic Image Extraction from a Single Image; Jiang X Y et al.; European Conference on Computer Vision; 2010-09-30; whole document *
Densely Connected Convolutional Networks; Gao Huang et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017-11-09; whole document *
Infrared and visible image fusion method based on non-subsampled Contourlet transform and sparse representation; Wang Jun et al.; Acta Armamentarii; 2013-07-31; Vol. 34, No. 7; whole document *
Also Published As
Publication number | Publication date |
---|---|
CN108764250A (en) | 2018-11-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108764250B (en) | Method for extracting essential image by using convolutional neural network | |
CN110135366B (en) | Shielded pedestrian re-identification method based on multi-scale generation countermeasure network | |
CN109815893B (en) | Color face image illumination domain normalization method based on cyclic generation countermeasure network | |
CN112767468B (en) | Self-supervision three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement | |
CN113313657B (en) | Unsupervised learning method and system for low-illumination image enhancement | |
CN111445582A (en) | Single-image human face three-dimensional reconstruction method based on illumination prior | |
CN112489164B (en) | Image coloring method based on improved depth separable convolutional neural network | |
CN111861906A (en) | Pavement crack image virtual augmentation model establishment and image virtual augmentation method | |
CN110532928A (en) | Facial critical point detection method based on facial area standardization and deformable hourglass network | |
CN114463492A (en) | Adaptive channel attention three-dimensional reconstruction method based on deep learning | |
CN115018711B (en) | Image super-resolution reconstruction method for warehouse scheduling | |
CN113284061A (en) | Underwater image enhancement method based on gradient network | |
CN114066955A (en) | Registration method for registering infrared light image to visible light image | |
CN117292117A (en) | Small target detection method based on attention mechanism | |
CN115797561A (en) | Three-dimensional reconstruction method, device and readable storage medium | |
CN116958420A (en) | High-precision modeling method for three-dimensional face of digital human teacher | |
CN114663880A (en) | Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism | |
CN113159158B (en) | License plate correction and reconstruction method and system based on generation countermeasure network | |
Guan et al. | DiffWater: Underwater image enhancement based on conditional denoising diffusion probabilistic model | |
CN117495718A (en) | Multi-scale self-adaptive remote sensing image defogging method | |
CN117036876A (en) | Generalizable target re-identification model construction method based on three-dimensional visual angle alignment | |
CN116823610A (en) | Deep learning-based underwater image super-resolution generation method and system | |
CN113971760B (en) | High-quality quasi-dense complementary feature extraction method based on deep learning | |
CN113205005B (en) | Low-illumination low-resolution face image reconstruction method | |
CN114913433A (en) | Multi-scale target detection method combining equalization feature and deformable convolution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||