CN109635882B - Salient object detection method based on multi-scale convolution feature extraction and fusion - Google Patents

Salient object detection method based on multi-scale convolution feature extraction and fusion

Info

Publication number
CN109635882B
Authority
CN
China
Prior art keywords
feature
convolution
scale
characteristic
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910062293.9A
Other languages
Chinese (zh)
Other versions
CN109635882A (en)
Inventor
牛玉贞
龙观潮
郭文忠
苏超然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201910062293.9A priority Critical patent/CN109635882B/en
Publication of CN109635882A publication Critical patent/CN109635882A/en
Application granted granted Critical
Publication of CN109635882B publication Critical patent/CN109635882B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions

Abstract

The invention relates to a salient object detection method based on multi-scale convolution feature extraction and fusion, which comprises the steps of first performing data enhancement, processing each color image together with its corresponding manually annotated map to increase the data volume of the training data set; extracting multi-scale features and performing channel compression to optimize the computational efficiency of the network; then fusing the multi-scale features to obtain a predicted saliency map; learning the optimal parameters of the model by minimizing the cross entropy loss; and finally predicting the salient objects in an image by using the trained network. The invention can significantly improve the detection accuracy of salient objects.

Description

Salient object detection method based on multi-scale convolution feature extraction and fusion
Technical Field
The invention relates to the field of image processing and computer vision, in particular to a salient object detection method based on multi-scale convolution feature extraction and fusion.
Background
How to fuse convolution features of various scales in a full convolution network is an open problem in the field of salient object detection. Starting from this problem, most existing salient object detection methods based on fully convolutional neural networks add network branches so that convolution features of different scales can be fused through these branches, thereby generating features that are more useful for the salient object detection task. Salient object detection algorithms proposed after 2015 mostly focus on applying fully convolutional neural networks (FCNN) to improve the computational efficiency of the network and the accuracy of salient object detection.
These works can be divided into two types. The first is innovation in the structure of the full convolution network itself: Li et al. obtain features of different scales from a pre-trained VGG-16 network, apply a convolution calculation to the features of each scale to obtain new feature results, restore the features to a uniform size through an up-sampling operation, and finally obtain the saliency detection result through a convolution operation; a superpixel-scale branch is also fused to refine the final salient object detection result at the spatial scale. The salient object detection network proposed by Wang et al. is a fully convolutional neural network in encoder-decoder form, to which a recurrent structure is added to iteratively refine the salient object detection result. Cheng et al. add a short connection structure to the full convolution network; because each output branch in the short connection structure fuses high-level semantic information with low-level features such as texture and shape, the performance of the algorithm is significantly improved while the model remains simple and efficient.
However, most methods fuse the convolution features of different scales produced by a feature network pre-trained on a classification task, and the scales of these features are generally limited and fixed.
Disclosure of Invention
In view of this, the present invention provides a method for detecting a salient object based on multi-scale convolution feature extraction and fusion, which can significantly improve the detection accuracy of the salient object.
The invention is realized by adopting the following scheme: a salient object detection method based on multi-scale convolution feature extraction and fusion specifically comprises the following steps:
step S1: data enhancement is carried out, meanwhile, the color image and the corresponding artificial labeling graph are processed, and the data volume of the training data set is increased;
step S2: extracting multi-scale features and performing channel compression to optimize the computing efficiency of the network;
step S3: fusing multi-scale features to obtain a predicted saliency map Pred_i;
Step S4: learning the optimal parameters of the model by solving the minimum cross entropy loss; and finally, predicting the salient objects in the image by using the trained model network.
Further, step S1 specifically includes the following steps:
step S11: scaling each color image in the data set together with its corresponding manual annotation map, so that the computing device can handle the computational load of the neural network;
step S12: applying a random cropping operation jointly to each color image in the data set and its corresponding manual annotation map, so as to increase the diversity of the data;
step S13: generating mirror images by horizontally flipping the images, so as to enlarge the data volume of the original data set.
Further, step S2 specifically includes the following steps:
step S21: the inherent network structure of U-Net is improved, wherein the encoder structure of the U-Net network takes an image classification convolution network as its feature network and generates convolution features of 5 different scales by continuously stacking and combining convolution layers and pooling layers; a pooling layer lies between the convolution feature En_i and the convolution feature En_{i+1} to gradually reduce the size of the feature map, and the step size of the pooling layer is set to 2, so that En_{i+1} is reduced by half relative to En_i in both the width and height spatial dimensions; in order to keep enough spatial information in the convolution features, the step size of the pooling layer between the last two convolution features is set to 1, so that the last two convolution features keep a consistent size in both the width and height spatial dimensions;
step S22: designing a multi-scale feature extraction module to act on the convolution feature of each scale generated by the improved U-Net network in the step S21 to obtain multi-scale content features;
step S23: a channel compression module is added to act on the multi-scale content characteristics to optimize the computing efficiency of the network.
Further, step S22 specifically includes the following steps:
step S221: designing three convolution layers that take the convolution feature En_i as input; these three convolutions are all implemented by depthwise separable hole (dilated) convolution operations, in which the expansion coefficients of the hole convolutions are 3, 6 and 9 respectively; the feature results obtained by these three operations keep the same feature size as the convolution feature En_i, all being (c, h, w);
step S222: splicing the three feature results together along the channel dimension by applying a concatenation operation, obtaining a feature result with feature size (3c, h, w);
step S223: applying a convolution operation with a kernel size of (1,1) to compress the channels of the feature result obtained in step S222 to be consistent with the convolution feature En_i, obtaining the multi-scale content feature with feature size (c, h, w).
Further, step S3 specifically includes the following steps:
step S31: designing a multi-scale feature fusion module; assume the input multi-scale content feature Feat_i has a feature size of (c, h, w); in the multi-scale feature fusion module, a depthwise separable convolution operation with kernel sizes (1, k) and (k,1) and a depthwise separable convolution operation with kernel sizes (k,1) and (1, k) are applied respectively to obtain feature fusion results with the same size as the input feature Feat_i;
step S32: the decoder structure of the U-Net network corresponds to the 5 feature results of different scales of the encoder feature network; for the convolution feature Dec_i of each scale generated by the decoder structure of the U-Net network, the multi-scale feature fusion module is applied to fuse the multi-scale content feature Feat_i and the convolution feature Dec_{i+1}, where the input convolution feature Dec_{i+1} is assumed to have a feature size of (c, h/2, w/2); first, an up-sampling operation is applied to the convolution feature Dec_{i+1} to enlarge it by a factor of two in the spatial dimensions, so that the convolution feature Dec_{i+1} has the same spatial size as the multi-scale content feature Feat_i, with feature size (c, h, w); then the multi-scale content feature Feat_i and the convolution feature Dec_{i+1} are spliced to obtain a spliced feature with feature size (2c, h, w), and a convolution operation followed by a ReLU activation function and a BN layer yields a feature result with feature size (c, h, w); next, the multi-scale feature fusion module is applied to the obtained feature result to obtain a feature fusion result, the feature result and the feature fusion result are spliced and convolved, and a ReLU activation function and a BN layer yield the feature result Dec_i with feature size (c, h, w); finally, a convolution operation with a kernel size of (1,1) is applied to reduce the number of channels of the feature result Dec_i by half so that it can be fused with Dec_{i-1}, and a ReLU activation function and a BN layer yield the feature result Dec_i with feature size (0.5c, h, w); its channels are compressed to 1 by a convolution operation, and the predicted saliency map Pred_i is obtained through a Sigmoid function.
Further, step S31 specifically includes the following steps:
step S311: successively applying to the input multi-scale content feature Feat_i a depthwise separable convolution operation with kernels (1, k) and then (k,1), and successively applying to the input feature Feat_i a depthwise separable convolution operation with kernels (k,1) and then (1, k); after each of the two successive operations, a BN layer is added, and two feature results are obtained respectively;
step S312: summing the two feature results according to the channel dimension to obtain a feature result with the same size as the input feature Feat_i;
step S313: applying a convolution operation with a kernel size of (1,1) to model the features across the channels of this feature result and obtain a feature fusion result with the same size as the input feature Feat_i.
Further, in step S4, the cross entropy Loss is calculated by the following formula:
Loss_i = - Σ_j [ G_j · log(Pred_{i,j}) + (1 - G_j) · log(1 - Pred_{i,j}) ]
where G_j denotes the value of the manual annotation map at pixel j and Pred_{i,j} denotes the value of the predicted saliency map Pred_i at pixel j.
compared with the prior art, the invention has the following beneficial effects: the invention provides a multi-scale feature extraction module and a multi-scale fusion module, wherein the module is directly embedded into a U-Net network architecture of a typical encoder-decoder structure in network design, and meanwhile, the redundancy of information on a feature channel on the decoder structure is also considered, and a channel compression module is applied to ensure that the model calculation efficiency is higher. The invention can obviously improve the detection precision of the obvious object.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
Fig. 2 is a structure diagram of a salient object detection network according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a multi-scale feature extraction module according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a channel compression module according to an embodiment of the invention.
Fig. 5 is a schematic diagram of a multi-scale feature fusion module according to an embodiment of the present invention.
Fig. 6 is a schematic network structure diagram of a multi-scale feature fusion process according to an embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the present embodiment provides a method for detecting a salient object based on multi-scale convolution feature extraction and fusion, which specifically includes the following steps:
step S1: data enhancement is carried out, meanwhile, the color image and the corresponding artificial labeling graph are processed, and the data volume of the training data set is increased;
step S2: extracting multi-scale features and performing channel compression to optimize the computing efficiency of the network;
step S3: fusing multi-scale features to obtain a predicted saliency map Pred_i;
Step S4: learning the optimal parameters of the model by solving the minimum cross entropy loss; and finally, predicting the salient objects in the image by using the trained model network.
In this embodiment, step S1 performs data enhancement, processing each color image together with its corresponding manual annotation map so as to increase the data volume of the training data set. The mainstream international data sets used for training a salient object detection network generally contain color images and corresponding manual annotation maps, where the color image is as shown in fig. 2 (a) and the manual annotation map, similar to a saliency map (e.g., fig. 2 (b)), is a binary image in which the salient object region of the image is marked manually. Because constructing such a data set requires a large amount of manual effort, while training a deep neural network requires a sufficient amount of data, data enhancement must be performed on top of the original data set. Therefore, step S1 specifically includes the following steps:
step S11: scaling each color image in the data set together with its corresponding manual annotation map, so that the computing device can handle the computational load of the neural network;
step S12: applying a random cropping operation jointly to each color image in the data set and its corresponding manual annotation map, so as to increase the diversity of the data;
step S13: generating mirror images by horizontally flipping the images, so as to enlarge the data volume of the original data set, meet the larger data requirements of training a deep convolutional neural network (CNN), and enhance the generalization capability of the model.
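By way of illustration only, the following is a minimal Python sketch of how the three augmentation operations of steps S11 to S13 might be applied jointly to a color image and its annotation map; the PIL-based implementation, the 256x256 scaling size, the 224x224 crop size and the 0.5 flip probability are assumptions chosen for the example and are not specified by the patent.

```python
import random
from PIL import Image

def augment_pair(image_path, label_path, scale_size=(256, 256), crop_size=(224, 224)):
    """Scale, randomly crop, and horizontally flip a color image and its annotation map together."""
    image = Image.open(image_path).convert("RGB")
    label = Image.open(label_path).convert("L")  # binary manual annotation map

    # Step S11: scale both images to a size the computing device can handle.
    image = image.resize(scale_size, Image.BILINEAR)
    label = label.resize(scale_size, Image.NEAREST)

    # Step S12: apply the same random crop to both images to increase data diversity.
    left = random.randint(0, scale_size[0] - crop_size[0])
    top = random.randint(0, scale_size[1] - crop_size[1])
    box = (left, top, left + crop_size[0], top + crop_size[1])
    image, label = image.crop(box), label.crop(box)

    # Step S13: horizontal flip to generate a mirrored sample.
    if random.random() < 0.5:
        image = image.transpose(Image.FLIP_LEFT_RIGHT)
        label = label.transpose(Image.FLIP_LEFT_RIGHT)

    return image, label
```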
In this embodiment, step S2 specifically includes the following steps:
step S21: the inherent network structure of U-Net is improved, wherein the encoder structure of the U-Net network takes an image classification convolution network as its feature network (such as a VGG or ResNet structure) and generates convolution features of 5 different scales by continuously stacking and combining convolution layers and pooling layers, such as the five feature results En_1, En_2, En_3, En_4 and En_5 in FIG. 2. Among these five convolution features, a pooling layer lies between the convolution feature En_i and the convolution feature En_{i+1} to gradually reduce the size of the feature map; the pooling layer is set to a step size of 2, so that En_{i+1} is reduced by half relative to En_i in both the width and height spatial dimensions, which also attenuates the spatial information of the convolution features; in order to preserve enough spatial information in the convolution features, the step size of the pooling layer between the last two convolution features (En_4 and En_5) is set to 1, so that the last two convolution features (En_4 and En_5) keep a consistent size in both the width and height spatial dimensions;
step S22: designing a multi-scale feature extraction module to act on the convolution feature of each scale generated by the improved U-Net network in the step S21 to obtain multi-scale content features; the multi-scale feature extraction module is shown in fig. 3, and here, the feature size of the convolution feature is assumed to be (c, h, w);
step S23: a channel compression module is added to act on the multi-scale content features to optimize the computational efficiency of the network. The channel compression module is shown in FIG. 4, where "SE Module" is the module proposed by Hu et al. in the SENet (Squeeze-and-Excitation Networks) paper. The SE module takes the multi-scale content feature Feat_i as input and strengthens the generalization capability of the features by modeling the correlation between features on each channel and applying a weighting operation. The channel compression module then applies a convolution operation with a kernel size of (1,1) to compress the number of channels of the feature result to half of the original number, followed by a ReLU (Rectified Linear Unit) activation function and a BN (Batch Normalization) layer, obtaining the channel-compressed multi-scale content feature Feat_i.
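A minimal PyTorch sketch of such a channel compression module is given below for illustration; the SE reduction ratio of 16 and the class names are assumptions made for the example rather than values specified in the patent.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: model the correlation between channels and reweight them."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights  # channel-wise weighting

class ChannelCompression(nn.Module):
    """SE module followed by a (1,1) convolution that halves the channel count, then ReLU and BN."""
    def __init__(self, channels):
        super().__init__()
        self.se = SEBlock(channels)
        self.compress = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.bn = nn.BatchNorm2d(channels // 2)

    def forward(self, feat_i):
        return self.bn(self.relu(self.compress(self.se(feat_i))))
```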
In this embodiment, step S22 specifically includes the following steps:
step S221: three convolution layers are designed that take the convolution feature En_i as input; these three convolutions are all implemented by depthwise separable hole (dilated) convolution operations, in which the expansion coefficients of the hole convolutions are 3, 6 and 9 respectively. Setting different expansion coefficients allows the convolution operations to capture content regions of different sizes in the image, i.e. to generate feature results for multi-scale content regions. The feature results obtained by these three operations keep the same feature size as the convolution feature En_i, all being (c, h, w);
step S222: splicing the three feature results together along the channel dimension by applying a concatenation operation (concat) to obtain a feature result with a feature size of (3c, h, w);
step S223: applying a convolution operation with a kernel size of (1,1) to compress the channels of the feature result obtained in step S222 to be consistent with the convolution feature En_i, obtaining the multi-scale content feature with feature size (c, h, w), such as Feat_i in FIG. 4.
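For illustration, a possible PyTorch sketch of this multi-scale feature extraction module follows; the 3x3 kernel of the depthwise dilated convolutions and the class names are assumptions, since the patent specifies only the dilation (expansion) coefficients 3, 6 and 9 and the channel sizes.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableDilatedConv(nn.Module):
    """Depthwise dilated (hole) convolution followed by a pointwise (1,1) convolution."""
    def __init__(self, channels, dilation, kernel_size=3):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2  # keeps the spatial size unchanged
        self.depthwise = nn.Conv2d(channels, channels, kernel_size, padding=padding,
                                   dilation=dilation, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class MultiScaleFeatureExtraction(nn.Module):
    """Three parallel depthwise separable dilated convolutions (dilations 3, 6, 9), whose outputs
    are concatenated along the channel dimension and compressed back to c channels by a (1,1) conv."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            [DepthwiseSeparableDilatedConv(channels, d) for d in (3, 6, 9)])
        self.compress = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, en_i):                            # En_i: (c, h, w)
        feats = [branch(en_i) for branch in self.branches]
        return self.compress(torch.cat(feats, dim=1))   # (3c, h, w) -> Feat_i: (c, h, w)
```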
In this embodiment, step S3 specifically includes the following steps:
step S31: in order to fuse features of different sizes, the present embodiment designs a multi-scale feature fusion module, as shown in fig. 5. Assume the input multi-scale content feature Feat_i has a feature size of (c, h, w). In the multi-scale feature fusion module, a depthwise separable convolution operation with kernel sizes (1, k) and (k,1) and a depthwise separable convolution operation with kernel sizes (k,1) and (1, k) are applied respectively to obtain feature fusion results with the same size as the input feature Feat_i; this is equivalent to a (k, k) convolution operation but requires fewer computational resources, while still being able to combine content region features of different scales in the spatial dimension;
step S32: the decoder structure of the U-Net network corresponds to the 5 feature results of different scales of the encoder feature network. For the convolution feature Dec_i of each scale generated by the decoder structure of the U-Net network, the multi-scale feature fusion module is applied to fuse the multi-scale content feature Feat_i and the convolution feature Dec_{i+1}; here the input convolution feature Dec_{i+1} is assumed to have a feature size of (c, h/2, w/2). First, an up-sampling operation is applied to the convolution feature Dec_{i+1} to enlarge it by a factor of two in the spatial dimensions, so that Dec_{i+1} has the same spatial size as the multi-scale content feature Feat_i, i.e. a feature size of (c, h, w). Then the multi-scale content feature Feat_i and the convolution feature Dec_{i+1} are spliced to obtain a spliced feature with feature size (2c, h, w), and a convolution operation followed by a ReLU activation function and a BN layer yields a feature result with feature size (c, h, w). Next, the multi-scale feature fusion module is applied to this feature result to obtain a feature fusion result; the feature result and the feature fusion result are then spliced and convolved, and a ReLU activation function and a BN layer give the feature result Dec_i with feature size (c, h, w). Finally, a convolution operation with a kernel size of (1,1) is applied to reduce the number of channels of the feature result Dec_i by half so that it can be fused with Dec_{i-1}, and a ReLU activation function and a BN layer give the feature result Dec_i with feature size (0.5c, h, w); its channels are further compressed to 1 by a convolution operation, and the predicted saliency map Pred_i is obtained through a Sigmoid function. Notably, because Dec_4 and Dec_5 have the same number of channels, the number of channels is not compressed there.
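Purely as an illustration, the sketch below shows one way a single decoder stage of this kind could be written in PyTorch. It assumes 3x3 kernels for the convolutions whose kernel size the patent does not state, bilinear up-sampling, and a `fusion_module` standing for the multi-scale feature fusion module of step S31 (a sketch of that module follows step S313 below); none of these choices are fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """Upsample Dec_{i+1}, fuse it with Feat_i, and produce the channel-halved Dec_i
    for the next stage together with the predicted saliency map Pred_i."""
    def __init__(self, channels, fusion_module):
        super().__init__()
        self.fuse1 = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1),
                                   nn.ReLU(inplace=True), nn.BatchNorm2d(channels))
        self.fusion = fusion_module  # multi-scale feature fusion module (step S31)
        self.fuse2 = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1),
                                   nn.ReLU(inplace=True), nn.BatchNorm2d(channels))
        self.halve = nn.Sequential(nn.Conv2d(channels, channels // 2, kernel_size=1),
                                   nn.ReLU(inplace=True), nn.BatchNorm2d(channels // 2))
        self.to_saliency = nn.Conv2d(channels // 2, 1, kernel_size=1)  # compress channels to 1

    def forward(self, feat_i, dec_next):
        dec_up = F.interpolate(dec_next, scale_factor=2, mode="bilinear", align_corners=False)
        x = self.fuse1(torch.cat([feat_i, dec_up], dim=1))     # (2c, h, w) -> (c, h, w)
        fused = self.fusion(x)                                 # multi-scale feature fusion result
        dec_i = self.fuse2(torch.cat([x, fused], dim=1))       # (2c, h, w) -> (c, h, w)
        dec_i_half = self.halve(dec_i)                         # (c, h, w) -> (0.5c, h, w)
        pred_i = torch.sigmoid(self.to_saliency(dec_i_half))   # per-pixel saliency in [0, 1]
        return dec_i_half, pred_i
```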
In this embodiment, step S31 specifically includes the following steps:
step S311: a depthwise separable convolution operation with kernels (1, k) and then (k,1) is applied successively to the input multi-scale content feature Feat_i, and a depthwise separable convolution operation with kernels (k,1) and then (1, k) is applied successively to the input feature Feat_i; after each of the two successive operations, a BN layer is added, and two feature results are obtained respectively;
step S312: the two feature results are summed according to the channel dimension to obtain a feature result with the same size as the input feature Feat_i;
step S313: a convolution operation with a kernel size of (1,1) is applied to model the features across the channels of this feature result and obtain a feature fusion result with the same size as the input feature Feat_i.
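The following PyTorch sketch illustrates one possible reading of this fusion module; the kernel length k = 7 and the purely depthwise (grouped) implementation of the factorized (1, k)/(k, 1) convolutions are assumptions for the example, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureFusion(nn.Module):
    """Two parallel factorized depthwise convolutions, (1,k)->(k,1) and (k,1)->(1,k), each followed
    by BN; their outputs are summed and then mixed across channels by a (1,1) convolution."""
    def __init__(self, channels, k=7):
        super().__init__()
        p = k // 2  # padding that keeps the spatial size unchanged for odd k
        self.branch_a = nn.Sequential(
            nn.Conv2d(channels, channels, (1, k), padding=(0, p), groups=channels),
            nn.Conv2d(channels, channels, (k, 1), padding=(p, 0), groups=channels),
            nn.BatchNorm2d(channels))
        self.branch_b = nn.Sequential(
            nn.Conv2d(channels, channels, (k, 1), padding=(p, 0), groups=channels),
            nn.Conv2d(channels, channels, (1, k), padding=(0, p), groups=channels),
            nn.BatchNorm2d(channels))
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)  # model features across channels

    def forward(self, feat_i):
        return self.mix(self.branch_a(feat_i) + self.branch_b(feat_i))
```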
In the present embodiment, in step S4, an Adam (Adaptive moment estimation) algorithm is used to optimize the loss function in the training phase. As shown in FIG. 2, the feature result Dec_i of each scale in step S3 corresponds to a loss Loss_i, and each Loss_i is the cross entropy loss calculated between the predicted saliency map Pred_i of FIG. 6 and the manual annotation map.
Wherein, the calculation of the network cross entropy Loss adopts the following formula:
Loss_i = - Σ_j [ G_j · log(Pred_{i,j}) + (1 - G_j) · log(1 - Pred_{i,j}) ]
where G_j denotes the value of the manual annotation map at pixel j and Pred_{i,j} denotes the value of the predicted saliency map Pred_i at pixel j.
and optimizing by an Adam algorithm to obtain the optimal parameters of the network, and finally predicting the salient objects in the color image by using the network.
When the algorithm extracts additional features of related scales on the basis of the original-scale features and then fuses them, the fused features have stronger generalization capability. Following this idea of extracting and fusing multi-scale convolution features, this embodiment provides a multi-scale feature extraction module and a multi-scale feature fusion module. These modules are directly embedded into a U-Net architecture with a typical encoder-decoder structure; at the same time, the redundancy of information on the feature channels of the decoder structure is taken into account, and a channel compression module is applied to make the model more efficient to compute. In summary, this embodiment provides a salient object detection method based on multi-scale convolution feature extraction and fusion, and the designed network structure based on multi-scale feature extraction and fusion can significantly improve the detection accuracy of salient objects.
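As a purely illustrative sketch of the training described in step S4 above: the loop below assumes the network returns one predicted saliency map Pred_i per decoder scale, bilinearly resizes each map to the annotation size before computing the binary cross entropy (the patent does not state how resolutions are matched), and uses assumed hyper-parameters for the epoch count and learning rate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train(model, loader, epochs=30, lr=1e-4, device="cuda"):
    """Sum the cross entropy losses of all predicted saliency maps and optimize with Adam."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()  # predictions already pass through a Sigmoid

    for _ in range(epochs):
        for image, label in loader:              # color image and binary annotation map in [0, 1]
            image, label = image.to(device), label.to(device)
            preds = model(image)                 # list of Pred_i, one per decoder scale
            loss = sum(bce(F.interpolate(p, size=label.shape[-2:], mode="bilinear",
                                         align_corners=False), label)
                       for p in preds)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```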
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.

Claims (5)

1. A salient object detection method based on multi-scale convolution feature extraction and fusion, characterized in that the method comprises the following steps:
step S1: performing data enhancement, and simultaneously processing the color image and the corresponding artificial labeling image to increase the data volume of the training data set;
step S2: extracting multi-scale features, and performing channel compression to optimize the computing efficiency of the network;
step S3: fusing multi-scale features to obtain a predicted saliency map Pred_i;
Step S4: learning the optimal parameters of the model by solving the minimum cross entropy loss; finally, a trained model network is used for predicting the salient objects in the image;
step S2 specifically includes the following steps:
step S21: the inherent network structure of U-Net is improved, wherein the encoder structure of the U-Net network takes an image classification convolution network as a feature network and generates convolution features of 5 different scales by continuously stacking and combining convolution layers and pooling layers; a pooling layer exists between the convolution feature En_i and the convolution feature En_{i+1}, and the step size of this pooling layer is set to 2; the step size of the pooling layer between the last two convolution features is set to 1;
step S22: designing a multi-scale feature extraction module to act on the convolution feature of each scale generated by the improved U-Net network in the step S21 to obtain multi-scale content features;
step S23: adding a channel compression module to act on the multi-scale content features;
step S3 specifically includes the following steps:
step S31: designing a multi-scale feature fusion module; assume the input multi-scale content feature Feat_i has a feature size of (c, h, w); in the multi-scale feature fusion module, a depthwise separable convolution operation with kernel sizes (1, k) and (k,1) and a depthwise separable convolution operation with kernel sizes (k,1) and (1, k) are applied respectively to obtain feature fusion results with the same size as the input feature Feat_i;
step S32: the decoder structure of the U-Net network corresponds to the 5 feature results of different scales of the encoder feature network; for the convolution feature Dec_i of each scale generated by the decoder structure of the U-Net network, the multi-scale feature fusion module is applied to fuse the multi-scale content feature Feat_i and the convolution feature Dec_{i+1}, where the input convolution feature Dec_{i+1} is assumed to have a feature size of (c, h/2, w/2); first, an up-sampling operation is applied to the convolution feature Dec_{i+1} to enlarge it by a factor of two in the spatial dimensions, so that the convolution feature Dec_{i+1} has the same spatial size as the multi-scale content feature Feat_i, with feature size (c, h, w); then the multi-scale content feature Feat_i and the convolution feature Dec_{i+1} are spliced to obtain a spliced feature with feature size (2c, h, w), and a convolution operation followed by a ReLU activation function and a BN layer yields a feature result with feature size (c, h, w); next, the multi-scale feature fusion module is applied to the obtained feature result to obtain a feature fusion result, the feature result and the feature fusion result are spliced and convolved, and a ReLU activation function and a BN layer yield the feature result Dec_i with feature size (c, h, w); finally, a convolution operation with a kernel size of (1,1) is applied to reduce the number of channels of the feature result Dec_i by half so that it can be fused with Dec_{i-1}, and a ReLU activation function and a BN layer yield the feature result Dec_i with feature size (0.5c, h, w); its channels are compressed to 1 by a convolution operation, and the predicted saliency map Pred_i is obtained through a Sigmoid function.
2. The method for detecting the salient objects based on the multi-scale convolution feature extraction and fusion as claimed in claim 1, wherein: step S1 specifically includes the following steps:
step S11: scaling each color image in the data set together with its corresponding manual annotation map;
step S12: applying a random cropping operation jointly to each color image in the data set and its corresponding manual annotation map;
step S13: generating a mirror image by horizontally flipping the image.
3. The method for detecting the salient objects based on the multi-scale convolution feature extraction and fusion as claimed in claim 1, wherein: step S22 specifically includes the following steps:
step S221: designing three convolution layers that take the convolution feature En_i as input; these three convolutions are all implemented by depthwise separable hole (dilated) convolution operations, where the expansion coefficients of the hole convolutions are 3, 6 and 9 respectively; the feature results obtained by these three operations keep the same feature size as the convolution feature En_i, all being (c, h, w);
step S222: splicing the three feature results together along the channel dimension by applying a concatenation operation, obtaining a feature result with feature size (3c, h, w);
step S223: applying a convolution operation with a kernel size of (1,1) to compress the channels of the feature result obtained in step S222 to be consistent with the convolution feature En_i, obtaining the multi-scale content feature with feature size (c, h, w).
4. The method for detecting the salient objects based on the multi-scale convolution feature extraction and fusion as claimed in claim 1, wherein: step S31 specifically includes the following steps:
step S311: successively applying to the input multi-scale content feature Feat_i a depthwise separable convolution operation with kernels (1, k) and then (k,1), and successively applying to the input feature Feat_i a depthwise separable convolution operation with kernels (k,1) and then (1, k); after each of the two successive operations, a BN layer is added, and two feature results are obtained respectively;
step S312: summing the two feature results according to the channel dimension to obtain a feature result with the same size as the input feature Feat_i;
step S313: applying a convolution operation with a kernel size of (1,1) to model the features across the channels of this feature result and obtain a feature fusion result with the same size as the input feature Feat_i.
5. The method for detecting the salient objects based on the multi-scale convolution feature extraction and fusion as claimed in claim 1, wherein: in step S4, the cross entropy Loss is calculated by the following equation:
Loss_i = - Σ_j [ G_j · log(Pred_{i,j}) + (1 - G_j) · log(1 - Pred_{i,j}) ]
where G_j denotes the value of the manual annotation map at pixel j and Pred_{i,j} denotes the value of the predicted saliency map Pred_i at pixel j.
CN201910062293.9A 2019-01-23 2019-01-23 Salient object detection method based on multi-scale convolution feature extraction and fusion Active CN109635882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910062293.9A CN109635882B (en) 2019-01-23 2019-01-23 Salient object detection method based on multi-scale convolution feature extraction and fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910062293.9A CN109635882B (en) 2019-01-23 2019-01-23 Salient object detection method based on multi-scale convolution feature extraction and fusion

Publications (2)

Publication Number Publication Date
CN109635882A CN109635882A (en) 2019-04-16
CN109635882B true CN109635882B (en) 2022-05-13

Family

ID=66063115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910062293.9A Active CN109635882B (en) 2019-01-23 2019-01-23 Salient object detection method based on multi-scale convolution feature extraction and fusion

Country Status (1)

Country Link
CN (1) CN109635882B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084309B (en) * 2019-04-30 2022-06-21 北京市商汤科技开发有限公司 Feature map amplification method, feature map amplification device, feature map amplification equipment and computer readable storage medium
CN110298397A (en) * 2019-06-25 2019-10-01 东北大学 The multi-tag classification method of heating metal image based on compression convolutional neural networks
CN110322528B (en) * 2019-06-26 2021-05-14 浙江大学 Nuclear magnetic resonance brain image vessel reconstruction method based on 3T and 7T
CN110490892A (en) * 2019-07-03 2019-11-22 中山大学 A kind of Thyroid ultrasound image tubercle automatic positioning recognition methods based on USFaster R-CNN
CN110348390B (en) * 2019-07-12 2023-05-16 创新奇智(重庆)科技有限公司 Training method, computer readable medium and system for flame detection model
CN110378976B (en) * 2019-07-18 2020-11-13 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN110660046B (en) * 2019-08-30 2022-09-30 太原科技大学 Industrial product defect image classification method based on lightweight deep neural network
CN111080588A (en) * 2019-12-04 2020-04-28 南京航空航天大学 Multi-scale neural network-based rapid fetal MR image brain extraction method
CN111028246A (en) * 2019-12-09 2020-04-17 北京推想科技有限公司 Medical image segmentation method and device, storage medium and electronic equipment
CN111080599A (en) * 2019-12-12 2020-04-28 哈尔滨市科佳通用机电股份有限公司 Fault identification method for hook lifting rod of railway wagon
CN111191649A (en) * 2019-12-31 2020-05-22 上海眼控科技股份有限公司 Method and equipment for identifying bent multi-line text image
CN111814536B (en) * 2020-05-21 2023-11-28 闽江学院 Culture monitoring method and device
CN111860233B (en) * 2020-07-06 2021-05-18 中国科学院空天信息创新研究院 SAR image complex building extraction method and system based on attention network selection
CN112258431B (en) * 2020-09-27 2021-07-20 成都东方天呈智能科技有限公司 Image classification model based on mixed depth separable expansion convolution and classification method thereof
CN112446292B (en) * 2020-10-28 2023-04-28 山东大学 2D image salient object detection method and system
CN112115951B (en) * 2020-11-19 2021-03-09 之江实验室 RGB-D image semantic segmentation method based on spatial relationship
CN112861795A (en) * 2021-03-12 2021-05-28 云知声智能科技股份有限公司 Method and device for detecting salient target of remote sensing image based on multi-scale feature fusion


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10033918B2 (en) * 2016-03-29 2018-07-24 Sony Corporation Method and system for image processing to detect salient objects in image

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171701A (en) * 2018-01-15 2018-06-15 复旦大学 Conspicuousness detection method based on U networks and confrontation study
CN109165660A (en) * 2018-06-20 2019-01-08 扬州大学 A kind of obvious object detection method based on convolutional neural networks
CN109191426A (en) * 2018-07-24 2019-01-11 江南大学 A kind of flat image conspicuousness detection method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Depth-Aware Salient Object Detection and Segmentation via Multiscale Discriminative Saliency Fusion and Bootstrap Learning;Hangke Song et al.;《IEEE Transactions on Image Processing》;20170602;第26卷(第9期);第4204-4216页 *
Salient Object Segmentation Based on Superpixel and Background Connectivity Prior;Yuzhen Niu et al.;《IEEE Access》;20181001;第6卷;第56170-56183页 *
Semantic object detection and segmentation based on objectness sampling; 李金东; China Excellent Master's Theses Full-text Database (Master's), Information Science and Technology; 20190115; full text *

Also Published As

Publication number Publication date
CN109635882A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
CN109635882B (en) Salient object detection method based on multi-scale convolution feature extraction and fusion
CN110728219B (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN111696110B (en) Scene segmentation method and system
CN112348870B (en) Significance target detection method based on residual error fusion
Zanardelli et al. Image forgery detection: a survey of recent deep-learning approaches
CA3137297C (en) Adaptive convolutions in neural networks
CN111223057B (en) Incremental focused image-to-image conversion method based on generation of countermeasure network
CN107871103B (en) Face authentication method and device
CN112927209B (en) CNN-based significance detection system and method
CN113095254B (en) Method and system for positioning key points of human body part
CN115345866B (en) Building extraction method in remote sensing image, electronic equipment and storage medium
CN112785637A (en) Light field depth estimation method based on dynamic fusion network
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN114926734A (en) Solid waste detection device and method based on feature aggregation and attention fusion
CN114742985A (en) Hyperspectral feature extraction method and device and storage medium
CN112990356A (en) Video instance segmentation system and method
CN113313140A (en) Three-dimensional model classification and retrieval method and device based on deep attention
CN115937121A (en) Non-reference image quality evaluation method and system based on multi-dimensional feature fusion
CN115471718A (en) Construction and detection method of lightweight significance target detection model based on multi-scale learning
CN113919479B (en) Method for extracting data features and related device
CN113536977B (en) 360-degree panoramic image-oriented saliency target detection method
Sun et al. Adversarial training for dual-stage image denoising enhanced with feature matching
Joshi et al. Meta-Learning, Fast Adaptation, and Latent Representation for Head Pose Estimation
CN113298814A (en) Indoor scene image processing method based on progressive guidance fusion complementary network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant