CN110020658B - Salient object detection method based on multitask deep learning - Google Patents


Info

Publication number
CN110020658B
CN110020658B (application CN201910243220.XA)
Authority
CN
China
Prior art keywords
features
module
network
deconvolution
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910243220.XA
Other languages
Chinese (zh)
Other versions
CN110020658A (en)
Inventor
张立和
吴杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN201910243220.XA
Publication of CN110020658A
Application granted
Publication of CN110020658B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of deep learning and discloses a salient object detection method based on multi-task deep learning. A multi-task salient object detection network is built on the standard VGG16 model: a residual module that contrasts semantic and local features extracts richer local and semantic information, and two task networks then learn interactively, so that each network can learn the other's features and compensate for the shortcomings of its own. Compared with prior methods, the detection results are more accurate. For images containing multiple objects, or objects similar to the background, the results better match human visual perception, and the resulting saliency maps are more accurate. In addition, because the companion object-contour network is sensitive to object contours, the edges of the detected salient objects are greatly improved.

Description

Salient object detection method based on multitask deep learning
Technical Field
The invention belongs to the technical field of deep learning and relates to a computer-vision task known as salient object detection.
Background
With the development of science and technology, the amount of image and video data people receive has grown explosively, and processing image data quickly and effectively has become an urgent problem. Typically, a viewer attends only to the more salient regions of an image that attract the human eye, i.e. the foreground regions or salient objects, and ignores the background. Saliency detection therefore uses a computer to simulate the human visual system. Salient object detection is now applied as a preprocessing step in many areas of computer vision, including image retrieval, image compression, object recognition, and image segmentation.
In saliency detection, accurately locating the salient object in an image is a central problem. Traditional saliency detection methods have many shortcomings: when facing complex multi-object images, or salient objects similar to the background, the detection results are often inaccurate, and edge details are frequently missed.
Disclosure of Invention
The technical problem solved by the invention is: building on a deep network, to provide a novel salient object detection method whose detection results are more accurate.
The technical scheme of the invention is as follows:
a method for detecting a salient object based on multitask deep learning comprises the following steps:
(1) modules are added to a VGG16 network to obtain a salient object detection task network and an object contour detection task network; each deconvolution module of the salient object detection network contains only a feature interaction module and a residual module based on contrasting semantic and local features, while each deconvolution module of the object contour detection network contains only a feature interaction module and basic convolution layers; the encoding part is a basic VGG16 network composed of several convolution modules that progressively downsample the image into high-level features; the decoding part consists of several deconvolution modules, each of which upsamples the features by a factor of two, progressively upsampling the highest-level encoder features to the original image size for task prediction;
(2) a residual module based on contrasting semantic and local features is used in the salient object detection task network; it extracts local features and semantic features separately;
(3) to achieve good interaction between the two task networks, a feature interaction module is designed so that the salient object detection network and the object contour detection network promote each other; the feature interaction module is used only in the decoding part of each network; for the interaction of the two networks, they are trained alternately; when either network is trained, its feature interaction module takes four groups of features as input: the output feature S_t of the deconvolution module preceding the feature interaction module of the current network; S_t upsampled by a factor of two, denoted S_t^up; the encoder convolution-module output feature S_t^encoder of the same size as S_t^up; and the two-times-upsampled output C_t^up of the corresponding deconvolution module of the other task network; in the feature interaction module, the last three features are concatenated (concat) along the channel dimension; global average pooling (GAP) is then applied to S_t to obtain an attention channel vector; a 1x1 convolution makes the length of this vector equal to the number of channels of the concatenated features; a sigmoid function squashes the vector values into the range (0, 1); finally, the attention vector weights each channel of the concatenated features to screen them, so that the features after interaction are those most useful for the current task;
(4) for the attention vector in step (3), a sparse convolution module is provided that makes the attention vector sparse, further improving the generalization ability of the model;
(5) the final output of each network's deconvolution modules is supervised with ground truth to train the network; finally, softmax is applied to the prediction of the last deconvolution module of the decoding network to obtain the final prediction.
The invention has the following beneficial effects: the proposed multi-task salient object detection network obtains more local and semantic information by introducing a residual module that contrasts semantic and local features on top of the existing VGG16 base model, and then performs interactive learning between the two task networks, so that each network can learn the other's features and compensate for the shortcomings of its own. Compared with prior methods, the detection results are more accurate. For images with multiple objects, or objects similar to the background, the results of the proposed method better match human visual perception, and the resulting saliency maps are more accurate. In addition, because the companion object-contour network is sensitive to object contours, the edges of the salient object detection results are greatly improved.
Drawings
Fig. 1 is a general block diagram of a network in which the method of the present invention is implemented.
FIG. 2 is a block diagram of a sparse convolution in the method of the present invention.
FIG. 3 is a diagram of semantic comparison local feature residual module in the method of the present invention.
FIG. 4 shows the results of a test performed on a number of images by the method of the present invention.
Detailed Description
The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.
The idea of the invention is as follows: the complementarity between tasks in a multi-task network is exploited so that the tasks improve one another, ultimately improving the salient object detection result. Besides the feature interaction module designed specifically for multi-task interaction, additional modules are added to obtain more local and semantic information and to improve the generalization ability of the model. Finally, an alternating training scheme lets the two tasks learn each other's features in a targeted way, making the final detection more accurate.
The invention is implemented as follows:
a method for detecting a salient object based on multitask deep learning comprises the following steps:
(1) modules are added to a VGG16 network to obtain a salient object detection task network and an object contour detection task network; each deconvolution module of the salient object detection network contains only a feature interaction module and a residual module based on contrasting semantic and local features, while each deconvolution module of the object contour detection network contains only a feature interaction module and basic convolution layers; the encoding part is a basic VGG16 network composed of several convolution modules that progressively downsample the image into high-level features; the decoding part consists of several deconvolution modules, each of which upsamples the features by a factor of two, progressively upsampling the highest-level encoder features to the original image size for task prediction;
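The encoder/decoder sizing described in step (1) can be sketched numerically. The pooling and nearest-neighbour operations below are simplified stand-ins for VGG16's learned convolution and deconvolution layers; all names and the five-stage depth are illustrative assumptions based on the standard VGG16 layout:

```python
import numpy as np

def downsample2x(x):
    # 2x2 max pooling, the downsampling used between VGG16 convolution modules
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2x(x):
    # nearest-neighbour stand-in for a learned 2x deconvolution
    return x.repeat(2, axis=0).repeat(2, axis=1)

image = np.random.rand(224, 224)   # VGG16 input resolution
feat = image
for _ in range(5):                 # encoder: five conv modules, each halving the size
    feat = downsample2x(feat)
assert feat.shape == (7, 7)        # highest-level features

for _ in range(5):                 # decoder: five deconv modules, each doubling
    feat = upsample2x(feat)
assert feat.shape == image.shape   # back to the original image size for prediction
```

This only tracks spatial sizes; in the actual network each stage also changes the channel count and applies learned weights.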
(2) a residual module based on contrasting semantic and local features is used in the salient object detection task network; it extracts local features and semantic features separately, defined as:
F_out = F_in + (f_l(F_in; W_l) - f_c(F_in; W_c))
where F_in is the input feature of the residual module, F_out is its final output feature, f_l(·) denotes the local convolution operation with parameters W_l, and f_c(·) denotes the semantics-extracting convolution operation with parameters W_c; the local and semantic features are subtracted to obtain a contrast feature, which is added to the original feature to give the final output feature;
(3) to achieve good interaction between the two task networks, a feature interaction module is designed so that the salient object detection network and the object contour detection network promote each other; the feature interaction module is used only in the decoding part of each network; for the interaction of the two networks, they are trained alternately; when either network is trained, its feature interaction module takes four groups of features as input: the output feature S_t of the deconvolution module preceding the feature interaction module of the current network; S_t upsampled by a factor of two, denoted S_t^up; the encoder convolution-module output feature S_t^encoder of the same size as S_t^up; and the two-times-upsampled output C_t^up of the corresponding deconvolution module of the other task network; in the feature interaction module, the last three features are concatenated along the channel dimension; global average pooling is then applied to S_t to obtain an attention channel vector; a 1x1 convolution makes the length of this vector equal to the number of channels of the concatenated features; a sigmoid function squashes the vector values into the range (0, 1); finally, the attention vector weights each channel of the concatenated features to screen them, so that the features after interaction are those most useful for the current task, defined specifically as:
[Equation shown as an image in the original document.]
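The channel-attention pipeline of the feature interaction module (concatenate, global average pool, 1x1 convolution, sigmoid, channel-wise weighting) can be sketched with random tensors. Shapes and channel counts are illustrative assumptions, and the 1x1 convolution is written as the equivalent linear map over channels:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Feature tensors are (channels, height, width).
S_t = rng.normal(size=(64, 8, 8))                     # previous deconv-module output
S_t_up = S_t.repeat(2, axis=1).repeat(2, axis=2)      # S_t^up: 2x upsample (nearest-neighbour stand-in)
S_t_encoder = rng.normal(size=(64, 16, 16))           # same-size encoder feature
C_t_up = rng.normal(size=(64, 16, 16))                # C_t^up from the other task network

# Concatenate the last three feature groups along the channel dimension.
concat = np.concatenate([S_t_up, S_t_encoder, C_t_up], axis=0)   # 192 channels

gap = S_t.mean(axis=(1, 2))                           # global average pooling of S_t -> 64-vector
W = rng.normal(scale=0.1, size=(concat.shape[0], gap.shape[0]))  # 1x1 conv as a 64 -> 192 linear map
attn = sigmoid(W @ gap)                               # attention vector, one weight per channel, in (0, 1)

weighted = concat * attn[:, None, None]               # screen the concatenated features channel by channel
assert weighted.shape == (192, 16, 16)
assert np.all((attn > 0) & (attn < 1))
```

The key point is that the attention vector is derived only from the current network's own feature S_t, so the current task decides how strongly to admit each channel of the combined (including cross-task) features.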
(4) for the attention vector in step (3), a sparse convolution module is provided that makes the attention vector sparse, further improving the generalization ability of the model;
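The patent does not detail the sparse convolution module, only its effect: driving the attention vector toward sparsity. As an illustration of that effect (an assumption, not the patent's actual module), a soft-thresholding step zeroes out small attention weights while shrinking the rest:

```python
import numpy as np

def soft_threshold(attn, tau=0.2):
    # Shrink all weights by tau and clamp at zero, so small weights become exactly 0.
    # Illustrative stand-in for the patent's (undisclosed) sparse convolution module.
    return np.sign(attn) * np.maximum(np.abs(attn) - tau, 0.0)

attn = np.array([0.05, 0.9, 0.15, 0.7, 0.3])
sparse = soft_threshold(attn)
# The two weights below the threshold are zeroed; the channels they gate are dropped entirely.
assert np.count_nonzero(sparse) < np.count_nonzero(attn)
```

Sparsifying the channel weights acts as a form of feature selection, which is one plausible reading of the claimed generalization benefit.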
(5) the final output of each network's deconvolution modules is supervised with ground truth to train the network; finally, softmax is applied to the prediction of the last deconvolution module of the decoding network to obtain the final prediction.
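The final softmax of step (5), sketched for a two-channel (background/salient) prediction map; the two-class layout and tensor shape are assumptions for illustration:

```python
import numpy as np

def softmax(logits, axis=0):
    z = logits - logits.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Two-channel logits from the last deconvolution module: (class, height, width).
logits = np.random.randn(2, 4, 4)
probs = softmax(logits, axis=0)
saliency_map = probs[1]                 # per-pixel probability of being salient

assert np.allclose(probs.sum(axis=0), 1.0)   # a proper distribution at every pixel
assert saliency_map.shape == (4, 4)
```

Taking the salient-class channel of the softmax output yields the final saliency map at the original image resolution.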

Claims (1)

1. A method for detecting a salient object based on multitask deep learning is characterized by comprising the following steps:
(1) modules are added to a VGG16 network to obtain a salient object detection task network and an object contour detection task network; each deconvolution module of the salient object detection network contains only a feature interaction module and a residual module based on contrasting semantic and local features, while each deconvolution module of the object contour detection network contains only a feature interaction module and basic convolution layers; the encoding part is a basic VGG16 network composed of several convolution modules that progressively downsample the image into high-level features; the decoding part consists of several deconvolution modules, each of which upsamples the features by a factor of two, progressively upsampling the highest-level encoder features to the original image size for task prediction;
(2) a residual module based on contrasting semantic and local features is used in the salient object detection task network; it extracts local features and semantic features separately, defined as:
F_out = F_in + (f_l(F_in; W_l) - f_c(F_in; W_c))
where F_in is the input feature of the residual module, F_out is its final output feature, f_l(·) denotes the local convolution operation with parameters W_l, and f_c(·) denotes the semantics-extracting convolution operation with parameters W_c; the local and semantic features are subtracted to obtain a contrast feature, which is added to the original feature to give the final output feature;
(3) to achieve good interaction between the two task networks, a feature interaction module is designed so that the salient object detection network and the object contour detection network promote each other; the feature interaction module is used only in the decoding part of each network; for the interaction of the two networks, they are trained alternately; when either network is trained, its feature interaction module takes four groups of features as input: the output feature S_t of the deconvolution module preceding the feature interaction module of the current network; S_t upsampled by a factor of two, denoted S_t^up; the encoder convolution-module output feature S_t^encoder of the same size as S_t^up; and the two-times-upsampled output C_t^up of the corresponding deconvolution module of the other task network; in the feature interaction module, the last three features are concatenated along the channel dimension; global average pooling is then applied to S_t to obtain an attention channel vector; a 1x1 convolution makes the length of this vector equal to the number of channels of the concatenated features; a sigmoid function squashes the vector values into the range (0, 1); finally, the attention vector weights each channel of the concatenated features to screen them, so that the features after interaction are those most useful for the current task, defined specifically as:
[Equation shown as an image in the original document.]
(4) for the attention vector in step (3), a sparse convolution module is provided that makes the attention vector sparse, further improving the generalization ability of the model;
(5) the final output of each network's deconvolution modules is supervised with ground truth to train the network; finally, softmax is applied to the prediction of the last deconvolution module of the decoding network to obtain the final prediction.
CN201910243220.XA 2019-03-28 2019-03-28 Salient object detection method based on multitask deep learning Active CN110020658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910243220.XA CN110020658B (en) 2019-03-28 2019-03-28 Salient object detection method based on multitask deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910243220.XA CN110020658B (en) 2019-03-28 2019-03-28 Salient object detection method based on multitask deep learning

Publications (2)

Publication Number Publication Date
CN110020658A CN110020658A (en) 2019-07-16
CN110020658B true CN110020658B (en) 2022-09-30

Family

ID=67190116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910243220.XA Active CN110020658B (en) 2019-03-28 2019-03-28 Salient object detection method based on multitask deep learning

Country Status (1)

Country Link
CN (1) CN110020658B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598610B (en) * 2019-09-02 2022-02-22 北京航空航天大学 Target significance detection method based on neural selection attention
CN113298748B (en) * 2020-02-21 2022-11-18 安徽大学 Image collaborative salient object detection model based on attention mechanism
CN112257526B (en) * 2020-10-10 2023-06-20 中国科学院深圳先进技术研究院 Action recognition method based on feature interactive learning and terminal equipment
CN113505634B (en) * 2021-05-24 2024-06-14 安徽大学 Optical remote sensing image salient target detection method of double-flow decoding cross-task interaction network
CN114494999B (en) * 2022-01-18 2022-11-15 西南交通大学 Double-branch combined target intensive prediction method and system

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10929977B2 (en) * 2016-08-25 2021-02-23 Intel Corporation Coupled multi-task fully convolutional networks using multi-scale contextual information and hierarchical hyper-features for semantic image segmentation
CN106909924B (en) * 2017-02-18 2020-08-28 北京工业大学 Remote sensing image rapid retrieval method based on depth significance
CN107133955B (en) * 2017-04-14 2019-08-09 大连理工大学 A kind of collaboration conspicuousness detection method combined at many levels
CN107240066A (en) * 2017-04-28 2017-10-10 天津大学 Image super-resolution rebuilding algorithm based on shallow-layer and deep layer convolutional neural networks
JP7023613B2 (en) * 2017-05-11 2022-02-22 キヤノン株式会社 Image recognition device and learning device
CN107688821B (en) * 2017-07-11 2021-08-06 西安电子科技大学 Cross-modal image natural language description method based on visual saliency and semantic attributes
CN107463892A (en) * 2017-07-27 2017-12-12 北京大学深圳研究生院 Pedestrian detection method in a kind of image of combination contextual information and multi-stage characteristics
CN107886117A (en) * 2017-10-30 2018-04-06 国家新闻出版广电总局广播科学研究院 The algorithm of target detection merged based on multi-feature extraction and multitask
CN108304765B (en) * 2017-12-11 2020-08-11 中国科学院自动化研究所 Multi-task detection device for face key point positioning and semantic segmentation
CN108960069A (en) * 2018-06-05 2018-12-07 天津大学 A method of the enhancing context for single phase object detector
CN109165660B (en) * 2018-06-20 2021-11-09 扬州大学 Significant object detection method based on convolutional neural network
CN109190626A (en) * 2018-07-27 2019-01-11 国家新闻出版广电总局广播科学研究院 A kind of semantic segmentation method of the multipath Fusion Features based on deep learning
CN109376576A (en) * 2018-08-21 2019-02-22 中国海洋大学 The object detection method for training network from zero based on the intensive connection of alternately update
CN109165697B (en) * 2018-10-12 2021-11-30 福州大学 Natural scene character detection method based on attention mechanism convolutional neural network
CN109447136A (en) * 2018-10-15 2019-03-08 方玉明 A kind of conspicuousness detection method for 360 degree of images
CN111428088B (en) * 2018-12-14 2022-12-13 腾讯科技(深圳)有限公司 Video classification method and device and server

Also Published As

Publication number Publication date
CN110020658A (en) 2019-07-16

Similar Documents

Publication Publication Date Title
CN110020658B (en) Salient object detection method based on multitask deep learning
CN109190752B (en) Image semantic segmentation method based on global features and local features of deep learning
CN108492319B (en) Moving target detection method based on deep full convolution neural network
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
Lin et al. Image manipulation detection by multiple tampering traces and edge artifact enhancement
CN112150450B (en) Image tampering detection method and device based on dual-channel U-Net model
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN111914654B (en) Text layout analysis method, device, equipment and medium
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
CN113283356B (en) Multistage attention scale perception crowd counting method
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN113936235A (en) Video saliency target detection method based on quality evaluation
CN112348809A (en) No-reference screen content image quality evaluation method based on multitask deep learning
CN115861756A (en) Earth background small target identification method based on cascade combination network
CN117557775A (en) Substation power equipment detection method and system based on infrared and visible light fusion
CN112036300A (en) Moving target detection method based on multi-scale space-time propagation layer
CN112800932B (en) Method for detecting remarkable ship target in offshore background and electronic equipment
CN112132867B (en) Remote sensing image change detection method and device
CN117315284A (en) Image tampering detection method based on irrelevant visual information suppression
CN117409244A (en) SCKConv multi-scale feature fusion enhanced low-illumination small target detection method
CN115457385A (en) Building change detection method based on lightweight network
CN110728316A (en) Classroom behavior detection method, system, device and storage medium
CN116229228A (en) Small target detection method based on center surrounding mechanism
CN115797684A (en) Infrared small target detection method and system based on context information
CN115223033A (en) Synthetic aperture sonar image target classification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant