CN114187230A - Camouflage object detection method based on two-stage optimization network - Google Patents
- Publication number
- CN114187230A (application CN202111243490.4A)
- Authority
- CN
- China
- Prior art keywords
- stage
- module
- layer
- network
- decoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention relates to the technical field of camouflaged object detection, and in particular to a camouflaged object detection method based on a two-stage optimization network. It aims to solve the problem of insufficient detection accuracy in existing camouflaged object detection techniques by providing a detection method based on multi-task learning: object boundary information is used as an auxiliary signal to guide the network to learn the difference between the texture of the camouflaged object and the background texture at the boundary, so that the network can better locate and segment the camouflaged object. The two-stage optimization network is divided into two stages. The first stage follows an encoder-decoder structure and uses ResNet50 as the backbone for feature extraction; it locates and identifies the camouflaged object to form a coarse map. The second stage uses a parallel decoder structure that takes the object edge as boundary information, prompting the network to attend to the object edge and refining the map generated in the first stage.
Description
Technical Field
The invention relates to the technical field of camouflage object detection, in particular to a camouflage object detection method based on a two-stage optimization network.
Background
With people's increasingly diverse demands on intelligent living, the application range of object detection has broadened, and camouflaged object detection is one of its important branches. It focuses on the relationship between objects and their surroundings, aiming to detect and segment camouflaged objects that "blend into" the surrounding environment. Camouflage is ubiquitous in human life and in nature, and is especially common among animals. When hunting prey or avoiding predators, many animals reduce the difference and contrast between themselves and their surroundings by changing their body color, shape, or behavior, thereby improving their chances of survival. These camouflage strategies typically work by confusing the observer's decision-making.
Biological studies have shown that the Human Visual System (HVS) is most sensitive to large areas and color features; it perceives objects mainly by observing the contrast between an object and its background. Consequently, the HVS may have difficulty identifying a camouflaged object because of its low contrast with the environment.
In some cases, however, camouflaged object identification is very necessary. Beyond the task of detecting animal camouflage itself, which can provide technical support for animal protection, there are many passive camouflage phenomena in daily life in which objects and backgrounds are highly similar: in the medical field, slight changes in background tissue with extremely high similarity may indicate a lesion; in the military field, detecting camouflage on the battlefield can reverse the course of a conflict. The development of this task is therefore of great importance.
In recent years, deep convolutional networks, with their strong feature representation capability, have been applied to a variety of computer vision tasks, and some existing camouflaged object detection methods build on them: Fan et al. propose to stratify the extracted features and then fuse and enhance the features of different levels to help acquire localization and edge information, thereby achieving accurate detection of camouflaged targets. Yan et al. split MirrorNet into an original-image segmentation stream and a mirror-image segmentation stream to find the visual differences between the original image and its flipped counterpart, so as to better locate the camouflaged object.
Although these methods are designed around the attributes of camouflaged objects, their edge processing leaves room for improvement. Therefore, the present invention further considers the boundary information of the camouflaged object, so that the model can better learn the differences between the camouflaged object and the environment at the boundary and thus locate and segment the camouflaged object more accurately.
Disclosure of Invention
The invention aims to solve the problem of insufficient detection accuracy in existing camouflaged object detection techniques by providing a detection method based on multi-task learning, which uses object boundary information as an auxiliary signal to guide the network to learn the difference between the texture of the camouflaged object and the background texture at the boundary, so that the network can better locate and segment the camouflaged object.
The two-stage optimization network is divided into two stages. The first stage follows an encoder-decoder structure and uses ResNet50 as the backbone for feature extraction; it locates and identifies the camouflaged object to form a coarse map. The second stage uses a parallel decoder structure that takes the object edge as boundary information, prompting the network to attend to the object edge and refining the map generated in the first stage.
Further, the first stage is a pre-feature-fusion stage; ResNet50 is selected as the backbone network to ensure that deep features can be extracted effectively.
The purpose of this stage is to obtain a coarse map of the camouflaged object. Based on considerations of computational efficiency and detection accuracy, the following two modules are proposed:
(1) channel attention module:
A channel attention mechanism is applied to the output of each layer of the encoder to retain the useful information in shallow features and reduce redundant information.
It aims to extract valid information and can be expressed as:

F_i = Attention(E_i)

where Attention denotes the channel attention module, F_i is the output of the i-th channel attention module counted from the bottom up, and E_i is the i-th encoding block of the encoding stage, numbered bottom-up to match (so that E_1 is the deepest, i.e. last, encoding block).
The channel attention module has 4 layers: the first is a 1 × 1 convolution layer that reduces the number of channels to 32; it is followed by two 3 × 3 convolution layers, each followed by normalization, after which the feature map still has 32 channels and an unchanged spatial size; finally, the output features are obtained through a ReLU function.
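As an illustration, the four-layer structure described above can be sketched in PyTorch. This is a minimal sketch, not the patented implementation: the class name, the choice of batch normalization as the normalization, and the input channel count are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class ChannelAttentionBlock(nn.Module):
    """Sketch of the 4-layer block described above: a 1x1 conv reducing
    the channel count to 32, two 3x3 convs (each followed by batch
    normalization, an assumed normalization variant) that keep 32
    channels and the spatial size, and a final ReLU."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, 32, kernel_size=1)
        self.conv1 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(32)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)          # 1x1 conv: channels -> 32
        x = self.bn1(self.conv1(x)) # first 3x3 conv + normalization
        x = self.bn2(self.conv2(x)) # second 3x3 conv + normalization
        return self.relu(x)         # final ReLU
```

Applied to the output of an encoder layer (e.g. a 256-channel ResNet50 stage), the block yields a 32-channel feature map of the same spatial size.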
(2) global feature and local feature fusion module:
This module is realized in the decoder stage, whose structure is almost symmetrical to that of the encoder. Each layer of the decoder comprises two 3 × 3 convolution layers, each followed by normalization and a ReLU function. The module also introduces cSE and sSE blocks to obtain more accurate detection results; these blocks better establish the dependencies between different channels and guide the network to attend to features related to the camouflaged object. In addition, a pyramid pooling module is applied to the output of the last encoder layer to obtain global features, and the input of each decoder layer is the combination of the output of the corresponding channel attention module and the upsampled output of the previous layer:

D_1 = GLFA(Cat(F_1, PPM(E_1)))
D_i = GLFA(Cat(F_i, Upsample(D_{i-1}))), i = 2, 3, 4

where GLFA represents a decoder block of the global feature and local feature fusion module, PPM the introduced pyramid pooling module, Cat the concatenation of feature maps, Upsample the upsampling operation, F_i the output of the i-th (bottom-up) channel attention module, and D_i the output of the i-th layer of the global feature and local feature fusion module.
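The per-layer fusion step described above (upsample the deeper decoder output, concatenate it with the channel-attention features, then apply two 3 × 3 convolutions with normalization and ReLU) can be sketched as follows. For brevity the cSE/sSE recalibration blocks mentioned in the text are omitted, and the channel widths and class name are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLFADecoderLayer(nn.Module):
    """Sketch of one decoder layer of the global/local fusion module:
    Cat(attention features, Upsample(previous decoder output)) followed
    by two 3x3 convs, each with normalization and ReLU. The cSE/sSE
    blocks from the text are omitted; widths are assumptions."""

    def __init__(self, in_channels: int, out_channels: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, attn_feat, deeper_feat):
        # Upsample the deeper decoder output to the resolution of the
        # channel-attention features, then fuse by concatenation (Cat).
        up = F.interpolate(deeper_feat, size=attn_feat.shape[2:],
                           mode='bilinear', align_corners=False)
        x = torch.cat([attn_feat, up], dim=1)
        x = torch.relu(self.bn1(self.conv1(x)))
        return torch.relu(self.bn2(self.conv2(x)))
```

With 32-channel attention features and a 32-channel deeper output, the concatenated input has 64 channels, so the layer would be built as `GLFADecoderLayer(64)`.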
Thus, the decoder can learn more comprehensive semantic information, and a prediction module is constructed to obtain the final result; it contains a 3 × 3 convolution layer, an ELU activation function, and a 1 × 1 convolution layer, and can be expressed as:

pred_c = Upsample(Conv_1x1(ELU(Conv_3x3(D_4))))

where ELU denotes the ELU activation function, Conv_3x3 and Conv_1x1 denote the two convolution layers applied here, Upsample denotes upsampling, and D_4 is the output of the fourth (bottom-up) layer of the fusion module, so that the prediction map and the final truth map have the same size.
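The prediction module just described (a 3 × 3 convolution, an ELU activation, a 1 × 1 convolution producing a single-channel map, and upsampling to the truth-map size) can be sketched as follows; the class name and the 32-channel input width are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictionHead(nn.Module):
    """Sketch of the prediction module: 3x3 conv -> ELU -> 1x1 conv
    producing one channel, then upsampling so the prediction matches
    the ground-truth map size."""

    def __init__(self, in_channels: int = 32):
        super().__init__()
        self.conv3x3 = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.conv1x1 = nn.Conv2d(in_channels, 1, 1)

    def forward(self, x: torch.Tensor, out_size) -> torch.Tensor:
        # Conv_1x1(ELU(Conv_3x3(x))), then Upsample to the truth-map size.
        x = self.conv1x1(F.elu(self.conv3x3(x)))
        return F.interpolate(x, size=out_size, mode='bilinear',
                             align_corners=False)
```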
Further, the second stage is an optimization stage, which aims to further distinguish the camouflaged object from the background using the edge information of the object. An edge truth map is introduced as supervision information at this stage, so that the model focuses more on the differences of the object at its edges. The stage proceeds as follows:
the optimization module uses a decoder structure which is the same as that of the global feature and local feature fusion module, and forms a parallel corresponding relation with the decoder structure, and the input of each layer in the optimization module is the combination of the output result of the corresponding channel attention module and the output result of the upper layer after up-sampling, so that the optimization module can further utilize the features in the previous feature fusion stage to play a role in restricting the extraction process of the features and simultaneously enable the feature reconstruction process to be more comprehensive, thereby refining the final prediction graph;
the final prediction at this stage can be expressed as:
where ELU denotes the ELU activation function, Conv denotes the two convolution layers applied here, Upesample denotes upsampling,the output of the encoder at this stage is shown from bottom to top layer 4 so that the prediction graph and the final edge true graph have the same size.
The loss of the two-stage optimization network is the sum of the prediction losses of the two decoders, with binary cross-entropy chosen as the loss function. The overall loss function is:

L_total = L_bce(pred_c, GT) + L_bce(pred_e, GT_edge)

where L_total represents the overall loss; L_bce(pred_c, GT) is the loss of the pre-fusion stage, with pred_c the prediction result of the pre-feature-fusion module and GT the truth map; and L_bce(pred_e, GT_edge) is the loss of the optimization module, with pred_e the prediction result of the edge optimization module and GT_edge the edge truth map computed from the truth map.
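The overall loss above can be sketched as a simple sum of two binary cross-entropy terms. The function name is an assumption, and the predictions are assumed to be raw logits, so the numerically stable logits variant of binary cross-entropy is used:

```python
import torch
import torch.nn.functional as F

def two_stage_loss(pred_c: torch.Tensor, gt: torch.Tensor,
                   pred_e: torch.Tensor, gt_edge: torch.Tensor) -> torch.Tensor:
    """Sketch of L_total = L_bce(pred_c, GT) + L_bce(pred_e, GT_edge):
    binary cross entropy on the coarse camouflage map plus binary cross
    entropy on the edge map, added together as described above."""
    loss_c = F.binary_cross_entropy_with_logits(pred_c, gt)        # pre-fusion stage
    loss_e = F.binary_cross_entropy_with_logits(pred_e, gt_edge)   # optimization stage
    return loss_c + loss_e
```

For zero logits and all-zero targets, each term equals log 2, so the total is 2 log 2 ≈ 1.386.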
Compared with the prior art, the invention has the beneficial effects that:
(1) Good performance: results on public camouflaged object detection datasets show that the method achieves the best effect on four different evaluation metrics;
(2) High efficiency: in the adopted framework, only the extracted useful features are fed into the decoding process, which greatly reduces the number of convolution operations and gives the method more practical application value.
Drawings
FIG. 1 is a schematic diagram of a model framework;
FIG. 2 is a schematic view of a channel attention module.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
The two-stage optimization network is divided into two stages. The first stage follows an encoder-decoder structure and uses ResNet50 as the backbone for feature extraction; it locates and identifies the camouflaged object to form a coarse map. The second stage uses a parallel decoder structure that takes the object edge as boundary information, prompting the network to attend to the object edge and refining the map generated in the first stage.
As a preferred scheme of the above embodiment, the first stage is a pre-feature-fusion stage; ResNet50 is selected as the backbone network to ensure that deep features can be extracted effectively.
The purpose of this stage is to obtain a coarse map of the camouflaged object. Based on considerations of computational efficiency and detection accuracy, the following two modules are proposed:
(1) channel attention module:
In a CNN, different channels respond to different semantics, and features at different levels contain differing degrees of detail and contextual information. In the feature extraction process of the ResNet50-based encoder, the output of a deep convolution covers a larger range of the original image but loses much detail information, while the output of a shallow layer retains detail information that is not all useful. Therefore, a channel attention mechanism is applied to the output of each layer of the encoder to retain the useful information in shallow features and reduce redundant information.
It aims to extract valid information and can be expressed as:

F_i = Attention(E_i)

where Attention denotes the channel attention module, F_i is the output of the i-th channel attention module counted from the bottom up, and E_i is the i-th encoding block of the encoding stage, numbered bottom-up to match (so that E_1 is the deepest, i.e. last, encoding block).
In addition, since the number of channels input to each decoder layer becomes 32 after the channel attention module, the number of parameters in the model is greatly reduced, the model size shrinks, and training and inference speed up. The channel attention module has 4 layers: the first is a 1 × 1 convolution layer that reduces the number of channels to 32; it is followed by two 3 × 3 convolution layers, each followed by normalization, after which the feature map still has 32 channels and an unchanged spatial size; finally, the output features are obtained through a ReLU function.
(2) global feature and local feature fusion module:
This module is realized in the decoder stage, whose structure is almost symmetrical to that of the encoder. Each layer of the decoder comprises two 3 × 3 convolution layers, each followed by normalization and a ReLU function. The module also introduces cSE and sSE blocks to obtain more accurate detection results; these blocks better establish the dependencies between different channels and guide the network to attend to features related to the camouflaged object. In addition, a pyramid pooling module is applied to the output of the last encoder layer to obtain global features, and the input of each decoder layer is the combination of the output of the corresponding channel attention module and the upsampled output of the previous layer:

D_1 = GLFA(Cat(F_1, PPM(E_1)))
D_i = GLFA(Cat(F_i, Upsample(D_{i-1}))), i = 2, 3, 4

where GLFA represents a decoder block of the global feature and local feature fusion module, PPM the introduced pyramid pooling module, Cat the concatenation of feature maps, Upsample the upsampling operation, F_i the output of the i-th (bottom-up) channel attention module, and D_i the output of the i-th layer of the global feature and local feature fusion module.
Thus, the decoder can learn more comprehensive semantic information, and a prediction module is constructed to obtain the final result; it contains a 3 × 3 convolution layer, an ELU activation function, and a 1 × 1 convolution layer, and can be expressed as:

pred_c = Upsample(Conv_1x1(ELU(Conv_3x3(D_4))))

where ELU denotes the ELU activation function, Conv_3x3 and Conv_1x1 denote the two convolution layers applied here, Upsample denotes upsampling, and D_4 is the output of the fourth (bottom-up) layer of the fusion module, so that the prediction map and the final truth map have the same size.
As a preferred solution of the above embodiment, the second stage is an optimization stage. Because the high similarity between the object and its environment makes the camouflaged object detection task challenging, this stage aims to further distinguish the camouflaged object from the background using the object's edge information. An edge truth map is introduced as supervision information at this stage, so that the model focuses more on the differences of the object at its edges. The stage proceeds as follows:
the optimization module uses a decoder structure which is the same as that of the global feature and local feature fusion module, and forms a parallel corresponding relation with the decoder structure, and the input of each layer in the optimization module is the combination of the output result of the corresponding channel attention module and the output result of the upper layer after up-sampling, so that the optimization module can further utilize the features in the previous feature fusion stage to play a role in restricting the extraction process of the features and simultaneously enable the feature reconstruction process to be more comprehensive, thereby refining the final prediction graph;
the final prediction at this stage can be expressed as:
where ELU denotes the ELU activation function, Conv denotes the two convolution layers applied here, Upesample denotes upsampling,representing the output of the encoder from the bottom up to layer 4 at this stage. So that the prediction map and the final edge true value map have the same size.
The loss of the two-stage optimization network is the sum of the prediction losses of the two decoders, with binary cross-entropy chosen as the loss function. The overall loss function is:

L_total = L_bce(pred_c, GT) + L_bce(pred_e, GT_edge)

where L_total represents the overall loss; L_bce(pred_c, GT) is the loss of the pre-fusion stage, with pred_c the prediction result of the pre-feature-fusion module and GT the truth map; and L_bce(pred_e, GT_edge) is the loss of the optimization module, with pred_e the prediction result of the edge optimization module and GT_edge the edge truth map computed from the truth map.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.
Claims (3)
1. A camouflaged object detection method based on a two-stage optimization network, characterized in that the two-stage optimization network is divided into two stages: the first stage follows an encoder-decoder structure and uses ResNet50 as the backbone for feature extraction, locating and identifying the camouflaged object to form a coarse map; the second stage uses a parallel decoder structure that takes the object edge as boundary information, prompting the network to attend to the object edge and refining the map generated in the first stage.
2. The camouflaged object detection method based on a two-stage optimization network according to claim 1, characterized in that the first stage is a pre-feature-fusion stage; ResNet50 is selected as the backbone network to ensure that deep features can be extracted effectively;
the purpose of this stage is to obtain a coarse map of the camouflaged object; based on considerations of computational efficiency and detection accuracy, the following two modules are proposed:
(1) channel attention module:
a channel attention mechanism is applied to the output of each layer of the encoder to retain the useful information in shallow features and reduce redundant information;
it aims to extract valid information and can be expressed as:

F_i = Attention(E_i)

where Attention denotes the channel attention module, F_i is the output of the i-th channel attention module counted from the bottom up, and E_i is the i-th encoding block of the encoding stage, numbered bottom-up to match;
the channel attention module has 4 layers: the first is a 1 × 1 convolution layer that reduces the number of channels to 32; it is followed by two 3 × 3 convolution layers, each followed by normalization, after which the feature map still has 32 channels and an unchanged spatial size; finally, the output features are obtained through a ReLU function;
(2) global feature and local feature fusion module:
this module is realized in the decoder stage, whose structure is almost symmetrical to that of the encoder; each layer of the decoder comprises two 3 × 3 convolution layers, each followed by normalization and a ReLU function; the module also introduces cSE and sSE blocks to obtain more accurate detection results, which better establish the dependencies between different channels and guide the network to attend to features related to the camouflaged object; in addition, a pyramid pooling module is applied to the output of the last encoder layer to obtain global features, and the input of each decoder layer is the combination of the output of the corresponding channel attention module and the upsampled output of the previous layer:

D_1 = GLFA(Cat(F_1, PPM(E_1)))
D_i = GLFA(Cat(F_i, Upsample(D_{i-1}))), i = 2, 3, 4

where GLFA represents a decoder block of the global feature and local feature fusion module, PPM the introduced pyramid pooling module, Cat the concatenation of feature maps, Upsample the upsampling operation, F_i the output of the i-th (bottom-up) channel attention module, and D_i the output of the i-th layer of the global feature and local feature fusion module;
thus, the decoder can learn more comprehensive semantic information, and a prediction module is constructed to obtain the final result; it contains a 3 × 3 convolution layer, an ELU activation function, and a 1 × 1 convolution layer, and can be expressed as:

pred_c = Upsample(Conv_1x1(ELU(Conv_3x3(D_4))))

where ELU denotes the ELU activation function, Conv_3x3 and Conv_1x1 denote the two convolution layers applied here, Upsample denotes upsampling, and D_4 is the output of the fourth (bottom-up) layer of the fusion module, so that the prediction map and the final truth map have the same size.
3. The camouflaged object detection method based on a two-stage optimization network according to claim 2, characterized in that the second stage is an optimization stage, which aims to further distinguish the camouflaged object from the background using the object's edge information; an edge truth map is introduced as supervision information at this stage, so that the model focuses more on the differences of the object at its edges; the stage proceeds as follows:
the optimization module uses the same decoder structure as the global feature and local feature fusion module and forms a parallel correspondence with it; the input of each layer of the optimization module is the combination of the output of the corresponding channel attention module and the upsampled output of the previous layer, so that the optimization module can further exploit the features from the preceding feature-fusion stage, constraining the feature extraction process while making the feature reconstruction process more comprehensive, thereby refining the final prediction map;
the final prediction at this stage can be expressed as:

pred_e = Upsample(Conv_1x1(ELU(Conv_3x3(R_4))))

where ELU denotes the ELU activation function, Conv_3x3 and Conv_1x1 denote the two convolution layers applied here, Upsample denotes upsampling, and R_4 is the output of the fourth (bottom-up) layer of the optimization decoder at this stage, so that the prediction map and the final edge truth map have the same size;
the loss of the two-stage optimization network is the sum of the prediction losses of the two decoders, with binary cross-entropy chosen as the loss function; the overall loss function is:

L_total = L_bce(pred_c, GT) + L_bce(pred_e, GT_edge)

where L_total represents the overall loss; L_bce(pred_c, GT) is the loss of the pre-fusion stage, with pred_c the prediction result of the pre-feature-fusion module and GT the truth map; and L_bce(pred_e, GT_edge) is the loss of the optimization module, with pred_e the prediction result of the edge optimization module and GT_edge the edge truth map computed from the truth map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111243490.4A CN114187230A (en) | 2021-10-25 | 2021-10-25 | Camouflage object detection method based on two-stage optimization network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114187230A true CN114187230A (en) | 2022-03-15 |
Family
ID=80601455
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111243490.4A Pending CN114187230A (en) | 2021-10-25 | 2021-10-25 | Camouflage object detection method based on two-stage optimization network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114187230A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115223018A (en) * | 2022-06-08 | 2022-10-21 | 东北石油大学 | Cooperative detection method and device for disguised object, electronic device and storage medium |
CN115631346A (en) * | 2022-11-11 | 2023-01-20 | 南京航空航天大学 | Disguised object detection method and system based on uncertainty modeling |
CN115631346B (en) * | 2022-11-11 | 2023-07-18 | 南京航空航天大学 | Uncertainty modeling-based camouflage object detection method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||