CN114187230A - Camouflage object detection method based on two-stage optimization network - Google Patents
- Publication number
- CN114187230A (application CN202111243490.4A)
- Authority
- CN
- China
- Prior art keywords
- stage
- module
- layer
- network
- decoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention relates to the technical field of camouflaged object detection, and in particular to a camouflaged object detection method based on a two-stage optimization network. It aims to solve the problem of insufficient detection accuracy in existing camouflaged object detection techniques by providing a detection method based on multi-task learning: object boundary information is used as an auxiliary signal to guide the network to learn the difference between the texture of the camouflaged object and the background texture at the boundary, so that the network can better locate and segment the camouflaged object. The two-stage optimization network is divided into two stages. The first stage follows an encoder-decoder structure and uses ResNet50 as the backbone for feature extraction; it locates and identifies the camouflaged object to form a coarse map. The second stage uses a parallel decoder structure that takes the object edge as boundary information, prompting the network to attend to the object edge and refining the map generated in the first stage.
Description
Technical Field
The invention relates to the technical field of camouflage object detection, in particular to a camouflage object detection method based on a two-stage optimization network.
Background
With people's increasingly diverse demands on intelligent living, the application range of object detection has broadened, and camouflaged object detection is one of its important branches. It focuses on the relationship between objects and their surroundings, aiming to detect and segment camouflaged objects that "blend into" the surrounding environment. Camouflage is ubiquitous in human life and in nature, and is especially common among animals. When hunting prey or avoiding predators, many animals reduce the difference and contrast between themselves and their surroundings by changing their body color, shape, or behavior, thereby improving their chances of survival. These camouflage strategies typically work by confusing the observer's decision-making.
Biological studies have shown that the Human Visual System (HVS) is most sensitive to large areas and color features; it perceives objects mainly by observing the contrast between an object and its background. Consequently, the HVS may have difficulty identifying a camouflaged object because of its low contrast with the environment.
In some cases, however, camouflaged object identification is very necessary. Beyond the task of detecting animal camouflage itself, which can provide technical support for animal protection, there are many passive camouflage phenomena in daily life in which objects and backgrounds are highly similar: in the medical field, slight changes in background tissue with extremely high similarity may indicate a lesion; in the military field, detecting camouflage on the battlefield can reverse the course of a conflict. The development of this task is therefore of great importance.
In recent years, deep convolutional networks, with their strong feature representation capability, have been applied to a variety of computer vision tasks, and some existing camouflaged object detection methods build on them: Fan et al. propose to stratify the extracted features and then fuse and enhance the features of different levels to help acquire localization and edge information, thereby achieving accurate detection of camouflaged targets. Yan et al. split MirrorNet into an original-image segmentation stream and a mirror-image segmentation stream to find the visual differences between the original image and its flipped counterpart, so as to better locate the camouflaged object.
Although these methods are designed around the attributes of camouflaged objects, their edge processing leaves room for improvement. Therefore, the present invention further considers the boundary information of the camouflaged object, so that the model can better learn the differences between the camouflaged object and the environment at the boundary and thus locate and segment the camouflaged object more accurately.
Disclosure of Invention
The invention aims to solve the problem of insufficient detection accuracy in existing camouflaged object detection techniques by providing a detection method based on multi-task learning, which uses object boundary information as an auxiliary signal to guide the network to learn the difference between the texture of the camouflaged object and the background texture at the boundary, so that the network can better locate and segment the camouflaged object.
The two-stage optimization network is divided into two stages. The first stage follows an encoder-decoder structure and uses ResNet50 as the backbone for feature extraction; it locates and identifies the camouflaged object to form a coarse map. The second stage uses a parallel decoder structure that takes the object edge as boundary information, prompting the network to attend to the object edge and refining the map generated in the first stage.
Further, the first stage is a pre-feature-fusion stage; ResNet50 is selected as the backbone network to ensure that deep features can be extracted effectively.
The purpose of this stage is to obtain a coarse map of the camouflaged object. Based on considerations of computational efficiency and detection accuracy, the following two modules are proposed:
(1) channel attention module:
A channel attention mechanism is applied to the output of each layer of the encoder to retain the useful information in shallow features and reduce redundant information.
It aims to extract valid information and can be expressed as:

F_i = Attention(E_i)

where Attention denotes the channel attention module, F_i is the output of the i-th channel attention module counted from the bottom up, and E_i is the i-th encoding block of the encoding stage, numbered bottom-up to match (so that E_1 is the deepest, i.e. last, encoding block).
The channel attention module has 4 layers: the first is a 1 × 1 convolution layer that reduces the number of channels to 32; it is followed by two 3 × 3 convolution layers, each followed by normalization, after which the feature map still has 32 channels and an unchanged spatial size; finally, the output features are obtained through a ReLU function.
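As an illustration, the four-layer structure described above can be sketched in PyTorch. This is a minimal sketch, not the patented implementation: the class name, the choice of batch normalization as the normalization, and the input channel count are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class ChannelAttentionBlock(nn.Module):
    """Sketch of the 4-layer block described above: a 1x1 conv reducing
    the channel count to 32, two 3x3 convs (each followed by batch
    normalization, an assumed normalization variant) that keep 32
    channels and the spatial size, and a final ReLU."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, 32, kernel_size=1)
        self.conv1 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(32)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)          # 1x1 conv: channels -> 32
        x = self.bn1(self.conv1(x)) # first 3x3 conv + normalization
        x = self.bn2(self.conv2(x)) # second 3x3 conv + normalization
        return self.relu(x)         # final ReLU
```

Applied to the output of an encoder layer (e.g. a 256-channel ResNet50 stage), the block yields a 32-channel feature map of the same spatial size.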
(2) global feature and local feature fusion module:
This module is realized in the decoder stage, whose structure is almost symmetrical to that of the encoder. Each layer of the decoder comprises two 3 × 3 convolution layers, each followed by normalization and a ReLU function. The module also introduces cSE and sSE blocks to obtain more accurate detection results; these blocks better establish the dependencies between different channels and guide the network to attend to features related to the camouflaged object. In addition, a pyramid pooling module is applied to the output of the last encoder layer to obtain global features, and the input of each decoder layer is the combination of the output of the corresponding channel attention module and the upsampled output of the previous layer:

D_1 = GLFA(Cat(F_1, PPM(E_1)))
D_i = GLFA(Cat(F_i, Upsample(D_{i-1}))), i = 2, 3, 4

where GLFA represents a decoder block of the global feature and local feature fusion module, PPM the introduced pyramid pooling module, Cat the concatenation of feature maps, Upsample the upsampling operation, F_i the output of the i-th (bottom-up) channel attention module, and D_i the output of the i-th layer of the global feature and local feature fusion module.
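The per-layer fusion step described above (upsample the deeper decoder output, concatenate it with the channel-attention features, then apply two 3 × 3 convolutions with normalization and ReLU) can be sketched as follows. For brevity the cSE/sSE recalibration blocks mentioned in the text are omitted, and the channel widths and class name are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLFADecoderLayer(nn.Module):
    """Sketch of one decoder layer of the global/local fusion module:
    Cat(attention features, Upsample(previous decoder output)) followed
    by two 3x3 convs, each with normalization and ReLU. The cSE/sSE
    blocks from the text are omitted; widths are assumptions."""

    def __init__(self, in_channels: int, out_channels: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, attn_feat, deeper_feat):
        # Upsample the deeper decoder output to the resolution of the
        # channel-attention features, then fuse by concatenation (Cat).
        up = F.interpolate(deeper_feat, size=attn_feat.shape[2:],
                           mode='bilinear', align_corners=False)
        x = torch.cat([attn_feat, up], dim=1)
        x = torch.relu(self.bn1(self.conv1(x)))
        return torch.relu(self.bn2(self.conv2(x)))
```

With 32-channel attention features and a 32-channel deeper output, the concatenated input has 64 channels, so the layer would be built as `GLFADecoderLayer(64)`.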
Thus, the decoder can learn more comprehensive semantic information, and a prediction module is constructed to obtain the final result; it contains a 3 × 3 convolution layer, an ELU activation function, and a 1 × 1 convolution layer, and can be expressed as:

pred_c = Upsample(Conv_1x1(ELU(Conv_3x3(D_4))))

where ELU denotes the ELU activation function, Conv_3x3 and Conv_1x1 denote the two convolution layers applied here, Upsample denotes upsampling, and D_4 is the output of the fourth (bottom-up) layer of the fusion module, so that the prediction map and the final truth map have the same size.
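The prediction module just described (a 3 × 3 convolution, an ELU activation, a 1 × 1 convolution producing a single-channel map, and upsampling to the truth-map size) can be sketched as follows; the class name and the 32-channel input width are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictionHead(nn.Module):
    """Sketch of the prediction module: 3x3 conv -> ELU -> 1x1 conv
    producing one channel, then upsampling so the prediction matches
    the ground-truth map size."""

    def __init__(self, in_channels: int = 32):
        super().__init__()
        self.conv3x3 = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.conv1x1 = nn.Conv2d(in_channels, 1, 1)

    def forward(self, x: torch.Tensor, out_size) -> torch.Tensor:
        # Conv_1x1(ELU(Conv_3x3(x))), then Upsample to the truth-map size.
        x = self.conv1x1(F.elu(self.conv3x3(x)))
        return F.interpolate(x, size=out_size, mode='bilinear',
                             align_corners=False)
```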
Further, the second stage is an optimization stage, which aims to further distinguish the camouflaged object from the background using the edge information of the object. An edge truth map is introduced as supervision information at this stage, so that the model focuses more on the differences of the object at its edges. The stage proceeds as follows:
the optimization module uses a decoder structure which is the same as that of the global feature and local feature fusion module, and forms a parallel corresponding relation with the decoder structure, and the input of each layer in the optimization module is the combination of the output result of the corresponding channel attention module and the output result of the upper layer after up-sampling, so that the optimization module can further utilize the features in the previous feature fusion stage to play a role in restricting the extraction process of the features and simultaneously enable the feature reconstruction process to be more comprehensive, thereby refining the final prediction graph;
the final prediction at this stage can be expressed as:
where ELU denotes the ELU activation function, Conv denotes the two convolution layers applied here, Upesample denotes upsampling,the output of the encoder at this stage is shown from bottom to top layer 4 so that the prediction graph and the final edge true graph have the same size.
The loss of the two-stage optimization network is the sum of the prediction losses of the two decoders, with binary cross-entropy chosen as the loss function. The overall loss function is:

L_total = L_bce(pred_c, GT) + L_bce(pred_e, GT_edge)

where L_total represents the overall loss; L_bce(pred_c, GT) is the loss of the pre-fusion stage, with pred_c the prediction result of the pre-feature-fusion module and GT the truth map; and L_bce(pred_e, GT_edge) is the loss of the optimization module, with pred_e the prediction result of the edge optimization module and GT_edge the edge truth map computed from the truth map.
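The overall loss above can be sketched as a simple sum of two binary cross-entropy terms. The function name is an assumption, and the predictions are assumed to be raw logits, so the numerically stable logits variant of binary cross-entropy is used:

```python
import torch
import torch.nn.functional as F

def two_stage_loss(pred_c: torch.Tensor, gt: torch.Tensor,
                   pred_e: torch.Tensor, gt_edge: torch.Tensor) -> torch.Tensor:
    """Sketch of L_total = L_bce(pred_c, GT) + L_bce(pred_e, GT_edge):
    binary cross entropy on the coarse camouflage map plus binary cross
    entropy on the edge map, added together as described above."""
    loss_c = F.binary_cross_entropy_with_logits(pred_c, gt)        # pre-fusion stage
    loss_e = F.binary_cross_entropy_with_logits(pred_e, gt_edge)   # optimization stage
    return loss_c + loss_e
```

For zero logits and all-zero targets, each term equals log 2, so the total is 2 log 2 ≈ 1.386.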
Compared with the prior art, the invention has the beneficial effects that:
(1) Good performance: results on public camouflaged object detection datasets show that the method achieves the best effect on four different evaluation metrics;
(2) High efficiency: in the adopted framework, only the extracted useful features are fed into the decoding process, which greatly reduces the number of convolution operations and gives the method more practical application value.
Drawings
FIG. 1 is a schematic diagram of a model framework;
FIG. 2 is a schematic view of a channel attention module.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
The two-stage optimization network is divided into two stages. The first stage follows an encoder-decoder structure and uses ResNet50 as the backbone for feature extraction; it locates and identifies the camouflaged object to form a coarse map. The second stage uses a parallel decoder structure that takes the object edge as boundary information, prompting the network to attend to the object edge and refining the map generated in the first stage.
As a preferred scheme of the above embodiment, the first stage is a pre-feature-fusion stage; ResNet50 is selected as the backbone network to ensure that deep features can be extracted effectively.
The purpose of this stage is to obtain a coarse map of the camouflaged object. Based on considerations of computational efficiency and detection accuracy, the following two modules are proposed:
(1) channel attention module:
In a CNN, different channels respond to different semantics, and features at different levels contain differing degrees of detail and contextual information. In the feature extraction process of the ResNet50-based encoder, the output of a deep convolution covers a larger range of the original image but loses much detail information, while the output of a shallow layer retains detail information that is not all useful. Therefore, a channel attention mechanism is applied to the output of each layer of the encoder to retain the useful information in shallow features and reduce redundant information.
It aims to extract valid information and can be expressed as:

F_i = Attention(E_i)

where Attention denotes the channel attention module, F_i is the output of the i-th channel attention module counted from the bottom up, and E_i is the i-th encoding block of the encoding stage, numbered bottom-up to match (so that E_1 is the deepest, i.e. last, encoding block).
In addition, since the number of channels input to each decoder layer becomes 32 after the channel attention module, the number of parameters in the model is greatly reduced, the model size shrinks, and training and inference speed up. The channel attention module has 4 layers: the first is a 1 × 1 convolution layer that reduces the number of channels to 32; it is followed by two 3 × 3 convolution layers, each followed by normalization, after which the feature map still has 32 channels and an unchanged spatial size; finally, the output features are obtained through a ReLU function.
(2) global feature and local feature fusion module:
This module is realized in the decoder stage, whose structure is almost symmetrical to that of the encoder. Each layer of the decoder comprises two 3 × 3 convolution layers, each followed by normalization and a ReLU function. The module also introduces cSE and sSE blocks to obtain more accurate detection results; these blocks better establish the dependencies between different channels and guide the network to attend to features related to the camouflaged object. In addition, a pyramid pooling module is applied to the output of the last encoder layer to obtain global features, and the input of each decoder layer is the combination of the output of the corresponding channel attention module and the upsampled output of the previous layer:

D_1 = GLFA(Cat(F_1, PPM(E_1)))
D_i = GLFA(Cat(F_i, Upsample(D_{i-1}))), i = 2, 3, 4

where GLFA represents a decoder block of the global feature and local feature fusion module, PPM the introduced pyramid pooling module, Cat the concatenation of feature maps, Upsample the upsampling operation, F_i the output of the i-th (bottom-up) channel attention module, and D_i the output of the i-th layer of the global feature and local feature fusion module.
Thus, the decoder can learn more comprehensive semantic information, and a prediction module is constructed to obtain the final result; it contains a 3 × 3 convolution layer, an ELU activation function, and a 1 × 1 convolution layer, and can be expressed as:

pred_c = Upsample(Conv_1x1(ELU(Conv_3x3(D_4))))

where ELU denotes the ELU activation function, Conv_3x3 and Conv_1x1 denote the two convolution layers applied here, Upsample denotes upsampling, and D_4 is the output of the fourth (bottom-up) layer of the fusion module, so that the prediction map and the final truth map have the same size.
As a preferred solution of the above embodiment, the second stage is an optimization stage. Because the high similarity between the object and its environment makes the camouflaged object detection task challenging, this stage aims to further distinguish the camouflaged object from the background using the object's edge information. An edge truth map is introduced as supervision information at this stage, so that the model focuses more on the differences of the object at its edges. The stage proceeds as follows:
the optimization module uses a decoder structure which is the same as that of the global feature and local feature fusion module, and forms a parallel corresponding relation with the decoder structure, and the input of each layer in the optimization module is the combination of the output result of the corresponding channel attention module and the output result of the upper layer after up-sampling, so that the optimization module can further utilize the features in the previous feature fusion stage to play a role in restricting the extraction process of the features and simultaneously enable the feature reconstruction process to be more comprehensive, thereby refining the final prediction graph;
the final prediction at this stage can be expressed as:
where ELU denotes the ELU activation function, Conv denotes the two convolution layers applied here, Upesample denotes upsampling,representing the output of the encoder from the bottom up to layer 4 at this stage. So that the prediction map and the final edge true value map have the same size.
The loss of the two-stage optimization network is the sum of the prediction losses of the two decoders, with binary cross-entropy chosen as the loss function. The overall loss function is:

L_total = L_bce(pred_c, GT) + L_bce(pred_e, GT_edge)

where L_total represents the overall loss; L_bce(pred_c, GT) is the loss of the pre-fusion stage, with pred_c the prediction result of the pre-feature-fusion module and GT the truth map; and L_bce(pred_e, GT_edge) is the loss of the optimization module, with pred_e the prediction result of the edge optimization module and GT_edge the edge truth map computed from the truth map.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.
Claims (3)
1. A camouflaged object detection method based on a two-stage optimization network, characterized in that the two-stage optimization network is divided into two stages: the first stage follows an encoder-decoder structure and uses ResNet50 as the backbone for feature extraction, locating and identifying the camouflaged object to form a coarse map; the second stage uses a parallel decoder structure that takes the object edge as boundary information, prompting the network to attend to the object edge and refining the map generated in the first stage.
2. The camouflaged object detection method based on a two-stage optimization network according to claim 1, characterized in that the first stage is a pre-feature-fusion stage; ResNet50 is selected as the backbone network to ensure that deep features can be extracted effectively;
the purpose of this stage is to obtain a coarse map of the camouflaged object; based on considerations of computational efficiency and detection accuracy, the following two modules are proposed:
(1) channel attention module:
a channel attention mechanism is applied to the output of each layer of the encoder to retain the useful information in shallow features and reduce redundant information;
it aims to extract valid information and can be expressed as:

F_i = Attention(E_i)

where Attention denotes the channel attention module, F_i is the output of the i-th channel attention module counted from the bottom up, and E_i is the i-th encoding block of the encoding stage, numbered bottom-up to match;
the channel attention module has 4 layers: the first is a 1 × 1 convolution layer that reduces the number of channels to 32; it is followed by two 3 × 3 convolution layers, each followed by normalization, after which the feature map still has 32 channels and an unchanged spatial size; finally, the output features are obtained through a ReLU function;
(2) global feature and local feature fusion module:
this module is realized in the decoder stage, whose structure is almost symmetrical to that of the encoder; each layer of the decoder comprises two 3 × 3 convolution layers, each followed by normalization and a ReLU function; the module also introduces cSE and sSE blocks to obtain more accurate detection results, which better establish the dependencies between different channels and guide the network to attend to features related to the camouflaged object; in addition, a pyramid pooling module is applied to the output of the last encoder layer to obtain global features, and the input of each decoder layer is the combination of the output of the corresponding channel attention module and the upsampled output of the previous layer:

D_1 = GLFA(Cat(F_1, PPM(E_1)))
D_i = GLFA(Cat(F_i, Upsample(D_{i-1}))), i = 2, 3, 4

where GLFA represents a decoder block of the global feature and local feature fusion module, PPM the introduced pyramid pooling module, Cat the concatenation of feature maps, Upsample the upsampling operation, F_i the output of the i-th (bottom-up) channel attention module, and D_i the output of the i-th layer of the global feature and local feature fusion module;
thus, the decoder can learn more comprehensive semantic information, and a prediction module is constructed to obtain the final result; it contains a 3 × 3 convolution layer, an ELU activation function, and a 1 × 1 convolution layer, and can be expressed as:

pred_c = Upsample(Conv_1x1(ELU(Conv_3x3(D_4))))

where ELU denotes the ELU activation function, Conv_3x3 and Conv_1x1 denote the two convolution layers applied here, Upsample denotes upsampling, and D_4 is the output of the fourth (bottom-up) layer of the fusion module, so that the prediction map and the final truth map have the same size.
3. The camouflaged object detection method based on a two-stage optimization network according to claim 2, characterized in that the second stage is an optimization stage, which aims to further distinguish the camouflaged object from the background using the object's edge information; an edge truth map is introduced as supervision information at this stage, so that the model focuses more on the differences of the object at its edges; the stage proceeds as follows:
the optimization module uses the same decoder structure as the global feature and local feature fusion module and forms a parallel correspondence with it; the input of each layer of the optimization module is the combination of the output of the corresponding channel attention module and the upsampled output of the previous layer, so that the optimization module can further exploit the features from the preceding feature-fusion stage, constraining the feature extraction process while making the feature reconstruction process more comprehensive, thereby refining the final prediction map;
the final prediction at this stage can be expressed as:

pred_e = Upsample(Conv_1x1(ELU(Conv_3x3(R_4))))

where ELU denotes the ELU activation function, Conv_3x3 and Conv_1x1 denote the two convolution layers applied here, Upsample denotes upsampling, and R_4 is the output of the fourth (bottom-up) layer of the optimization decoder at this stage, so that the prediction map and the final edge truth map have the same size;
the loss of the two-stage optimization network is the sum of the prediction losses of the two decoders, with binary cross-entropy chosen as the loss function; the overall loss function is:

L_total = L_bce(pred_c, GT) + L_bce(pred_e, GT_edge)

where L_total represents the overall loss; L_bce(pred_c, GT) is the loss of the pre-fusion stage, with pred_c the prediction result of the pre-feature-fusion module and GT the truth map; and L_bce(pred_e, GT_edge) is the loss of the optimization module, with pred_e the prediction result of the edge optimization module and GT_edge the edge truth map computed from the truth map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111243490.4A CN114187230A (en) | 2021-10-25 | 2021-10-25 | Camouflage object detection method based on two-stage optimization network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114187230A true CN114187230A (en) | 2022-03-15 |
Family
ID=80601455
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111243490.4A Pending CN114187230A (en) | 2021-10-25 | 2021-10-25 | Camouflage object detection method based on two-stage optimization network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114187230A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115223018A (en) * | 2022-06-08 | 2022-10-21 | 东北石油大学 | Cooperative detection method and device for disguised object, electronic device and storage medium |
CN115631346A (en) * | 2022-11-11 | 2023-01-20 | 南京航空航天大学 | Disguised object detection method and system based on uncertainty modeling |
CN115631346B (en) * | 2022-11-11 | 2023-07-18 | 南京航空航天大学 | Uncertainty modeling-based camouflage object detection method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||