CN112348870B - Significance target detection method based on residual error fusion - Google Patents


Info

Publication number
CN112348870B
CN112348870B (application CN202011235626.2A)
Authority
CN
China
Prior art keywords
feature
module
features
rgb
fusion
Prior art date
Legal status
Active
Application number
CN202011235626.2A
Other languages
Chinese (zh)
Other versions
CN112348870A (en)
Inventor
张立和
金玉
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202011235626.2A priority Critical patent/CN112348870B/en
Publication of CN112348870A publication Critical patent/CN112348870A/en
Application granted granted Critical
Publication of CN112348870B publication Critical patent/CN112348870B/en

Classifications

    • G06T 7/55: Depth or shape recovery from multiple images (G06T 7/00 Image analysis; G06T 7/50 Depth or shape recovery)
    • G06F 18/253: Fusion techniques of extracted features (G06F 18/00 Pattern recognition; G06F 18/20 Analysing; G06F 18/25 Fusion techniques)
    • G06N 3/045: Combinations of networks (G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/08: Learning methods (G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks)
    • G06T 3/4007: Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation (G06T 3/00 Geometric image transformations in the plane of the image; G06T 3/40 Scaling of whole images or parts thereof)
    • G06V 10/40: Extraction of image or video features (G06V 10/00 Arrangements for image or video recognition or understanding)
    • G06T 2207/20016: Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform (G06T 2207/00 Indexing scheme for image analysis or image enhancement; G06T 2207/20 Special algorithmic details)
    • G06V 2201/07: Target detection (G06V 2201/00 Indexing scheme relating to image or video recognition or understanding)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of artificial intelligence and provides a salient object detection method based on residual fusion. The method first constructs a saliency detection model, extracts multi-level RGB image features and depth image features through a dual-stream feature extraction module, further extracts deep features with residual modules, and uses fusion modules to gradually fuse the features from the RGB feature extraction branch with those from the corresponding previous level, so as to train and obtain the final model. The invention realizes end-to-end saliency prediction with low model complexity and can fully and effectively exploit RGB image information and depth image information to predict salient regions.

Description

Significance target detection method based on residual error fusion
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to deep learning, and particularly relates to an image saliency detection method.
Background
Salient object detection is an important first step for a computer to understand its surrounding environment. Its task is to enable a computer to mimic the human attention mechanism and detect the regions of an image that attract attention. These attractive regions contain most of the visual information in the image. By screening out the foreground regions that carry the main visual information, subsequent image-understanding steps can obtain cleaner and more accurate content from the image, and the computation and storage spent on the image background can be reduced, so that the overall performance of subsequent image understanding improves. Usually, one only focuses on the areas of an image that are most attractive to human eyes, i.e. the foreground areas or salient objects, while ignoring the background areas. Therefore, computers are used to simulate the human visual system for saliency detection.
However, most existing deep-learning-based saliency detection methods target RGB images only and rely solely on color images while ignoring the corresponding depth information; this limits the accuracy and efficiency of saliency detection, especially when foreground and background are hard to distinguish, which is why RGB-D saliency detection emerged. RGB-D saliency detection aims to accurately detect salient objects from an image with the aid of its depth image. Although some progress has been made in RGB-D saliency detection, there is still large room for improvement. First, although devices such as the Kinect and light-field cameras make depth images easy to acquire, they introduce a certain amount of noise, and how to design a better algorithm to fit a model under this condition deserves careful consideration. Second, deep-learning-based saliency detection algorithms commonly face the problem of how to better fuse RGB information and depth information: the RGB image contains a large amount of information such as color and texture, while the depth image contains rich geometric and edge information that complements the RGB image; how to better combine the two so that they complement each other and highlight salient regions more accurately is a problem worth considering at present.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to make up for the shortcomings of existing methods, an RGB-D image saliency detection method based on residual fusion is provided, with the aim of achieving higher model accuracy.
The technical scheme of the invention is as follows:
a salient object detection method based on residual fusion comprises the following steps:
(1) constructing a significance detection model
The saliency detection model comprises a dual-stream feature extraction module, a multi-scale feature pyramid pooling module, a residual fusion module and a parallel upsampling module;
(2) carrying out channel copying on an original depth image corresponding to the RGB image I to obtain a depth image D;
(3) as shown in fig. 1, an RGB image I and its depth image D are input into the saliency detection model, and multi-level RGB image features {Ii, i = 1,2,3,4,5} and multi-level depth image features {Di, i = 1,2,3,4,5} are extracted by the RGB image feature extraction branch (the top row in fig. 1) and the depth image feature extraction branch (the bottom row in fig. 1) of the dual-stream feature extraction module, respectively;
(4) a multi-scale feature pyramid pooling module is added at the final stage of the RGB image feature extraction branch, deeper features are further extracted through five residual fusion modules (res-fuse in fig. 1), and parallel upsampling is performed step by step to obtain the final saliency prediction result P_final;
(5) The multi-scale feature pyramid pooling module comprises six sub-branches, as shown in fig. 2, and is used to obtain context information of the input feature data: the first sub-branch adopts a 1 × 1 convolutional layer; the second, third, and fourth sub-branches adopt dilated (hole) convolutions with dilation rates of 3, 5, and 7, respectively; the fifth sub-branch adopts global average pooling to obtain a 1 × 1 feature representation; and the sixth sub-branch connects the input feature data directly to the output through a skip connection; the first four branches further strengthen the feature expression with the 1 × 1 convolutional layer and the dilated convolutions while keeping the feature size and the number of channels unchanged; the feature representations obtained by these convolutions are then upsampled to the size of the input feature using bilinear interpolation; finally, the six sub-branches are combined by channel-wise concatenation to obtain the multi-scale feature pyramid pooled representation of the input feature data;
(6) the residual fusion module is used to fuse the branch features {Ii, Di, i = 1,2,3,4,5} from the feature extraction module and is defined as follows:
res_fuse(I5, D5) = F3(P(I5)) + F1(D5)
res_fuse(Ii, Di) = F3(C(Ii, Up(res_fuse(Ii+1, Di+1)))) + F1(Di),  i = 1, 2, 3, 4
where P(·) denotes the multi-scale feature pyramid pooling module, F3(·) denotes three consecutive convolution and ReLU operations, F1(·) denotes one convolution and ReLU operation, and + denotes element-wise addition;
wherein res_fuse(·) denotes residual fusion, Up(·) denotes parallel upsampling, and C(·) denotes fusion of two inputs along the channel direction; for residual fusion, two cases are distinguished: when the input is the last stage of the feature extraction module, I5 is passed directly through the multi-scale feature pyramid pooling module and used as the RGB feature; this feature, after three consecutive convolution and ReLU operations, is added element-wise to the feature obtained from D5 after one convolution and ReLU operation, giving the residual fusion result; otherwise, for the i-th level RGB image feature Ii and the i-th level depth image feature Di of the feature extraction module, the residual fusion result obtained from Ii+1 and Di+1 is first parallel-upsampled and then fused with Ii along the channel direction to serve as the RGB feature; this feature, after three consecutive convolution and ReLU operations, is added element-wise to the feature obtained from Di after one convolution and ReLU operation, giving the residual fusion result;
(7) the parallel upsampling comprises four sub-branches, as shown in fig. 4, which use convolutional layers with different receptive fields and are intended to capture different local structures; the responses produced by the four convolutional layers are then concatenated into a tensor feature of size H × W × 2C; to rearrange this tensor into a spatially larger feature, the 2C channels are split in units of C/2 and re-stitched along the H direction and the W direction respectively, finally yielding a feature of size 2H × 2W × C/2;
(8) the weight parameters are initialized with a VGG-16 model pre-trained on ImageNet; in the model training phase, the cross-entropy loss function is used as the objective and optimized with the Adam algorithm, with momentum set to 0.9, weight decay set to 0.1, base learning rate set to 1 × 10^-6, and batch size set to 1.
The invention has the beneficial effects that: the method makes full use of the complementary information contained in the RGB image and the corresponding depth image, and accurately predicts the salient regions in an RGB-D image by means of residual fusion. In addition, the feature aggregation module and a reasonable upsampling scheme aggregate features of different scales, so that end-to-end saliency prediction can be realized while fully and effectively utilizing RGB image information and depth image information.
Drawings
Fig. 1 is a framework diagram of the RGB-D saliency detection method based on residual fusion, where the top row represents the RGB image feature extraction branch and the bottom row represents the depth image feature extraction branch;
FIG. 2 is a schematic diagram of a multi-scale feature pyramid pooling module;
FIG. 3 is a schematic diagram of a residual fusion module;
fig. 4 is a schematic diagram of a parallel upsampling module.
Detailed Description
The following further describes the specific embodiments of the present invention with reference to the drawings and technical solutions.
The invention is implemented as follows:
(1) constructing significance detection model
The saliency detection model comprises a dual-stream feature extraction module, a multi-scale feature pyramid pooling module, a residual fusion module and a parallel upsampling module.
(2) The depth image corresponding to the RGB image is channel-copied into a three-channel image, giving the depth image D.
(3) The RGB image I and its depth image D are input into the saliency detection model, and multi-level RGB image features {Ii, i = 1,2,3,4,5} and depth image features {Di, i = 1,2,3,4,5} are extracted by the RGB image feature extraction branch (RGB encoder) and the depth image feature extraction branch (depth encoder) of the dual-stream feature extraction module, respectively.
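Steps (2) and (3) can be illustrated with the following minimal PyTorch sketch. It assumes a torchvision VGG-16 backbone split into five convolutional stages; the framework, the stage boundaries, and the input resolution are illustrative assumptions, not details fixed by the patent text.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGGStream(nn.Module):
    # One encoder branch: VGG-16 convolutional layers split into five stages.
    def __init__(self):
        super().__init__()
        feats = vgg16(weights="IMAGENET1K_V1").features   # ImageNet-pretrained initialization
        # assumed stage split: conv1_x .. conv5_x, final max-pool dropped
        self.stages = nn.ModuleList([feats[:4], feats[4:9], feats[9:16], feats[16:23], feats[23:30]])

    def forward(self, x):
        outs = []
        for stage in self.stages:
            x = stage(x)
            outs.append(x)
        return outs                                        # five feature levels

rgb_encoder, depth_encoder = VGGStream(), VGGStream()
rgb = torch.randn(1, 3, 256, 256)                          # RGB image I
depth_raw = torch.randn(1, 1, 256, 256)                    # original single-channel depth map
depth = depth_raw.repeat(1, 3, 1, 1)                       # step (2): channel copy -> three-channel depth image D
I_feats = rgb_encoder(rgb)                                 # {Ii, i = 1..5}
D_feats = depth_encoder(depth)                             # {Di, i = 1..5}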
(4) A multi-scale feature pyramid pooling module is added at the final stage of the RGB encoder branch; deeper features are further extracted through the residual fusion modules and upsampled step by step, with stage-wise supervision applied during extraction to achieve global optimization.
(5) As shown in fig. 2, the multi-scale feature pyramid pooling module comprises six sub-branches to obtain context information of the input RGB image feature data: the first sub-branch adopts a 1 × 1 convolutional layer; the second, third, and fourth sub-branches adopt dilated (hole) convolutions with dilation rates of 3, 5, and 7, respectively; and the fifth sub-branch adopts global average pooling to obtain a 1 × 1 feature representation; the sixth sub-branch connects the input feature directly to the output through a skip connection. The first four branches further enhance feature expression using the 1 × 1 convolutional layer and the dilated convolutions while keeping the feature size and the number of channels unchanged. The feature representations obtained by these convolutions are upsampled to the size of the input feature using bilinear interpolation, and the globally average-pooled feature of the fifth branch is likewise upsampled and cascaded into the final feature map. Finally, the features of the six sub-branches are combined by channel-wise concatenation to obtain the feature representation fused with multi-scale pooling.
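For illustration, a PyTorch sketch of the module in fig. 2 is given below; the 3 × 3 kernel used inside the dilated branches and the point at which the globally pooled branch is upsampled are assumptions where the text leaves them open.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True))                  # 1 x 1 convolution
        self.b2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=3, dilation=3), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=5, dilation=5), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=7, dilation=7), nn.ReLU(inplace=True))
        self.b5 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[2:]
        up = lambda t: F.interpolate(t, size=(h, w), mode='bilinear', align_corners=False)
        # sixth sub-branch: direct skip connection of the input feature
        outs = [self.b1(x), self.b2(x), self.b3(x), self.b4(x), up(self.b5(x)), x]
        return torch.cat(outs, dim=1)       # channel-wise concatenation of the six sub-branches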
(6) The residual fusion module is used to fuse the branch features {Ii, Di, i = 1,2,3,4,5} extracted by the two encoder branches and is defined as follows:
res_fuse(I5, D5) = F3(P(I5)) + F1(D5)
res_fuse(Ii, Di) = F3(C(Ii, Up(res_fuse(Ii+1, Di+1)))) + F1(Di),  i = 1, 2, 3, 4
where P(·) denotes the multi-scale feature pyramid pooling module, F3(·) denotes three consecutive convolution and ReLU operations, F1(·) denotes one convolution and ReLU operation, and + denotes element-wise addition.
wherein res_fuse(·) denotes residual fusion, Up(·) denotes parallel upsampling, and C(·) denotes fusion of two features along the channel direction, as shown in fig. 3. For residual fusion there are thus three input sources: the RGB encoder feature of the corresponding size, the depth encoder feature, and the output feature of the previous residual fusion module (this last input does not exist when the residual block is the rightmost one in fig. 1, i.e. i = 5). The RGB encoder feature of the corresponding size is first concatenated along the channel direction with the feature from the previous residual fusion module, and the result, together with the depth encoder feature, is fed into the residual fusion module to obtain the residual fusion feature.
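The residual fusion module described above can be sketched as follows (PyTorch, for illustration; the 3 × 3 kernel size and the channel widths are assumptions). The caller is responsible for making rgb_ch match the concatenated width when a previous fusion result is present.

import torch
import torch.nn as nn

class ResFuse(nn.Module):
    def __init__(self, rgb_ch, depth_ch, out_ch):
        super().__init__()
        # F3: three consecutive convolution + ReLU operations on the rgb-side feature
        self.f3 = nn.Sequential(
            nn.Conv2d(rgb_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        # F1: one convolution + ReLU operation on the depth encoder feature Di
        self.f1 = nn.Sequential(nn.Conv2d(depth_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, rgb_feat, depth_feat, prev=None):
        # prev: parallel-upsampled output of the previous residual fusion module; None when i = 5
        if prev is not None:
            rgb_feat = torch.cat([rgb_feat, prev], dim=1)   # C(Ii, Up(res_fuse(Ii+1, Di+1)))
        return self.f3(rgb_feat) + self.f1(depth_feat)      # element-wise residual addition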
(7) As shown in fig. 4, the parallel upsampling comprises four sub-branches, which use convolutional layers with different receptive fields and are intended to capture different local structures. The responses produced by the four convolutional layers are then concatenated into a tensor feature of size H × W × 2C; to rearrange this tensor into a spatially larger feature, the 2C channels are split in units of C/2 and re-stitched along the H direction and the W direction respectively, finally yielding a feature of size 2H × 2W × C/2.
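A sketch of the parallel upsampling of fig. 4 follows. The splitting of the H × W × 2C tensor into C/2-channel groups and its re-stitching along H and W is realized here with a pixel-shuffle rearrangement, which yields the stated 2H × 2W × C/2 output; the four kernel sizes chosen for the "different receptive fields" are an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelUpsample(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # four parallel convolutions with different receptive fields, each producing ch/2 channels
        self.branches = nn.ModuleList(
            [nn.Conv2d(ch, ch // 2, k, padding=k // 2) for k in (1, 3, 5, 7)])

    def forward(self, x):                                     # x: N x C x H x W
        y = torch.cat([b(x) for b in self.branches], dim=1)   # N x 2C x H x W
        # regroup the 2C channels in units of C/2 and re-stitch along H and W
        return F.pixel_shuffle(y, 2)                          # N x C/2 x 2H x 2W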
(8) The weight parameters are initialized with a VGG-16 model pre-trained on ImageNet; in the model training phase, the cross-entropy loss function is used as the objective and optimized with the Adam algorithm, with momentum set to 0.9, weight decay set to 0.1, base learning rate set to 1 × 10^-6, and batch size set to 1.
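Step (8) might be wired up as in the sketch below (PyTorch, assumed): binary cross-entropy on the predicted saliency map stands in for the cross-entropy objective, Adam's first-moment coefficient plays the role of the stated momentum, and the model is a placeholder supplied by the caller.

import torch
import torch.nn.functional as F

def make_optimizer(model):
    # Adam with base learning rate 1e-6, "momentum" 0.9 as beta1, weight decay 0.1
    return torch.optim.Adam(model.parameters(), lr=1e-6, betas=(0.9, 0.999), weight_decay=0.1)

def train_step(model, optimizer, rgb, depth, gt):
    # one optimization step on a single RGB-D pair (batch size 1)
    pred = model(rgb, depth)                     # final saliency prediction P_final in [0, 1]
    loss = F.binary_cross_entropy(pred, gt)      # cross-entropy loss as the objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()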

Claims (1)

1. A salient object detection method based on residual fusion is characterized by comprising the following steps:
(1) constructing a significance detection model
The saliency detection model comprises a dual-stream feature extraction module, a multi-scale feature pyramid pooling module, a residual fusion module and a parallel upsampling module;
(2) carrying out channel copying on an original depth image corresponding to the RGB image I to obtain a depth image D;
(3) an RGB image I and its depth image D are input into the saliency detection model, and multi-level RGB image features {Ii, i = 1,2,3,4,5} and multi-level depth image features {Di, i = 1,2,3,4,5} are extracted by the RGB image feature extraction branch and the depth image feature extraction branch of the dual-stream feature extraction module, respectively;
(4) a multi-scale feature pyramid pooling module is added at the final stage of the RGB image feature extraction branch, deeper features are further extracted through five residual fusion modules, and parallel upsampling is performed step by step to obtain the final saliency prediction result P_final;
(5) the multi-scale feature pyramid pooling module comprises six sub-branches and is used to obtain context information of the input feature data: the first sub-branch adopts a 1 × 1 convolutional layer; the second, third, and fourth sub-branches adopt dilated (hole) convolutions with dilation rates of 3, 5, and 7, respectively; the fifth sub-branch adopts global average pooling to obtain a 1 × 1 feature representation; and the sixth sub-branch connects the input feature data directly to the output through a skip connection; the first four branches further strengthen the feature expression with the 1 × 1 convolutional layer and the dilated convolutions while keeping the feature size and the number of channels unchanged; the feature representations obtained by these convolutions are then upsampled to the size of the input feature using bilinear interpolation; finally, the six sub-branches are combined by channel-wise concatenation to obtain the multi-scale feature pyramid pooled representation of the input feature data;
(6) the residual fusion module is used to fuse the branch features {Ii, Di, i = 1,2,3,4,5} from the dual-stream feature extraction module and is defined as follows:
res_fuse(I5, D5) = F3(P(I5)) + F1(D5)
res_fuse(Ii, Di) = F3(C(Ii, Up(res_fuse(Ii+1, Di+1)))) + F1(Di),  i = 1, 2, 3, 4
where P(·) denotes the multi-scale feature pyramid pooling module, F3(·) denotes three consecutive convolution and ReLU operations, F1(·) denotes one convolution and ReLU operation, and + denotes element-wise addition;
wherein res_fuse(·) denotes residual fusion, Up(·) denotes parallel upsampling, and C(·) denotes fusion of two inputs along the channel direction; for residual fusion, two cases are distinguished: when the input is the last stage of the feature extraction module, I5 is passed directly through the multi-scale feature pyramid pooling module and used as the RGB feature; this feature, after three consecutive convolution and ReLU operations, is added element-wise to the feature obtained from D5 after one convolution and ReLU operation, giving the residual fusion result; otherwise, for the i-th level RGB image feature Ii and the i-th level depth image feature Di of the feature extraction module, the residual fusion result obtained from Ii+1 and Di+1 is first parallel-upsampled and then fused with Ii along the channel direction to serve as the RGB feature; this feature, after three consecutive convolution and ReLU operations, is added element-wise to the feature obtained from Di after one convolution and ReLU operation, giving the residual fusion result;
(7) the parallel upsampling comprises four sub-branches, which use convolutional layers with different receptive fields and are intended to capture different local structures; the responses produced by the four convolutional layers are then concatenated into a tensor feature of size H × W × 2C; to rearrange this tensor into a spatially larger feature, the 2C channels are split in units of C/2 and re-stitched along the H direction and the W direction respectively, finally yielding a feature of size 2H × 2W × C/2;
(8) the weight parameters are initialized with a VGG-16 model pre-trained on ImageNet; in the model training phase, the cross-entropy loss function is used as the objective and optimized with the Adam algorithm, with momentum set to 0.9, weight decay set to 0.1, base learning rate set to 1 × 10^-6, and batch size set to 1.
CN202011235626.2A 2020-11-06 2020-11-06 Significance target detection method based on residual error fusion Active CN112348870B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011235626.2A CN112348870B (en) 2020-11-06 2020-11-06 Significance target detection method based on residual error fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011235626.2A CN112348870B (en) 2020-11-06 2020-11-06 Significance target detection method based on residual error fusion

Publications (2)

Publication Number Publication Date
CN112348870A CN112348870A (en) 2021-02-09
CN112348870B (en) 2022-09-30

Family

ID=74429671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011235626.2A Active CN112348870B (en) 2020-11-06 2020-11-06 Significance target detection method based on residual error fusion

Country Status (1)

Country Link
CN (1) CN112348870B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205481A (en) * 2021-03-19 2021-08-03 浙江科技学院 Salient object detection method based on stepped progressive neural network
CN113344844A (en) * 2021-04-14 2021-09-03 山东师范大学 Target fruit detection method and system based on RGB-D multimode image information
CN113408350B (en) * 2021-05-17 2023-09-19 杭州电子科技大学 Remote sensing image significance detection method based on edge feature extraction
CN113486899B (en) * 2021-05-26 2023-01-24 南开大学 Saliency target detection method based on complementary branch network
CN113298154B (en) * 2021-05-27 2022-11-11 安徽大学 RGB-D image salient object detection method
CN113536973B (en) * 2021-06-28 2023-08-18 杭州电子科技大学 Traffic sign detection method based on saliency
CN113763447B (en) * 2021-08-24 2022-08-26 合肥的卢深视科技有限公司 Method for completing depth map, electronic device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210539B (en) * 2019-05-22 2022-12-30 西安电子科技大学 RGB-T image saliency target detection method based on multi-level depth feature fusion
CN110909594A (en) * 2019-10-12 2020-03-24 杭州电子科技大学 Video significance detection method based on depth fusion
CN111242138B (en) * 2020-01-11 2022-04-01 杭州电子科技大学 RGBD significance detection method based on multi-scale feature fusion
CN111582316B (en) * 2020-04-10 2022-06-28 天津大学 RGB-D significance target detection method
CN111798436A (en) * 2020-07-07 2020-10-20 浙江科技学院 Salient object detection method based on attention expansion convolution feature fusion

Also Published As

Publication number Publication date
CN112348870A (en) 2021-02-09

Similar Documents

Publication Publication Date Title
CN112348870B (en) Significance target detection method based on residual error fusion
CN111242138B (en) RGBD significance detection method based on multi-scale feature fusion
CN111582316B (en) RGB-D significance target detection method
CN111931787A (en) RGBD significance detection method based on feature polymerization
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN109005398B (en) Stereo image parallax matching method based on convolutional neural network
CN110929735B (en) Rapid significance detection method based on multi-scale feature attention mechanism
CN113076957A (en) RGB-D image saliency target detection method based on cross-modal feature fusion
CN112991350A (en) RGB-T image semantic segmentation method based on modal difference reduction
CN114612832A (en) Real-time gesture detection method and device
CN112767418A (en) Mirror image segmentation method based on depth perception
CN114926734B (en) Solid waste detection device and method based on feature aggregation and attention fusion
CN111899203A (en) Real image generation method based on label graph under unsupervised training and storage medium
CN114283315A (en) RGB-D significance target detection method based on interactive guidance attention and trapezoidal pyramid fusion
CN114693929A (en) Semantic segmentation method for RGB-D bimodal feature fusion
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
CN113888505B (en) Natural scene text detection method based on semantic segmentation
CN115908793A (en) Coding and decoding structure semantic segmentation model based on position attention mechanism
Zhang et al. Spatial-information guided adaptive context-aware network for efficient RGB-D semantic segmentation
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
Wang et al. A multi-scale attentive recurrent network for image dehazing
CN113222016B (en) Change detection method and device based on cross enhancement of high-level and low-level features
CN115471718A (en) Construction and detection method of lightweight significance target detection model based on multi-scale learning
CN113962332B (en) Salient target identification method based on self-optimizing fusion feedback
CN113920317A (en) Semantic segmentation method based on visible light image and low-resolution depth image

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant