CN113313077A - Salient object detection method based on multi-strategy and cross feature fusion - Google Patents
- Publication number
- CN113313077A CN113313077A CN202110743443.XA CN202110743443A CN113313077A CN 113313077 A CN113313077 A CN 113313077A CN 202110743443 A CN202110743443 A CN 202110743443A CN 113313077 A CN113313077 A CN 113313077A
- Authority
- CN
- China
- Prior art keywords
- neural network
- convolutional neural
- strategy
- fusion
- pixel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a salient object detection method based on multi-strategy and cross feature fusion, and relates to the field of deep learning. In the training stage, a convolutional neural network is constructed whose hidden layers comprise 10 neural network convolution blocks, 5 multi-strategy fusion blocks and 4 cross feature fusion blocks. The original RGB color images and Depth images are input into the convolutional neural network for training to obtain the corresponding salient object detection images; loss function values between the original prediction maps and the corresponding real saliency label maps (Ground Truth) are then computed to obtain the optimal weight vector and bias term of the convolutional neural network classification training model. In the testing stage, the RGB color image of the salient object to be detected and the corresponding Depth image are input together into the convolutional neural network classification training model to obtain a predicted salient object detection image. The method has the advantage of improving the efficiency and accuracy of RGB-D salient object detection.
Description
Technical Field
The invention relates to the field of deep learning, in particular to a salient object detection method based on multi-strategy and cross feature fusion.
Background
Salient Object Detection (SOD) plays an important role in many computer vision tasks as a powerful preprocessing tool: it models the human visual attention mechanism to identify the objects in a natural image that attract attention. It has many applications, such as autonomous driving, robot navigation, visual tracking, image retrieval, aesthetic assessment, and content-aware image editing. Inspired by progress in perceptual psychology, early models used heuristic priors and hand-crafted features such as contrast and distance transforms. However, in complex scenes their detection performance is severely limited. Recent studies have demonstrated that deep learning techniques, particularly Convolutional Neural Networks (CNNs), are especially good at extracting semantic features from image regions to understand visual concepts, and achieve significant results.
Deep-learning semantic segmentation performs end-to-end segmentation directly at the pixel level: the images in the training set are input into a model framework for training to obtain the weights and the model, which can then make predictions on the test set. The power of the convolutional neural network lies in its multi-layer structure, which can automatically learn features at multiple levels. Current deep-learning semantic segmentation methods fall into two types. The first is the encoding-decoding architecture: during encoding, position information is gradually reduced and abstract features are extracted through pooling layers; during decoding, the position information is gradually recovered, usually with direct (skip) connections between decoder and encoder. The second framework is dilated (atrous) convolution, which expands the receptive field without pooling: a smaller dilation rate gives a smaller receptive field and learns specific local features, while a larger dilation rate gives a larger receptive field and learns more abstract features, which are more robust to the size, position, and orientation of an object.
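The receptive-field effect of dilated convolution described above can be checked with a small sketch. The function below (a hedged illustration, not part of the patent) composes receptive fields across stacked convolution layers using the standard formula: a k×k convolution with dilation d has an effective kernel of d·(k−1)+1.

```python
# A small sketch showing why dilated (atrous) convolution enlarges the
# receptive field without pooling: the effective kernel of a k x k conv
# with dilation d is d*(k-1)+1, and receptive fields compose across layers.
def receptive_field(kernel_sizes, dilations, strides=None):
    """Receptive field (in input pixels) of a stack of conv layers."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, d, s in zip(kernel_sizes, dilations, strides):
        k_eff = d * (k - 1) + 1       # effective (dilated) kernel size
        rf += (k_eff - 1) * jump      # growth contributed by this layer
        jump *= s                     # distance between adjacent outputs
    return rf
```

For example, three stacked 3×3 convolutions with dilations 1, 2, 4 cover a 15-pixel receptive field, versus only 7 pixels without dilation.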
Most existing salient object detection methods adopt deep learning, building large numbers of models from combinations of convolution and pooling layers. Depth information can provide important complementary cues for identifying salient objects in complex scenes. With the rapid development of imaging technology, acquiring depth maps has become more convenient, which has promoted research on RGB-D saliency detection. Furthermore, depth maps contain many useful attributes, such as the shape, contours, and geometric spatial information of the salient object, which can serve as relevant cues for RGB-D saliency.
Disclosure of Invention
In view of the above, the present invention provides a method for detecting a salient object based on multi-strategy and cross feature fusion.
In order to achieve the purpose, the invention adopts the following technical scheme:
a salient object detection method based on multi-strategy and cross feature fusion comprises the following steps:
selecting RGB color images, Depth images and Ground Truth label images of a plurality of data sets to form a training set;
constructing a convolutional neural network, wherein the convolutional neural network adopts a top-down high-level feature supervision low-level feature fusion mode;
inputting the training set into the convolutional neural network, and training the convolutional neural network;
and training for multiple times to obtain a convolutional neural network model.
Preferably, the convolutional neural network introduces a depth optimization module to improve the quality of the depth features, and the feature maps produced by the multi-strategy fusion modules are cross-fused by the cross feature fusion modules to capture joint features.
Preferably, the depth optimization module has the following structure:
the first maximum pooling layer, the first rolling block, the first activation layer, the second rolling block and the second activation layer are sequentially connected and then are subjected to pixel multiplication with the first maximum pooling layer and then are input into the second maximum pooling layer, the second maximum pooling layer is sequentially connected with the third rolling block and the third activation layer, the output of the third activation layer is subjected to pixel multiplication with the second maximum pooling layer and then is input into the third maximum pooling layer, and the output of the third maximum pooling layer and the output of the first maximum pooling layer are subjected to pixel addition to form final output.
Preferably, the multi-strategy fusion module performs pixel subtraction, pixel addition and pixel multiplication on the depth features and the RGB features, and takes the mean and the maximum over the channel dimension; the pixel-subtraction, pixel-addition and pixel-multiplication results and the channel-wise mean and maximum are then pixel-added to obtain a first output; the upper-layer fused features are upsampled and pixel-added with the first output to form the final output.
Preferably, the structure of the cross-fusion module is as follows:
second inputBy feature extraction and first inputThe result of the pixel addition is recorded as Output via the first convolution block andperforming pixel addition to obtain M, performing pixel addition on M and M, using the result of pixel addition as the input of pixel multiplication with M, using the result of pixel multiplication as the input of pixel subtraction with M, using the result of pixel subtraction as the input of channel superposition with M, and performing second convolution on the output of channel superpositionAnd finally outputting the block.
Compared with the prior art, the method for detecting the salient object based on multi-strategy and cross feature fusion has the following beneficial effects:
1) the method comprises the steps of constructing a convolutional neural network, inputting RGB-D images in a training set into the convolutional neural network for training, and obtaining a convolutional neural network classification training model; and inputting the image to be subjected to significance detection into a convolutional neural network classification training model, and predicting to obtain a predicted significance image corresponding to the RGB image.
2) The method adopts a cross feature fusion module to cross-fuse the feature maps of the multi-strategy fusion modules, capturing joint features and providing supplementary information for the single-mode features.
3) The method adopts the depth optimization module to eliminate the influence of the noise of the depth information on the network, so that the obtained depth information can better express the position information of the salient body.
4) The method adopts a bidirectional cooperation structure, adopts top-down supervision and bottom-up decoding, and refines global features to regional features for final prediction.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a cross-fusion module architecture of the present invention;
FIG. 3 is a block diagram of a depth optimization module according to the present invention;
FIG. 4 is a block diagram of a multi-policy fusion module according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a method for detecting a salient object based on multi-strategy and cross feature fusion; the overall implementation block diagram is shown in FIG. 1, and the method comprises a training stage and a testing stage.
the specific steps of the training phase process are as follows:
step 1_ 1: selecting Q NJU2K and RGB color images, Depth images and Ground Truth label images of an NLPR data set, forming a training set, and recording the Q-th original obvious detection image in the training set as { I }q(I, j) }, the training set is summed with { I }q(i, j) } the corresponding real label image is recorded asThen, the real significance detection image corresponding to each original significance image in the training set is processed into 1 single-hot coding image by adopting the existing single-hot coding technology (one-hot), and the 1 single-hot coding image is obtainedThe processed set of 1 one-hot coded image is denoted asWherein, the road scene image is an RGB color image, Q is a positive integer, Q is more than or equal to 200, if Q is 2185, Q is a positive integer, Q is more than or equal to 1 and less than or equal to Q, i is more than or equal to 1 and less than or equal to W, j is more than or equal to 1 and less than or equal to H, W represents a hard faceIq(I, j) }, H denotes { I }q(I, j) } e.g. take W224, H224, Iq(I, j) represents { IqThe pixel value of the pixel point with the coordinate position (i, j) in (i, j),to representThe middle coordinate position is the pixel value of the pixel point of (i, j); here, 2185 images in the saliency detection image database NJU2K and the NLPR training set were selected directly.
Step 1_ 2: constructing a convolutional neural network: the convolutional neural network is divided into an encoding (Encode) part and a decoding (Decode) part, and respectively corresponds to Feature extraction (Feature Extract) and Feature Fusion (Feature Fusion) of an image. Fig. 2 is a cross fusion module structure diagram, fig. 3 is a depth optimization module structure diagram, and fig. 4 is a multi-strategy fusion module structure diagram.
The input combines two modalities, RGB (three channels) and Depth (single channel), so the network input is split into two streams that encode RGB and Depth respectively. Since depth information contains spatial information between image regions, it plays a very important role in salient object detection; however, depth maps are usually of low quality and may introduce noise and redundancy into the network, so we introduce a Depth Optimization Module (DOM). The backbone network is ResNet-50, and the RGB and Depth encoders each consist of 5 convolution blocks. In the RGB stream, the 1st, 2nd and 3rd convolution blocks are defined as low-level features and the 4th and 5th as high-level features; in the Depth stream, the 6th, 7th and 8th convolution blocks are defined as low-level features and the 9th and 10th as high-level features. Between the two encoding streams there are 5 Multi-Strategy Fusion (MSF) modules, which use high-level features to supervise low-level feature fusion in a top-down manner. Each MSF has a supervision output obtained by upsampling (Upsample), used as a supervision loss during training. The first MSF module performs Cross Feature Fusion (CFF) with the outputs of the 2nd, 3rd, 4th and 5th multi-strategy fusion modules. The input pictures of both encoding streams have width W and height H.
For the RGB color image pre-training layers and the Depth single-channel image pre-training layers, ResNet-50 pre-trained on ImageNet is adopted, with five outputs per stream. The first output layer of the RGB pre-training layers has width W/2 and height H/2 with 64 feature maps, denoted R1; the second has width W/4 and height H/4 with 256 feature maps, denoted R2; the third has width W/8 and height H/8 with 512 feature maps, denoted R3; the fourth has width W/16 and height H/16 with 1024 feature maps, denoted R4; the fifth has width W/32 and height H/32 with 2048 feature maps, denoted R5. The Depth image pre-training layers likewise have five outputs, denoted D1, D2, D3, D4 and D5, with the same structures as R1–R5 respectively.
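The stage shapes above can be tabulated with a small helper. This is an illustrative sketch; the channel counts (64, 256, 512, 1024, 2048) are the standard ResNet-50 stage widths, stated here as an assumption consistent with the sizes listed.

```python
# Expected per-stage encoder output shapes (channels, height, width) for a
# standard ResNet-50 backbone with input width w and height h. Each stage
# halves the spatial resolution: W/2, W/4, ..., W/32.
def resnet50_stage_shapes(w, h):
    channels = [64, 256, 512, 1024, 2048]  # standard ResNet-50 stage widths
    return [(c, h // 2 ** (i + 1), w // 2 ** (i + 1))
            for i, c in enumerate(channels)]
```

With W = H = 224 this reproduces the R1–R5 sizes given in the text, from 64×112×112 down to 2048×7×7.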
For the 6th to 10th convolution blocks (the Depth stream), the output of each convolution block is passed through the depth optimization modules DOM1, DOM2, DOM3, DOM4 and DOM5 to obtain the optimized depth features D1, D2, D3, D4 and D5.
The input of the depth optimization module DOM is D_i (C_i × H_i × W_i), i = 1, 2, 3, 4, 5, where C_i denotes the number of channels and H_i, W_i denote the height and width of the feature map. Channel Attention is performed first: the main branch consists of the first maximum pooling layer (global max pooling, output size 1 × 1), the first convolution block (kernel 1 × 1, stride 1, C_i channels), the first activation layer (ReLU), the second convolution block (kernel 1 × 1, stride 1, C_i channels) and the second activation layer (Sigmoid); the main branch and the shortcut branch are then pixel-multiplied to obtain the channel-attended feature. Spatial Attention is performed next: the main branch consists of a maximization layer (Maximize), the third convolution block (kernel 7 × 7, stride 1, padding 3) and a third activation layer (Sigmoid); the channel-attention output is multiplied by the spatial-attention output. Finally, the original input D_i is pixel-added to this result as the input of the next convolution block.
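The DOM described above can be sketched in PyTorch as follows. This is a minimal, hedged reconstruction: the class and attribute names are mine, and the spatial-attention maximization is interpreted here as a channel-wise max (a common reading of such attention blocks), which the patent text does not fully pin down.

```python
# A minimal PyTorch sketch of the depth optimization module: channel
# attention (global max pool -> two 1x1 convs -> sigmoid), then spatial
# attention (channel-wise max -> 7x7 conv -> sigmoid), then a residual add.
import torch
import torch.nn as nn

class DepthOptimizationModule(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # channel attention branch (main branch of the first stage)
        self.ca = nn.Sequential(
            nn.AdaptiveMaxPool2d(1),            # global max pool to 1x1
            nn.Conv2d(channels, channels, 1),   # first 1x1 conv block
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),   # second 1x1 conv block
            nn.Sigmoid(),
        )
        # spatial attention: 7x7 conv over the channel-wise max map
        self.sa_conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)

    def forward(self, d):
        d_ca = d * self.ca(d)                                   # channel attention
        sa_map = d_ca.max(dim=1, keepdim=True).values           # channel-wise max
        d_sa = d_ca * torch.sigmoid(self.sa_conv(sa_map))       # spatial attention
        return d + d_sa                                         # residual input add
```

The module preserves the input shape, so it can sit between any two encoder blocks of the depth stream.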
Step 1_ 3: for the fifth multi-strategy fusion module, the outputs of the fifth convolution module (RGB color feature R5) and the 5 th Depth optimization module (Depth feature D5) are used as inputs, pixel subtraction, pixel addition and pixel multiplication are respectively carried out, the maximum value of the channel and the average value of the channel are taken, and Q is obtained1,Q2,Q3,Q4,Q5Then respectively adding Qi(i ═ 1,2,3,4, 5) are added as the fusion features input by the next-layer multi-strategy fusion module, and for the 4 th multi-strategy fusion module, the 3 rd multi-strategy fusion module, the 2 nd multi-strategy fusion module, and the 1 st multi-strategy fusion module, the 4 th convolution block, the 3 rd convolution block, the 2 nd convolution block, the 1 st convolution block (R4, R3, R2, R1) and the 4 th depth optimization module, the 3 rd depth optimization module, the 2 nd depth optimization module, the 1 st depth optimization module (D4, D3, D2, D1) and the fusion features of the previous-layer multi-strategy fusion feature module are input, respectively. Will Di(i ═ 1,2,3,4) and Ri(i ═ 1,2,3,4), pixel subtraction, pixel addition, pixel multiplication, channel maximization, and channel leveling, respectivelyMean value to obtain Q1,Q2,Q3,Q4,Q5Then, the fusion characteristics of the multi-strategy fusion module of the upper layer are sampled by 2 times to obtain Fi(i ═ 1,2,3,4) and finally Q1,Q2,Q3,Q4,Q5And FiAnd adding the fusion characteristics as the input fusion characteristics of the next layer of multi-strategy fusion module.
For the 4th, 3rd, 2nd and 1st cross fusion modules, the inputs are the output of the first multi-strategy fusion module and the outputs of the 5th, 4th, 3rd and 2nd multi-strategy fusion modules respectively. First, the i-th (i = 2, 3, 4, 5) multi-strategy fusion output is upsampled by a factor of 2^(i-1) and features are extracted with a convolution layer (kernel 3 × 3, stride 1, padding 1, 64 output channels), followed by normalization (Batch Norm) and activation (Rectified Linear Unit, ReLU). The result is pixel-added with the other input; the sum undergoes a first convolution (kernel 3 × 3, stride 1, padding 1) and is pixel-added with the sum itself to obtain the intermediate feature. This feature is then added with itself; the result is multiplied with the feature; the feature is subtracted from the product; and the difference is concatenated (Concat) with the feature. The second convolution block has kernel size 1, stride 1 and 64 output channels.
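The CFF chain above can be sketched as a small PyTorch module. This is a hedged illustration of the add → multiply → subtract → concatenate sequence: the class and variable names are mine, and the 2^(i-1) upsampling of the second input is omitted (both inputs are assumed to be at the same resolution already).

```python
# A minimal PyTorch sketch of the cross feature fusion (CFF) module:
# feature extraction of the second input, residual first convolution,
# then the add -> multiply -> subtract chain, channel concat, 1x1 conv.
import torch
import torch.nn as nn

class CrossFeatureFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # feature extraction: 3x3 conv + BatchNorm + ReLU
        self.extract = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(2 * channels, channels, 1)  # after concat

    def forward(self, x1, x2):
        t = x1 + self.extract(x2)      # pixel addition of the two inputs
        m = self.conv1(t) + t          # first convolution plus residual add
        c = (m + m) * m - m            # add, multiply, subtract chain
        return self.conv2(torch.cat([c, m], dim=1))  # channel concat + 1x1 conv
```

The patent uses 64-channel features at this stage; the module is written generically over `channels`.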
Step 1_ 4: and performing data enhancement on each original RGB color image and Depth image in the training set by means of random cutting, rotation, color enhancement, overturning and the like, and then taking the images as initial input images, wherein the batch size is 4. Inputting the prediction images into a deep convolution neural network for training to obtain a prediction image with each original saliency image in a training set equal to the original size, and in addition, in order to assist the training, outputting 5 multi-strategy fusion modules during the trainingThe sizes are W/2H/2, W/4H/4, W/8H/8, W/16H/16 and W/32H/32 in turn, and 2 is subjected to upsamplingiMultiplying to obtain the characteristics with H x W and the final output M of the modeloutSupervise training together, willMoutAnd MGTThe LOSS function between (true values) is noted LOSS (M)pre,MGT) The LOSS adopts a Binary Cross Entropy LOSS function (Binary Cross Entropy LOSS) and finally sums 6 losses to obtain a final LOSS value.
Step 1_ 5: repeatedly executing the step 1_4 for N times until the neural network converges on the training set, and taking 800 original RGB color images and Depth images as a verification set during the training period to obtain N loss function values in total; then finding out the loss function value with the minimum value from the N loss function values; and then, correspondingly taking the weight vector and the bias item corresponding to the loss function value with the minimum value as the optimal weight vector and the optimal bias item of the convolutional neural network classification training model, and correspondingly marking as WbestAnd bbest(ii) a Where N > 1, in this example, N is 300.
The test stage process comprises the following specific steps:
step 2_ 1: the set of NJU2K data sets for 500 original RGB color images and Depth images and the set of NLPR data for 300 original RGB color images and Depth images were taken as the test set. Order toRepresenting a saliency image to be detected; wherein, i ' is more than or equal to 1 and less than or equal to W ', j ' is more than or equal to 1 and less than or equal to H ', and W ' representsWidth of (A), H' representsThe height of (a) of (b),to representAnd the middle coordinate position is the pixel value of the pixel point of (i, j). No data enhancement was performed at the time of testing.
Step 2_ 2: will be provided withThe R channel component, the G channel component and the B channel component are input into a convolutional neural network classification training model and are subjected to W-based classificationbestAnd bbestMaking a prediction to obtainCorresponding predictive semantic segmentation image, denotedWherein the content of the first and second substances,to representMiddle coordinate positionAnd setting the pixel value of the pixel point of (i ', j').
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
A convolutional neural network architecture was built with the Python-based deep learning library PyTorch. The test sets of the saliency detection databases NJU2K and NLPR were used to analyze the segmentation quality of the saliency detection images predicted by the method (500 NJU2K images and 300 NLPR images). The Mean Absolute Error (MAE), F1 Score (F1), Structure measure (S-measure) and Enhanced alignment measure (E-measure) of the detection results are used to evaluate the detection performance, as listed in Table 1. From the data listed in Table 1, the salient object images obtained by the method of the present invention are good, which indicates that it is feasible and effective to obtain salient object images of various scenes by using the method of the present invention.
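Two of the reported metrics are straightforward to sketch. The snippet below (function names mine) implements MAE and the F-measure with the conventional saliency weighting β² = 0.3; S-measure and E-measure involve structural comparisons and are omitted here.

```python
# Minimal NumPy sketches of two saliency evaluation metrics:
# Mean Absolute Error and the F-measure with beta^2 = 0.3.
import numpy as np

def mae(pred, gt):
    """Mean absolute error between prediction and ground truth in [0, 1]."""
    return float(np.abs(pred - gt).mean())

def f_measure(pred, gt, thresh=0.5, beta2=0.3):
    """F-measure at a fixed binarization threshold."""
    p = pred >= thresh
    g = gt > 0.5
    tp = float(np.logical_and(p, g).sum())      # true positives
    precision = tp / max(p.sum(), 1)
    recall = tp / max(g.sum(), 1)
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom > 0 else 0.0
```

A perfect prediction gives MAE = 0 and F-measure = 1; lower MAE and higher F-measure indicate better detection, matching the ↓/↑ arrows in Table 1.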
TABLE 1 evaluation results on test sets using the method of the invention
| ours | S↑ | adpE↑ | adpF↑ | MaxF↑ | MAE↓ |
|---|---|---|---|---|---|
| NJU2K | 0.912 | 0.932 | 0.915 | 0.917 | 0.032 |
| NLPR | 0.920 | 0.958 | 0.904 | 0.912 | 0.022 |
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (5)
1. A salient object detection method based on multi-strategy and cross feature fusion is characterized by comprising the following steps:
selecting RGB color images, Depth images and Ground Truth label images of a plurality of data sets to form a training set;
constructing a convolutional neural network, wherein the convolutional neural network adopts a top-down high-level feature supervision low-level feature fusion mode;
inputting the training set into the convolutional neural network, and training the convolutional neural network;
and training for multiple times to obtain a convolutional neural network model.
2. The method for detecting the salient object based on the multi-strategy and cross-feature fusion as claimed in claim 1, wherein the convolutional neural network introduces a depth optimization module to improve the image quality, and the feature maps obtained by the multi-strategy fusion module are cross-fused by the cross-fusion module to capture the combined features.
3. The method for detecting the salient object based on the multi-strategy and cross feature fusion as claimed in claim 2, wherein the depth optimization module has the following structure:
the first maximum pooling layer, the first convolution block, the first activation layer, the second convolution block and the second activation layer are sequentially connected; the result is pixel-multiplied with the first maximum pooling layer and then input into the second maximum pooling layer; the second maximum pooling layer is sequentially connected with the third convolution block and the third activation layer; the output of the third activation layer is pixel-multiplied with the second maximum pooling layer and then input into the third maximum pooling layer; and the output of the third maximum pooling layer and the output of the first maximum pooling layer are pixel-added to form the final output.
4. The method for detecting the salient object based on the multi-strategy and cross-feature fusion as claimed in claim 2, wherein the multi-strategy fusion module performs pixel subtraction, pixel addition and pixel multiplication operations on the depth feature and the RGB feature respectively, and takes an average value and a maximum value on a channel dimension; subtracting pixels, adding pixels, performing pixel multiplication operation and performing pixel addition on the average value and the maximum value on the channel dimension to obtain a first output; and the upper layer of fusion features are subjected to pixel addition with the first output after being subjected to upsampling to be used as final output.
5. The method for detecting the salient object based on the multi-strategy and cross feature fusion as claimed in claim 2, wherein the structure of the cross fusion module is as follows:
second inputBy feature extraction and first inputThe result of the pixel addition is recorded as Output via the first convolution block andand performing pixel addition to obtain M, performing pixel addition on the M and the M, using a pixel addition result as an input for performing pixel multiplication on the M, using a pixel multiplication result as an input for performing pixel subtraction on the M, using a pixel subtraction result as an input for performing channel superposition on the M, and using an output of the channel superposition as a final output after passing through a second convolution block.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110743443.XA CN113313077A (en) | 2021-06-30 | 2021-06-30 | Salient object detection method based on multi-strategy and cross feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113313077A true CN113313077A (en) | 2021-08-27 |
Family
ID=77381578
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110743443.XA Withdrawn CN113313077A (en) | 2021-06-30 | 2021-06-30 | Salient object detection method based on multi-strategy and cross feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113313077A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110619638A (en) * | 2019-08-22 | 2019-12-27 | Zhejiang University of Science and Technology | Multi-modal fusion saliency detection method based on a convolutional block attention module |
CN111242181A (en) * | 2020-01-03 | 2020-06-05 | Dalian Minzu University | RGB-D salient object detector based on image semantics and details |
CN112149662A (en) * | 2020-08-21 | 2020-12-29 | Zhejiang University of Science and Technology | Multi-modal fusion saliency detection method based on dilated convolution blocks |
CN112529862A (en) * | 2020-12-07 | 2021-03-19 | Zhejiang University of Science and Technology | Saliency image detection method with interactive cyclic feature reshaping |
Non-Patent Citations (1)
Title |
---|
SANGHYUN WOO et al.: "CBAM: Convolutional Block Attention Module", Computer Vision - ECCV 2018 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114445442A (en) * | 2022-01-28 | 2022-05-06 | Hangzhou Dianzi University | Multispectral image semantic segmentation method based on asymmetric cross fusion |
CN115796244A (en) * | 2022-12-20 | 2023-03-14 | Guangdong University of Petrochemical Technology | CFF-based parameter identification method for a super-nonlinear input/output system |
CN115796244B (en) * | 2022-12-20 | 2023-07-21 | Guangdong University of Petrochemical Technology | CFF-based parameter identification method for a super-nonlinear input/output system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210390700A1 (en) | Referring image segmentation | |
CN111723732B (en) | Optical remote sensing image change detection method, storage medium and computing equipment | |
CN110889449A (en) | Edge-enhanced multi-scale remote sensing image building semantic feature extraction method | |
CN113850825A (en) | Remote sensing image road segmentation method based on context information and multi-scale feature fusion | |
CN107871014A (en) | Big-data cross-modal retrieval method and system based on deep fusion hashing | |
CN112966684A (en) | Cooperative learning character recognition method under attention mechanism | |
CN110246148B (en) | Multi-modal significance detection method for depth information fusion and attention learning | |
CN112418212B (en) | YOLOv3 algorithm based on EIoU improvement | |
CN113780149A (en) | Method for efficiently extracting building target of remote sensing image based on attention mechanism | |
CN110929736A (en) | Multi-feature cascade RGB-D significance target detection method | |
CN111738054B (en) | Behavior anomaly detection method based on space-time self-encoder network and space-time CNN | |
CN110705566B (en) | Multi-mode fusion significance detection method based on spatial pyramid pool | |
CN109461177B (en) | Monocular image depth prediction method based on neural network | |
CN116994140A (en) | Cultivated land extraction method, device, equipment and medium based on remote sensing image | |
CN113313077A (en) | Salient object detection method based on multi-strategy and cross feature fusion | |
CN112991364A (en) | Road scene semantic segmentation method based on convolution neural network cross-modal fusion | |
CN111915618B (en) | Peak response enhancement-based instance segmentation algorithm and computing device | |
CN111523463B (en) | Target tracking method and training method based on matching-regression network | |
CN113192073A (en) | Clothing semantic segmentation method based on cross fusion network | |
CN113487600B (en) | Feature enhancement scale self-adaptive perception ship detection method | |
CN113269224A (en) | Scene image classification method, system and storage medium | |
CN112529862A (en) | Saliency image detection method with interactive cyclic feature reshaping | |
CN114170623A (en) | Human interaction detection equipment and method and device thereof, and readable storage medium | |
CN114926734B (en) | Solid waste detection device and method based on feature aggregation and attention fusion | |
Chen et al. | MSF-Net: A multiscale supervised fusion network for building change detection in high-resolution remote sensing images |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20210827 |