CN115546512A - Light field image salient object detection method based on learnable weight descriptor - Google Patents
- Publication number: CN115546512A
- Application number: CN202211047306.3A
- Authority: CN (China)
- Prior art keywords: light field image, features, focus stack
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
- G06V10/42—Global feature extraction by analysis of the whole pattern
- G06V10/44—Local feature extraction by analysis of parts of the pattern
- G06V10/806—Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
- G06V10/82—Image or video recognition or understanding using neural networks
- G06V2201/07—Target detection
- G06N3/08—Learning methods for neural networks
Abstract
The invention discloses a light field image salient object detection method based on a learnable weight descriptor, comprising the following steps: S1, extracting an all-focus feature and a focal stack feature from a light field image; S2, weighting the focal stack features to generate enhanced focal stack features, the weighting being implemented by a Transformer decoder via a learnable weight descriptor; S3, hierarchically interacting the all-focus features with the enhanced focal stack features to generate multi-modal fusion features; S4, decoding the multi-modal fusion features to generate a saliency map; S5, supervising the saliency map with its ground truth and training on a training set to form a light field image salient object detection model; and S6, detecting any light field image with the trained model, outputting a saliency map as the detection result through steps S1 to S4. The learnable weight descriptor weights the focal stack features across focal slices as well as spatially and per channel so as to extract more effective information, while hierarchical multi-modal fusion promotes sufficient interaction of information between the modalities, realizes effective fusion, and improves detection accuracy.
Description
Technical Field
The invention relates to the field of computer vision, and in particular to a light field image salient object detection method based on a learnable weight descriptor.
Background
A light field image consists of an all-focus image and a focal stack containing a series of images focused at different depths. The all-focus image and the focal stack correspond to different modalities of the same scene: the all-focus modality emphasizes appearance and global information, while the focal stack modality emphasizes geometry and region information. Two fusion problems must therefore be solved. First, the images in the focal stack, each focused at a different depth, must be fused effectively so that they complement one another. Second, the two modalities, the all-focus image and the focal stack, must be fused effectively to reduce the gap between them, maximize their common information, and provide better features for the salient object decoding stage.
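The two-modality input described here can be pictured at the level of array shapes. The following is a shape-level sketch with hypothetical sizes (numpy standing in for any tensor library); the 12 focal slices match the embodiment described later in this document, while the 256×256 resolution is an illustrative assumption:

```python
import numpy as np

# One all-focus image plus a focal stack of K slices focused at different
# depths. Both modalities describe the same scene at the same resolution.
K, H, W = 12, 256, 256
all_focus = np.zeros((H, W, 3), dtype=np.float32)       # appearance / global cues
focal_stack = np.zeros((K, H, W, 3), dtype=np.float32)  # geometry / region cues

# The spatial extents of the two modalities must match.
assert focal_stack.shape[1:] == all_focus.shape
```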
Disclosure of Invention
The technical problem to be solved by the invention is to provide a light field image salient object detection method based on a learnable weight descriptor, in which the learnable weight descriptor weights the focal stack features across focal slices as well as spatially and per channel so as to extract more effective information, and hierarchical multi-modal fusion reduces the gap between the all-focus modality and the focal stack modality and improves detection performance.
The technical scheme adopted by the invention is as follows:
A light field image salient object detection method based on a learnable weight descriptor, comprising the following steps:
S1, extracting an all-focus feature and a focal stack feature from a light field image;
S2, weighting the focal stack features to generate enhanced focal stack features, the weighting being implemented by a Transformer decoder via a learnable weight descriptor;
S3, hierarchically interacting the all-focus features with the enhanced focal stack features to generate multi-modal fusion features;
S4, decoding the multi-modal fusion features to generate a saliency map;
S5, supervising the saliency map with its ground truth, and training on a training set to form a light field image salient object detection model;
S6, detecting any light field image with the light field image salient object detection model, outputting a saliency map as the detection result through steps S1 to S4.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a light field image salient object detection method based on a learnable weight descriptor, in which the learnable weight descriptor weights the focal stack features across focal slices as well as spatially and per channel so as to extract more effective information, and hierarchical multi-modal fusion promotes sufficient interaction of information between the modalities, realizes effective fusion, and improves detection accuracy.
Drawings
FIG. 1 is a flow chart of a method for detecting a salient object in a light field image based on a learnable weight descriptor according to the present invention;
the present invention will be further described with reference to the following detailed description and accompanying drawings, but the embodiments of the invention are not limited thereto.
Detailed Description
The following examples are given for the detailed implementation and specific operation of the present invention, but the scope of the present invention is not limited to the following examples.
The embodiment of the invention provides a light field image salient object detection method based on a learnable weight descriptor, comprising the following steps:
S1, extracting an all-focus feature and a focal stack feature from a light field image;
S2, weighting the focal stack features to generate enhanced focal stack features, the weighting being implemented by a Transformer decoder via a learnable weight descriptor;
S3, hierarchically interacting the all-focus features with the enhanced focal stack features to generate multi-modal fusion features;
S4, decoding the multi-modal fusion features to generate a saliency map;
S5, supervising the saliency map with its ground truth, and training on a training set to form a light field image salient object detection model;
S6, detecting any light field image with the light field image salient object detection model, outputting a saliency map as the detection result through steps S1 to S4.
Further, in step S1, the all-focus feature and the focal stack features are extracted by two Pyramid Vision Transformer networks pre-trained on ImageNet, yielding the all-focus features F_0^l and the focal stack features F_k^l, where the subscript k indexes the focal slices in the focal stack (k from 1 to 12) and the superscript l indexes the layer, corresponding to the stages of the Pyramid Vision Transformer (l from 1 to 4).
Further, in step S2, the focal stack features are weighted to generate enhanced focal stack features; the weighting is implemented by a Transformer decoder via a learnable weight descriptor. The specific operations are as follows:
s2.1: the full focus feature and the focus stack feature obtained in step S1 have different resolutions and channel numbers, and in order to reduce the amount of computation, the dilated convolution is used to increase the receptive field and compress the channels into 32, which is specifically described as:
the above-mentionedRepresents the characteristics of the kth focal plate of the l layer after the expansion convolution operation and the channel compression, l is from 1 to 4,k from 0 to 12, and represents that RFB (-) operates on the full focus characteristic and the focus stack characteristic; RFB (. Cndot.) operation refers to the module of increasing the Receptive field composed of dilated convolutions of different dilation rates proposed in the paper "Receptive field block net for acurate and fast object detection";
s2.2: defining a weight descriptor Q, and learning the weights of different focal slices, regions and channels by using Q; taking the characteristics of a 4-layer focus stack after channel compression as K and V, and sending the K and V and the designed Q into a transform decoder together, wherein the characteristics are specifically described as follows:
the above-mentionedRepresents the learned weight descriptor, Q p A query point representing a query Q is shown,is a multi-layer focus stack characteristic after channel compression, MSDeformAttn (·) operation refers to the paper "Deformable DETR: a multi-scale Deformable transform decoder as set forth in Deformable transforms for End-to-End Object Detection;
s2.3: the multi-layer focus stack characteristics after the channel compression and the learned weight descriptors are describedElement-by-element multiplication is performed to weight the focal plate and the regions and channels respectively, thereby forming an enhanced multi-layer focal stack feature, which is described in detail as:
the above-mentionedRepresenting a 4-layer enhanced focal Stack feature, the Reshape (·) operation represents the weight descriptorRecovering the feature from the sequence whenUp-sampling operation Up (-) pair is used when the resolution of the multi-layer focus stack feature is not consistentImplementing an increase resolution, "×" refers to element-by-element multiplication operation;
s2.4: cascading the 12 focal stack features of each layer to form an enhanced focal stack feature, which is specifically described as:
said H l Represents a 4-layer enhanced focus stack feature, the Concat (·) operation represents a cascading operation;
s2.5: in order to make the model learn the correct weight descriptor, the enhanced focus stack features of each layer are supervised by a saliency map value GT, which is specifically described as:
the Pred (-) operation represents the prediction header that produces the saliency map, loss (-) refers to the paper F 3 Net: pixel position perception loss function as proposed in Fusion, feed and Focus for sale Object Detection.
Further, in step S3, the all-focus features and the enhanced focal stack features are hierarchically interacted to generate the multi-modal fusion features. The specific operations are as follows:
S3.1: The all-focus features and enhanced focal stack features of the two high layers are flattened and concatenated at the token level; the concatenated sequence is fed into a multi-head self-attention layer so that the all-focus features and the enhanced focal stack features interact fully; the interacted features are then separated and restored to their original resolutions; finally the two interacted, enhanced features are added:

D^l = Sum(Split(MHSA(Merge(f_0^l, H^l)))), l = 3, 4

where D^l (l = 3, 4) is the multi-modal fusion feature of the two high layers, the Merge(·) operation denotes token-level concatenation, the MHSA(·) operation denotes the multi-head self-attention layer, the Split(·) operation splits the token sequence into two halves from the middle, and the Sum(·) operation denotes element-by-element addition of the two split token sequences.
s3.2: performing global maximum pooling, convolution and activation operation on the full-focus features of the lower two layers to obtain spatial weight, multiplying the enhanced focus stack features and the spatial weight obtained through the full-focus features element by element, and finally performing residual connection, wherein the specific description is as follows:
the Dl (l =1,2) represents the fusion feature of the lower two-layered multi-modality, the P (·) operation represents the global maximum pooling of channel dimensions, conv (·) represents the convolutional layer, σ (·) represents the activation function, and "+" refers to the element-by-element addition operation.
Further, in step S4, the multi-modal fusion features are decoded to generate the saliency map. The specific operations are as follows:
S4.1: The multi-modal fusion features are upsampled and added layer by layer:

D = Up_2(Up_2(Up_2(D^4) + D^3) + D^2) + D^1

where D is the final feature after fusing the four layers of multi-modal fusion features, and Up_2(·) denotes two-fold upsampling.
s4.2: and (3) performing convolution, activation and upsampling on the final characteristic D with an output channel of 1 to restore to the original input image size, specifically describing as follows:
S=Sig(Up 4 (Conv(D)))
said S represents the saliency map, up 4 (. Cndot.) represents a quadruple upsampling and Sig (. Cndot.) represents a Sigmoid activation function.
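Steps S4.1 and S4.2 together form a top-down decoder. The sketch below uses nearest-neighbour upsampling and replaces the 1-channel convolution with a channel mean, both simplifying assumptions; the feature sizes follow typical Pyramid Vision Transformer strides and are hypothetical:

```python
import numpy as np

def up(x, s):
    """Nearest-neighbour upsampling by factor s (stand-in for Up_2 / Up_4)."""
    return x.repeat(s, axis=0).repeat(s, axis=1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Four multi-modal fusion features at decreasing resolution, C = 32.
d4 = rng.standard_normal((8, 8, 32))
d3 = rng.standard_normal((16, 16, 32))
d2 = rng.standard_normal((32, 32, 32))
d1 = rng.standard_normal((64, 64, 32))

# S4.1: D = Up_2(Up_2(Up_2(D^4) + D^3) + D^2) + D^1
d = up(up(up(d4, 2) + d3, 2) + d2, 2) + d1
# S4.2: S = Sig(Up_4(Conv(D))); the 1-channel Conv is sketched as a channel mean.
s = sigmoid(up(d.mean(axis=-1, keepdims=True), 4))
assert d.shape == (64, 64, 32)
assert s.shape == (256, 256, 1)
```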
Further, in step S5, the saliency map is supervised by the saliency map ground truth, and the light field image salient object detection model is formed by training on the training set. The training set uses 1000 pictures from the DUTLF-FS dataset and 100 pictures from the HFUT-Lytro dataset, and the supervision uses a cross-entropy loss.
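The cross-entropy supervision used in S5 can be written out directly. This is plain per-pixel binary cross-entropy; the auxiliary supervision in S2.5 additionally cites F3Net's pixel-position-aware loss, which is not reproduced here:

```python
import numpy as np

def bce(pred, gt, eps=1e-7):
    """Pixel-wise binary cross-entropy between a predicted saliency map and
    its ground truth, both valued in [0, 1]."""
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    return float(-(gt * np.log(pred) + (1 - gt) * np.log(1 - pred)).mean())

gt = np.array([[1.0, 0.0], [0.0, 1.0]])
perfect = bce(np.array([[1.0, 0.0], [0.0, 1.0]]), gt)  # near-zero loss
poor = bce(np.full((2, 2), 0.5), gt)                   # -log(0.5) per pixel
assert perfect < poor
```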
Further, in step S6, any light field image is detected with the light field image salient object detection model, and a saliency map is output as the detection result through steps S1 to S4. The test set uses the pictures of the DUTLF-FS and HFUT-Lytro datasets outside the training set, together with the LFSD dataset.
The method is compared with two RGB, two RGB-D and eight light field image salient object detection methods: PoolNet [1], PGNet [2], BBSNet [3], SwinNet [4], MoLF [5], DLFS [6], LFNet [7], ERNet [8], SA-Net [9], DLGLRG [10], PANet [11] and MEANet [12]. The results are shown in Table 1.
Table 1. Experimental results
[1] Liu, J.-J.; Hou, Q.; Cheng, M.-M.; Feng, J.; and Jiang, J. 2019. A Simple Pooling-Based Design for Real-Time Salient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3917–3926.
[2] Xie, C.; Xia, C.; Ma, M.; Zhao, Z.; Chen, X.; and Li, J. 2022. Pyramid Grafting Network for One-Stage High Resolution Saliency Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11717–11726.
[3] Fan, D.-P.; Zhai, Y.; Borji, A.; Yang, J.; and Shao, L. 2020. BBS-Net: RGB-D Salient Object Detection with a Bifurcated Backbone Strategy Network. In European Conference on Computer Vision, 275–292. Springer.
[4] Liu, Z.; Tan, Y.; He, Q.; and Xiao, Y. 2022. SwinNet: Swin Transformer Drives Edge-Aware RGB-D and RGB-T Salient Object Detection. IEEE Transactions on Circuits and Systems for Video Technology, 32(7): 4486–4497.
[5] Zhang, M.; Li, J.; Wei, J.; Piao, Y.; and Lu, H. 2019. Memory-Oriented Decoder for Light Field Salient Object Detection. Advances in Neural Information Processing Systems, 32: 1–11.
[6] Piao, Y.; Rong, Z.; Zhang, M.; Li, X.; and Lu, H. 2019. Deep Light-Field-Driven Saliency Detection from a Single View. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 904–911.
[7] Zhang, M.; Ji, W.; Piao, Y.; Li, J.; Zhang, Y.; Xu, S.; and Lu, H. 2020. LFNet: Light Field Fusion Network for Salient Object Detection. IEEE Transactions on Image Processing, 29: 6276–6287.
[8] Piao, Y.; Rong, Z.; Zhang, M.; and Lu, H. 2020. Exploit and Replace: An Asymmetrical Two-Stream Architecture for Versatile Light Field Saliency Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 11865–11873.
[9] Zhang, Y.; Chen, G.; Chen, Q.; Sun, Y.; Xia, Y.; Deforges, O.; Hamidouche, W.; and Zhang, L. 2021. Learning Synergistic Attention for Light Field Salient Object Detection. In Proceedings of the British Machine Vision Conference, 1–14.
[10] Liu, N.; Zhao, W.; Zhang, D.; Han, J.; and Shao, L. 2021. Light Field Saliency Detection with Dual Local Graph Learning and Reciprocative Guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4712–4721.
[11] Piao, Y.; Jiang, Y.; Zhang, M.; Wang, J.; and Lu, H. 2021. PANet: Patch-Aware Network for Light Field Salient Object Detection. IEEE Transactions on Cybernetics, 1–13.
[12] Jiang, Y.; Zhang, W.; Fu, K.; and Zhao, Q. 2022. MEANet: Multi-Modal Edge-Aware Network for Light Field Salient Object Detection. Neurocomputing, 491: 78–90.
As shown in Table 1, the method of the embodiment of the invention achieves the best results on the S-measure, adaptive F-measure, adaptive E-measure and MAE evaluation metrics.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (1)
1. A light field image salient object detection method based on a learnable weight descriptor, comprising the following steps:
S1, extracting an all-focus feature and a focal stack feature from a light field image;
S2, weighting the focal stack features to generate enhanced focal stack features, the weighting being implemented by a Transformer decoder via a learnable weight descriptor;
S3, hierarchically interacting the all-focus features with the enhanced focal stack features to generate multi-modal fusion features;
S4, decoding the multi-modal fusion features to generate a saliency map;
S5, supervising the saliency map with its ground truth, and training on a training set to form a light field image salient object detection model;
S6, detecting any light field image with the light field image salient object detection model, outputting a saliency map as the detection result through steps S1 to S4.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202211047306.3A | 2022-08-29 | 2022-08-29 | Light field image salient object detection method based on learnable weight descriptor

Publications (1)

Publication Number | Publication Date
---|---
CN115546512A (Application) | 2022-12-30

Family ID: 84725259

Country Status (1)

Country | Link
---|---
CN | CN115546512A (en), status Pending

Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN117253054A | 2023-11-20 | 2023-12-19 | 浙江优众新材料科技有限公司 | Light field significance detection method and related equipment thereof
CN117253054B | 2023-11-20 | 2024-02-06 | 浙江优众新材料科技有限公司 | Light field significance detection method and related equipment thereof
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ghosh et al. | Stacked spatio-temporal graph convolutional networks for action segmentation | |
CN111582316B (en) | RGB-D significance target detection method | |
WO2021018163A1 (en) | Neural network search method and apparatus | |
CN113076957A (en) | RGB-D image saliency target detection method based on cross-modal feature fusion | |
CN110543890A (en) | Deep neural network image matching method based on characteristic pyramid | |
CN109766918B (en) | Salient object detection method based on multilevel context information fusion | |
CN116758130A (en) | Monocular depth prediction method based on multipath feature extraction and multi-scale feature fusion | |
CN116580192A (en) | RGB-D semantic segmentation method and system based on self-adaptive context awareness network | |
Wang et al. | TF-SOD: a novel transformer framework for salient object detection | |
CN115546512A (en) | Light field image salient object detection method based on learnable weight descriptor | |
CN116229222A (en) | Light field saliency target detection method and device based on implicit graph learning | |
CN116485860A (en) | Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features | |
Zhang et al. | Spatial-information guided adaptive context-aware network for efficient RGB-D semantic segmentation | |
CN116884074A (en) | Lightweight face recognition method based on mixed attention mechanism | |
CN117078539A (en) | CNN-transducer-based local global interactive image restoration method | |
CN113298154B (en) | RGB-D image salient object detection method | |
CN112927250B (en) | Edge detection system and method based on multi-granularity attention hierarchical network | |
CN114821438A (en) | Video human behavior identification method and system based on multipath excitation | |
CN114613011A (en) | Human body 3D (three-dimensional) bone behavior identification method based on graph attention convolutional neural network | |
CN111047571B (en) | Image salient target detection method with self-adaptive selection training process | |
CN114639166A (en) | Examination room abnormal behavior recognition method based on motion recognition | |
CN110765864A (en) | Image pedestrian re-identification system and method based on resolution irrelevant features | |
Wang et al. | Gait Recognition based on lightweight CNNs | |
Mao et al. | Traffic Scene Object Detection Algorithm Based on Improved SSD | |
Wanjun et al. | Yoga action recognition based on STF-ResNet |
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination