CN115546512A - Light field image salient object detection method based on learnable weight descriptor

Light field image salient object detection method based on learnable weight descriptor

Info

Publication number: CN115546512A
Application number: CN202211047306.3A
Authority: CN (China)
Prior art keywords: light field image, features, focal stack
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 刘政怡, 何倩, 檀亚诚
Applicant and current assignee: Anhui University
Filing/priority date: 2022-08-29
Publication date: 2022-12-30

Classifications

    • G06V 10/462 — Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/42 — Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections
    • G06V 10/806 — Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06V 10/82 — Image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 2201/07 — Indexing scheme relating to image or video recognition or understanding; target detection

Abstract

The invention discloses a light field image salient object detection method based on a learnable weight descriptor, comprising the following steps: S1, extracting full-focus features and focal stack features from a light field image; S2, weighting the focal stack features to generate enhanced focal stack features, the weights being produced by a Transformer decoder from a learnable weight descriptor; S3, hierarchically interacting the full-focus features with the enhanced focal stack features to generate multi-modal fusion features; S4, decoding the multi-modal fusion features to generate a saliency map; S5, supervising the saliency map with the ground-truth saliency map and forming a light field image salient object detection model through training on a training set; and S6, detecting any light field image with the light field image salient object detection model and outputting a saliency map as the detection result through steps S1 to S4. Through the learnable weight descriptor, the method weights the focal stack features across focal slices as well as spatially and per channel so as to extract more effective information; through hierarchical multi-modal fusion, it promotes sufficient interaction of information between the modalities, achieves effective fusion, and improves detection accuracy.

Description

Light field image salient object detection method based on learnable weight descriptor
Technical Field
The invention relates to the field of computer vision, and in particular to a light field image salient object detection method based on a learnable weight descriptor.
Background
A light field image consists of a full-focus (all-in-focus) image and a focal stack containing a series of images focused at different depths. The full-focus image and the focal stack correspond to different modalities of the same scene: the full-focus modality emphasizes appearance and global information, while the focal stack modality emphasizes geometric and regional information. Salient object detection on light field images therefore requires, first, effectively fusing the series of images focused at different depths within the focal stack so that they complement one another, and second, effectively fusing the two modalities, full-focus image and focal stack, so as to reduce the difference between them, maximize what they have in common, and provide better features for the salient object decoding stage.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a light field image salient object detection method based on a learnable weight descriptor, in which the learnable weight descriptor weights the focal stack features across focal slices as well as spatially and per channel so as to extract more effective information, and hierarchical multi-modal fusion reduces the gap between the full-focus modality and the focal stack modality and improves detection performance.
The technical scheme adopted by the invention is as follows:
a method for detecting salient objects in light field images based on learnable weight descriptors, the method comprising the steps of:
s1, respectively extracting a full focusing feature and a focus stack feature from a light field image;
s2, weighting the focus stack characteristics to generate enhanced focus stack characteristics; the weight is realized by a Transformer decoder through a weight descriptor which can be learned;
s3, hierarchically interacting the full focusing features and the enhanced focus stack features to generate multi-modal fusion features;
s4, decoding the multi-modal fusion features to generate a saliency map;
s5, monitoring the saliency map by using a saliency map truth value, and forming a light field image saliency target detection model through training of a training set;
and S6, detecting any one light field image by using the light field image salient object detection model, and outputting a salient image as a detection result through the steps S1-S4.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a light field image salient object detection method based on a learnable weight descriptor, which is characterized in that inter-focal-slice weighting and space and channel weighting are carried out on the characteristics of a focal stack through the learnable weight descriptor so as to obtain more effective information, and through layered multi-modal fusion, sufficient interaction of information among multiple modes is promoted, effective fusion is realized, and the detection precision is improved.
Drawings
FIG. 1 is a flow chart of a method for detecting a salient object in a light field image based on a learnable weight descriptor according to the present invention;
the present invention will be further described with reference to the following detailed description and accompanying drawings, but the embodiments of the invention are not limited thereto.
Detailed Description
The following examples describe the detailed implementation and specific operation of the present invention, but the scope of protection of the present invention is not limited to these examples.
An embodiment of the invention provides a light field image salient object detection method based on a learnable weight descriptor, comprising the following steps:
S1, extracting full-focus features and focal stack features from a light field image;
S2, weighting the focal stack features to generate enhanced focal stack features, the weights being produced by a Transformer decoder from a learnable weight descriptor;
S3, hierarchically interacting the full-focus features with the enhanced focal stack features to generate multi-modal fusion features;
S4, decoding the multi-modal fusion features to generate a saliency map;
S5, supervising the saliency map with the ground-truth saliency map, and forming a light field image salient object detection model through training on a training set;
S6, detecting any light field image with the light field image salient object detection model, and outputting a saliency map as the detection result through steps S1 to S4.
Further, in step S1, the full-focus features and the focal stack features are extracted by two Pyramid Vision Transformer neural network models pre-trained on ImageNet, one per modality, yielding full-focus features F_{0,l} and focal stack features F_{k,l}, where the subscript k denotes the index of the focal slice in the focal stack and runs from 1 to 12 (k = 0 denoting the full-focus image), and l denotes the layer index, corresponding to the four stages of the Pyramid Vision Transformer, with l a natural number from 1 to 4.
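By way of illustration, this extraction step can be sketched in PyTorch as follows. The backbone below is a stand-in module returning four stages of features (the embodiment uses Pyramid Vision Transformer backbones pre-trained on ImageNet); the channel widths, input resolution, and helper names are illustrative assumptions, not the embodiment's exact configuration.

import torch
import torch.nn as nn

class StandInBackbone(nn.Module):
    # Stand-in for a four-stage Pyramid Vision Transformer backbone: each stage
    # halves the resolution and widens the channels (PVT's real strides differ).
    def __init__(self, channels=(64, 128, 320, 512)):
        super().__init__()
        stages, in_ch = [], 3
        for out_ch in channels:
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                nn.ReLU(inplace=True)))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # layer features for l = 1..4

rgb_backbone = StandInBackbone()   # for the full-focus image
fs_backbone = StandInBackbone()    # for the focal stack

aif = torch.randn(1, 3, 256, 256)       # full-focus image
stack = torch.randn(12, 3, 256, 256)    # K = 12 focal slices, batched over k

F0 = rgb_backbone(aif)    # full-focus features F_{0,l}, l = 1..4
Fk = fs_backbone(stack)   # focal stack features F_{k,l}: Fk[l] has shape (12, C_l, H_l, W_l)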
Further, in step S2, the focal stack features are weighted to generate enhanced focal stack features, the weights being produced by a Transformer decoder from a learnable weight descriptor. The specific operations are as follows:
S2.1: The full-focus features and the focal stack features obtained in step S1 have different resolutions and channel numbers. To reduce the amount of computation, dilated convolutions are used to enlarge the receptive field and compress the channels to 32, described as:
f_{k,l} = RFB(F_{k,l}), l = 1, ..., 4, k = 0, ..., 12
where f_{k,l} denotes the features of the k-th focal slice at layer l after the dilated convolution and channel compression, l runs from 1 to 4, and k runs from 0 to 12, indicating that RFB(·) operates on the full-focus features as well as on the focal stack features; the RFB(·) operation refers to the receptive-field enlarging module composed of dilated convolutions with different dilation rates proposed in the paper "Receptive Field Block Net for Accurate and Fast Object Detection";
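A hedged sketch of this channel-compression step is given below: a simplified RFB-style block with parallel dilated 3x3 convolutions of different rates whose outputs are fused and reduced to 32 channels. The branch count and dilation rates loosely follow the cited RFB paper and are assumptions, not the embodiment's exact configuration.

import torch
import torch.nn as nn

class SimpleRFB(nn.Module):
    # Simplified receptive-field block: parallel dilated 3x3 convolutions whose
    # outputs are concatenated and fused down to 32 channels.
    def __init__(self, in_ch, out_ch=32, dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1),                          # reduce channels first
                nn.Conv2d(out_ch, out_ch, 3, padding=d, dilation=d),  # enlarge receptive field
                nn.ReLU(inplace=True))
            for d in dilations])
        self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

rfb = SimpleRFB(in_ch=512)
f4 = torch.randn(12, 512, 8, 8)   # layer-4 focal stack features, K = 12 slices
f4_c = rfb(f4)                    # compressed to (12, 32, 8, 8)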
s2.2: defining a weight descriptor Q, and learning the weights of different focal slices, regions and channels by using Q; taking the characteristics of a 4-layer focus stack after channel compression as K and V, and sending the K and V and the designed Q into a transform decoder together, wherein the characteristics are specifically described as follows:
Figure BDA0003820853400000033
the above-mentioned
Figure BDA0003820853400000034
Represents the learned weight descriptor, Q p A query point representing a query Q is shown,
Figure BDA0003820853400000035
is a multi-layer focus stack characteristic after channel compression, MSDeformAttn (·) operation refers to the paper "Deformable DETR: a multi-scale Deformable transform decoder as set forth in Deformable transforms for End-to-End Object Detection;
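The following sketch illustrates the idea of learning the weight descriptor with a Transformer decoder. For brevity, standard multi-head cross-attention stands in for the multi-scale deformable attention (MSDeformAttn) of Deformable DETR, so this is a conceptual approximation rather than the embodiment's exact operator; the query layout (one token per slice and spatial cell) and all dimensions are assumptions.

import torch
import torch.nn as nn

class WeightDescriptorDecoder(nn.Module):
    # Learnable weight-descriptor queries attend over the channel-compressed
    # focal stack features of all four layers; plain multi-head cross-attention
    # stands in for the MSDeformAttn operator of Deformable DETR.
    def __init__(self, dim=32, n_slices=12, grid=4, n_heads=4):
        super().__init__()
        self.n_slices, self.grid = n_slices, grid
        # One query token per (focal slice, spatial cell): after learning, these
        # tokens encode slice-wise, region-wise and channel-wise weights.
        self.q = nn.Parameter(torch.randn(n_slices * grid * grid, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, feats):
        # feats: list over layers l of (K, C, H_l, W_l) compressed focal stack features
        kv = torch.cat([f.permute(0, 2, 3, 1).reshape(-1, f.shape[1]) for f in feats])
        kv = kv.unsqueeze(0)                               # keys/values: (1, tokens, C)
        q_hat, _ = self.attn(self.q.unsqueeze(0), kv, kv)  # learned descriptor Q'
        # Reshape(.): recover a (K, C, h, w) weight volume from the token sequence
        return q_hat.squeeze(0).reshape(self.n_slices, self.grid, self.grid, -1).permute(0, 3, 1, 2)

decoder = WeightDescriptorDecoder()
feats = [torch.randn(12, 32, s, s) for s in (32, 16, 8, 4)]   # four layers
q_hat = decoder(feats)                                        # (12, 32, 4, 4)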
s2.3: the multi-layer focus stack characteristics after the channel compression and the learned weight descriptors are described
Figure BDA0003820853400000036
Element-by-element multiplication is performed to weight the focal plate and the regions and channels respectively, thereby forming an enhanced multi-layer focal stack feature, which is described in detail as:
Figure BDA0003820853400000037
the above-mentioned
Figure BDA0003820853400000038
Representing a 4-layer enhanced focal Stack feature, the Reshape (·) operation represents the weight descriptor
Figure BDA0003820853400000039
Recovering the feature from the sequence when
Figure BDA00038208534000000310
Up-sampling operation Up (-) pair is used when the resolution of the multi-layer focus stack feature is not consistent
Figure BDA00038208534000000311
Implementing an increase resolution, "×" refers to element-by-element multiplication operation;
s2.4: cascading the 12 focal stack features of each layer to form an enhanced focal stack feature, which is specifically described as:
Figure BDA00038208534000000312
said H l Represents a 4-layer enhanced focus stack feature, the Concat (·) operation represents a cascading operation;
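Continuing the sketch, the learned descriptor is reshaped into feature-map form, upsampled to each layer's resolution, multiplied element-wise with the compressed focal stack features, and the 12 weighted slices are concatenated along the channel dimension. Shapes and helper names remain illustrative assumptions.

import torch
import torch.nn.functional as F

def enhance_and_concat(f_l, q_hat):
    # f_l: (K, 32, H_l, W_l) compressed focal stack features of one layer;
    # q_hat: (K, 32, h, w) learned weight descriptor in feature-map form.
    # Up(.): raise the descriptor's resolution to match this layer.
    w = F.interpolate(q_hat, size=f_l.shape[-2:], mode='bilinear', align_corners=False)
    e_l = f_l * w   # "x": element-wise slice/region/channel weighting
    # Concat(.): stack the K weighted slices along channels -> H_l with K*32 channels.
    return e_l.reshape(1, -1, *f_l.shape[-2:])

q_hat = torch.randn(12, 32, 4, 4)    # stand-in for the decoder output above
f1 = torch.randn(12, 32, 32, 32)     # layer-1 compressed focal stack features
H1 = enhance_and_concat(f1, q_hat)   # enhanced focal stack feature H_1: (1, 384, 32, 32)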
s2.5: in order to make the model learn the correct weight descriptor, the enhanced focus stack features of each layer are supervised by a saliency map value GT, which is specifically described as:
Figure BDA00038208534000000313
the Pred (-) operation represents the prediction header that produces the saliency map, loss (-) refers to the paper F 3 Net: pixel position perception loss function as proposed in Fusion, feed and Focus for sale Object Detection.
Further, in step S3, the full-focus features and the enhanced focal stack features interact hierarchically to generate the multi-modal fusion features. The specific operations are as follows, with a sketch of both branches after S3.2:
S3.1: The full-focus features and the enhanced focal stack features of the two high layers are flattened and concatenated at the token level; the concatenated sequence is fed into a multi-head self-attention layer so that the full-focus features and the enhanced focal stack features interact fully; the interacted features are then separated and restored to the original resolution; finally the two interaction-enhanced features are added, described as:
D_l = Sum(Split(MHSA(Merge(f_{0,l}, H_l)))), l = 3, 4
where D_l (l = 3, 4) denotes the multi-modal fusion features of the two high layers, the Merge(·) operation denotes token-level concatenation, the MHSA(·) operation denotes the multi-head self-attention layer, the Split(·) operation denotes splitting the token sequence into two halves from the middle, and the Sum(·) operation denotes element-wise addition of the two split token sequences;
S3.2: Global max pooling, convolution, and an activation are applied to the full-focus features of the two low layers to obtain a spatial weight; the enhanced focal stack features are multiplied element by element with this spatial weight obtained from the full-focus features; finally a residual connection is applied, described as:
D_l = σ(Conv(P(f_{0,l}))) × H_l + H_l, l = 1, 2
where D_l (l = 1, 2) denotes the multi-modal fusion features of the two low layers, the P(·) operation denotes global max pooling along the channel dimension, Conv(·) denotes a convolutional layer, σ(·) denotes the activation function, and "+" denotes element-wise addition.
Further, in step S4, the multi-modal fusion features are decoded to generate the saliency map. The specific operations are as follows:
S4.1: The multi-modal fusion features are upsampled and added layer by layer, described as:
D = Up_2(Up_2(Up_2(D_4) + D_3) + D_2) + D_1
where D denotes the final feature after fusing the four layers of multi-modal fusion features and Up_2(·) denotes two-fold upsampling;
S4.2: A convolution with one output channel, an activation, and upsampling are applied to the final feature D to restore the original input image size, described as:
S = Sig(Up_4(Conv(D)))
where S denotes the saliency map, Up_4(·) denotes four-fold upsampling, and Sig(·) denotes the Sigmoid activation function.
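A minimal sketch of the decoding stage, assuming 32-channel fusion features and a 256x256 input: the four fusion features are upsampled and added layer by layer, then a one-channel convolution, four-fold upsampling, and a Sigmoid produce the final saliency map.

import torch
import torch.nn as nn
import torch.nn.functional as F

def up2(x):
    return F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)

def decode(D1, D2, D3, D4, head):
    # D = Up_2(Up_2(Up_2(D_4) + D_3) + D_2) + D_1
    D = up2(up2(up2(D4) + D3) + D2) + D1
    # S = Sig(Up_4(Conv(D))): one output channel, restored to the input resolution.
    logits = F.interpolate(head(D), scale_factor=4, mode='bilinear', align_corners=False)
    return torch.sigmoid(logits)

head = nn.Conv2d(32, 1, 3, padding=1)
D4, D3 = torch.randn(1, 32, 8, 8), torch.randn(1, 32, 16, 16)
D2, D1 = torch.randn(1, 32, 32, 32), torch.randn(1, 32, 64, 64)
S = decode(D1, D2, D3, D4, head)   # saliency map S: (1, 1, 256, 256)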
Further, in step S5, the saliency map is supervised with the ground-truth saliency map, and the light field image salient object detection model is formed through training on a training set; the training set uses 1000 pictures from the DUTLF-FS dataset and 100 pictures from the HFUT-Lytro dataset, and the supervision uses a cross-entropy loss.
Further, in step S6, any light field image is detected with the light field image salient object detection model and, through steps S1 to S4, a saliency map is output as the detection result; the test set uses the pictures of the DUTLF-FS and HFUT-Lytro datasets outside the training set, together with the LFSD dataset.
The method is compared with two RGB, two RGB-D, and eight light field image salient object detection methods: PoolNet [1], PGNet [2], BBSNet [3], SwinNet [4], MoLF [5], DLFS [6], LFNet [7], ERNet [8], SA-Net [9], DLGLRG [10], PANet [11], and MEANet [12]; the results are shown in Table 1.
TABLE 1 Experimental results (the table is rendered as an image in the original publication; its numerical values are not recoverable here)
[1] Liu, J.-J.; Hou, Q.; Cheng, M.-M.; Feng, J.; and Jiang, J. 2019. A Simple Pooling-Based Design for Real-Time Salient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3917–3926.
[2] Xie, C.; Xia, C.; Ma, M.; Zhao, Z.; Chen, X.; and Li, J. 2022. Pyramid Grafting Network for One-Stage High Resolution Saliency Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11717–11726.
[3] Fan, D.-P.; Zhai, Y.; Borji, A.; Yang, J.; and Shao, L. 2020. BBS-Net: RGB-D Salient Object Detection with a Bifurcated Backbone Strategy Network. In European Conference on Computer Vision, 275–292. Springer.
[4] Liu, Z.; Tan, Y.; He, Q.; and Xiao, Y. 2022. SwinNet: Swin Transformer Drives Edge-Aware RGB-D and RGB-T Salient Object Detection. IEEE Transactions on Circuits and Systems for Video Technology, 32(7): 4486–4497.
[5] Zhang, M.; Li, J.; Wei, J.; Piao, Y.; and Lu, H. 2019. Memory-Oriented Decoder for Light Field Salient Object Detection. Advances in Neural Information Processing Systems, 32: 1–11.
[6] Piao, Y.; Rong, Z.; Zhang, M.; Li, X.; and Lu, H. 2019. Deep Light-Field-Driven Saliency Detection from a Single View. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 904–911.
[7] Zhang, M.; Ji, W.; Piao, Y.; Li, J.; Zhang, Y.; Xu, S.; and Lu, H. 2020. LFNet: Light Field Fusion Network for Salient Object Detection. IEEE Transactions on Image Processing, 29: 6276–6287.
[8] Piao, Y.; Rong, Z.; Zhang, M.; and Lu, H. 2020. Exploit and Replace: An Asymmetrical Two-Stream Architecture for Versatile Light Field Saliency Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 11865–11873.
[9] Zhang, Y.; Chen, G.; Chen, Q.; Sun, Y.; Xia, Y.; Deforges, O.; Hamidouche, W.; and Zhang, L. 2021. Learning Synergistic Attention for Light Field Salient Object Detection. In Proceedings of the British Machine Vision Conference, 1–14.
[10] Liu, N.; Zhao, W.; Zhang, D.; Han, J.; and Shao, L. 2021. Light Field Saliency Detection with Dual Local Graph Learning and Reciprocative Guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4712–4721.
[11] Piao, Y.; Jiang, Y.; Zhang, M.; Wang, J.; and Lu, H. 2021. PANet: Patch-Aware Network for Light Field Salient Object Detection. IEEE Transactions on Cybernetics, 1–13.
[12] Jiang, Y.; Zhang, W.; Fu, K.; and Zhao, Q. 2022. MEANet: Multi-Modal Edge-Aware Network for Light Field Salient Object Detection. Neurocomputing, 491: 78–90.
As shown in Table 1, the method of the embodiment of the invention achieves the best results on the S-measure, adaptive F-measure, adaptive E-measure, and MAE evaluation metrics.
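For reference, two of the cited metrics can be computed as below: MAE is the mean absolute error between the predicted map and the ground truth, and the adaptive F-measure binarizes the prediction at twice its mean value. These implementations follow the common definitions in the salient object detection literature, not code from the patent.

import numpy as np

def mae(pred, gt):
    # Mean absolute error between a [0, 1] saliency map and a binary ground truth.
    return np.abs(pred - gt).mean()

def adaptive_fmeasure(pred, gt, beta2=0.3):
    # Binarize at twice the mean saliency value (capped at 1), then F-measure
    # with beta^2 = 0.3 as is conventional in salient object detection.
    thr = min(2 * pred.mean(), 1.0)
    binary = pred >= thr
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

pred = np.random.rand(256, 256)
gt = (np.random.rand(256, 256) > 0.5).astype(float)
print(mae(pred, gt), adaptive_fmeasure(pred, gt))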
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (1)

1. A light field image salient object detection method based on a learnable weight descriptor, characterized by comprising the following steps:
S1, extracting full-focus features and focal stack features from a light field image;
S2, weighting the focal stack features to generate enhanced focal stack features, the weights being produced by a Transformer decoder from a learnable weight descriptor;
S3, hierarchically interacting the full-focus features with the enhanced focal stack features to generate multi-modal fusion features;
S4, decoding the multi-modal fusion features to generate a saliency map;
S5, supervising the saliency map with the ground-truth saliency map, and forming a light field image salient object detection model through training on a training set;
S6, detecting any light field image with the light field image salient object detection model, and outputting a saliency map as the detection result through steps S1 to S4.
CN202211047306.3A 2022-08-29 2022-08-29 Light field image salient object detection method based on learnable weight descriptor Pending CN115546512A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211047306.3A CN115546512A (en) 2022-08-29 2022-08-29 Light field image salient object detection method based on learnable weight descriptor

Publications (1)

Publication Number Publication Date
CN115546512A 2022-12-30

Family

ID=84725259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211047306.3A Pending CN115546512A (en) 2022-08-29 2022-08-29 Light field image salient object detection method based on learnable weight descriptor

Country Status (1)

Country Link
CN (1) CN115546512A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117253054A (en) * 2023-11-20 2023-12-19 浙江优众新材料科技有限公司 Light field significance detection method and related equipment thereof
CN117253054B (en) * 2023-11-20 2024-02-06 浙江优众新材料科技有限公司 Light field significance detection method and related equipment thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination