CN115546512A - Light field image salient object detection method based on learnable weight descriptor - Google Patents
- Publication number: CN115546512A
- Application number: CN202211047306.3A
- Authority: CN (China)
- Prior art keywords: light field image, features, focus stack
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
- G06V10/42—Global feature extraction by analysis of the whole pattern
- G06V10/44—Local feature extraction by analysis of parts of the pattern
- G06V10/806—Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
- G06V10/82—Image or video recognition or understanding using neural networks
- G06V2201/07—Target detection
- G06N3/08—Learning methods for neural networks
Abstract
The invention discloses a light field image salient object detection method based on a learnable weight descriptor, comprising the following steps: S1, extracting an all-focus feature and a focal stack feature from a light field image; S2, weighting the focal stack features to generate enhanced focal stack features, the weighting being implemented by a Transformer decoder via a learnable weight descriptor; S3, hierarchically interacting the all-focus features with the enhanced focal stack features to generate multi-modal fusion features; S4, decoding the multi-modal fusion features to generate a saliency map; S5, supervising the saliency map with its ground truth and training on a training set to form a light field image salient object detection model; and S6, detecting any light field image with the trained model, outputting a saliency map as the detection result through steps S1 to S4. The learnable weight descriptor weights the focal stack features across focal slices as well as spatially and per channel so as to extract more effective information, while hierarchical multi-modal fusion promotes sufficient interaction of information between the modalities, realizes effective fusion, and improves detection accuracy.
Description
Technical Field
The invention relates to the field of computer vision, and in particular to a light field image salient object detection method based on a learnable weight descriptor.
Background
A light field image consists of an all-focus image and a focal stack containing a series of images focused at different depths. The all-focus image and the focal stack correspond to different modalities of the same scene: the all-focus modality emphasizes appearance and global information, while the focal stack modality emphasizes geometry and region information. Two fusion problems must therefore be solved. First, the images in the focal stack, each focused at a different depth, must be fused effectively so that they complement one another. Second, the two modalities, the all-focus image and the focal stack, must be fused effectively to reduce the gap between them, maximize their common information, and provide better features for the salient object decoding stage.
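The two-modality input described here can be pictured at the level of array shapes. The following is a shape-level sketch with hypothetical sizes (numpy standing in for any tensor library); the 12 focal slices match the embodiment described later in this document, while the 256×256 resolution is an illustrative assumption:

```python
import numpy as np

# One all-focus image plus a focal stack of K slices focused at different
# depths. Both modalities describe the same scene at the same resolution.
K, H, W = 12, 256, 256
all_focus = np.zeros((H, W, 3), dtype=np.float32)       # appearance / global cues
focal_stack = np.zeros((K, H, W, 3), dtype=np.float32)  # geometry / region cues

# The spatial extents of the two modalities must match.
assert focal_stack.shape[1:] == all_focus.shape
```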
Disclosure of Invention
The technical problem to be solved by the invention is to provide a light field image salient object detection method based on a learnable weight descriptor, in which the learnable weight descriptor weights the focal stack features across focal slices as well as spatially and per channel so as to extract more effective information, and hierarchical multi-modal fusion reduces the gap between the all-focus modality and the focal stack modality and improves detection performance.
The technical scheme adopted by the invention is as follows:
A light field image salient object detection method based on a learnable weight descriptor, comprising the following steps:
S1, extracting an all-focus feature and a focal stack feature from a light field image;
S2, weighting the focal stack features to generate enhanced focal stack features, the weighting being implemented by a Transformer decoder via a learnable weight descriptor;
S3, hierarchically interacting the all-focus features with the enhanced focal stack features to generate multi-modal fusion features;
S4, decoding the multi-modal fusion features to generate a saliency map;
S5, supervising the saliency map with its ground truth, and training on a training set to form a light field image salient object detection model;
S6, detecting any light field image with the light field image salient object detection model, outputting a saliency map as the detection result through steps S1 to S4.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a light field image salient object detection method based on a learnable weight descriptor, in which the learnable weight descriptor weights the focal stack features across focal slices as well as spatially and per channel so as to extract more effective information, and hierarchical multi-modal fusion promotes sufficient interaction of information between the modalities, realizes effective fusion, and improves detection accuracy.
Drawings
FIG. 1 is a flow chart of a method for detecting a salient object in a light field image based on a learnable weight descriptor according to the present invention;
the present invention will be further described with reference to the following detailed description and accompanying drawings, but the embodiments of the invention are not limited thereto.
Detailed Description
The following examples are given for the detailed implementation and specific operation of the present invention, but the scope of the present invention is not limited to the following examples.
The embodiment of the invention provides a light field image salient object detection method based on a learnable weight descriptor, comprising the following steps:
S1, extracting an all-focus feature and a focal stack feature from a light field image;
S2, weighting the focal stack features to generate enhanced focal stack features, the weighting being implemented by a Transformer decoder via a learnable weight descriptor;
S3, hierarchically interacting the all-focus features with the enhanced focal stack features to generate multi-modal fusion features;
S4, decoding the multi-modal fusion features to generate a saliency map;
S5, supervising the saliency map with its ground truth, and training on a training set to form a light field image salient object detection model;
S6, detecting any light field image with the light field image salient object detection model, outputting a saliency map as the detection result through steps S1 to S4.
Further, in step S1, the all-focus feature and the focal stack features are extracted by two Pyramid Vision Transformer networks pre-trained on ImageNet, yielding the all-focus features F_0^l and the focal stack features F_k^l, where the subscript k indexes the focal slices in the focal stack (k from 1 to 12) and the superscript l indexes the layer, corresponding to the stages of the Pyramid Vision Transformer (l from 1 to 4).
Further, in step S2, the focal stack features are weighted to generate enhanced focal stack features; the weighting is implemented by a Transformer decoder via a learnable weight descriptor. The specific operations are as follows:
s2.1: the full focus feature and the focus stack feature obtained in step S1 have different resolutions and channel numbers, and in order to reduce the amount of computation, the dilated convolution is used to increase the receptive field and compress the channels into 32, which is specifically described as:
the above-mentionedRepresents the characteristics of the kth focal plate of the l layer after the expansion convolution operation and the channel compression, l is from 1 to 4,k from 0 to 12, and represents that RFB (-) operates on the full focus characteristic and the focus stack characteristic; RFB (. Cndot.) operation refers to the module of increasing the Receptive field composed of dilated convolutions of different dilation rates proposed in the paper "Receptive field block net for acurate and fast object detection";
s2.2: defining a weight descriptor Q, and learning the weights of different focal slices, regions and channels by using Q; taking the characteristics of a 4-layer focus stack after channel compression as K and V, and sending the K and V and the designed Q into a transform decoder together, wherein the characteristics are specifically described as follows:
the above-mentionedRepresents the learned weight descriptor, Q p A query point representing a query Q is shown,is a multi-layer focus stack characteristic after channel compression, MSDeformAttn (·) operation refers to the paper "Deformable DETR: a multi-scale Deformable transform decoder as set forth in Deformable transforms for End-to-End Object Detection;
s2.3: the multi-layer focus stack characteristics after the channel compression and the learned weight descriptors are describedElement-by-element multiplication is performed to weight the focal plate and the regions and channels respectively, thereby forming an enhanced multi-layer focal stack feature, which is described in detail as:
the above-mentionedRepresenting a 4-layer enhanced focal Stack feature, the Reshape (·) operation represents the weight descriptorRecovering the feature from the sequence whenUp-sampling operation Up (-) pair is used when the resolution of the multi-layer focus stack feature is not consistentImplementing an increase resolution, "×" refers to element-by-element multiplication operation;
s2.4: cascading the 12 focal stack features of each layer to form an enhanced focal stack feature, which is specifically described as:
said H l Represents a 4-layer enhanced focus stack feature, the Concat (·) operation represents a cascading operation;
s2.5: in order to make the model learn the correct weight descriptor, the enhanced focus stack features of each layer are supervised by a saliency map value GT, which is specifically described as:
the Pred (-) operation represents the prediction header that produces the saliency map, loss (-) refers to the paper F 3 Net: pixel position perception loss function as proposed in Fusion, feed and Focus for sale Object Detection.
Further, in step S3, the all-focus features and the enhanced focal stack features are hierarchically interacted to generate the multi-modal fusion features. The specific operations are as follows:
S3.1: The all-focus features and enhanced focal stack features of the two high layers are flattened and concatenated at the token level; the concatenated sequence is fed into a multi-head self-attention layer so that the all-focus features and the enhanced focal stack features interact fully; the interacted features are then separated and restored to their original resolutions; finally the two interacted, enhanced features are added:

D^l = Sum(Split(MHSA(Merge(f_0^l, H^l)))), l = 3, 4

where D^l (l = 3, 4) is the multi-modal fusion feature of the two high layers, the Merge(·) operation denotes token-level concatenation, the MHSA(·) operation denotes the multi-head self-attention layer, the Split(·) operation splits the token sequence into two halves from the middle, and the Sum(·) operation denotes element-by-element addition of the two split token sequences.
s3.2: performing global maximum pooling, convolution and activation operation on the full-focus features of the lower two layers to obtain spatial weight, multiplying the enhanced focus stack features and the spatial weight obtained through the full-focus features element by element, and finally performing residual connection, wherein the specific description is as follows:
the Dl (l =1,2) represents the fusion feature of the lower two-layered multi-modality, the P (·) operation represents the global maximum pooling of channel dimensions, conv (·) represents the convolutional layer, σ (·) represents the activation function, and "+" refers to the element-by-element addition operation.
Further, in step S4, the multi-modal fusion features are decoded to generate the saliency map. The specific operations are as follows:
S4.1: The multi-modal fusion features are upsampled and added layer by layer:

D = Up_2(Up_2(Up_2(D^4) + D^3) + D^2) + D^1

where D is the final feature after fusing the four layers of multi-modal fusion features, and Up_2(·) denotes two-fold upsampling.
s4.2: and (3) performing convolution, activation and upsampling on the final characteristic D with an output channel of 1 to restore to the original input image size, specifically describing as follows:
S=Sig(Up 4 (Conv(D)))
said S represents the saliency map, up 4 (. Cndot.) represents a quadruple upsampling and Sig (. Cndot.) represents a Sigmoid activation function.
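Steps S4.1 and S4.2 together form a top-down decoder. The sketch below uses nearest-neighbour upsampling and replaces the 1-channel convolution with a channel mean, both simplifying assumptions; the feature sizes follow typical Pyramid Vision Transformer strides and are hypothetical:

```python
import numpy as np

def up(x, s):
    """Nearest-neighbour upsampling by factor s (stand-in for Up_2 / Up_4)."""
    return x.repeat(s, axis=0).repeat(s, axis=1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Four multi-modal fusion features at decreasing resolution, C = 32.
d4 = rng.standard_normal((8, 8, 32))
d3 = rng.standard_normal((16, 16, 32))
d2 = rng.standard_normal((32, 32, 32))
d1 = rng.standard_normal((64, 64, 32))

# S4.1: D = Up_2(Up_2(Up_2(D^4) + D^3) + D^2) + D^1
d = up(up(up(d4, 2) + d3, 2) + d2, 2) + d1
# S4.2: S = Sig(Up_4(Conv(D))); the 1-channel Conv is sketched as a channel mean.
s = sigmoid(up(d.mean(axis=-1, keepdims=True), 4))
assert d.shape == (64, 64, 32)
assert s.shape == (256, 256, 1)
```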
Further, in step S5, the saliency map is supervised by the saliency map ground truth, and the light field image salient object detection model is formed by training on the training set. The training set uses 1000 pictures from the DUTLF-FS dataset and 100 pictures from the HFUT-Lytro dataset, and the supervision uses a cross-entropy loss.
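The cross-entropy supervision used in S5 can be written out directly. This is plain per-pixel binary cross-entropy; the auxiliary supervision in S2.5 additionally cites F3Net's pixel-position-aware loss, which is not reproduced here:

```python
import numpy as np

def bce(pred, gt, eps=1e-7):
    """Pixel-wise binary cross-entropy between a predicted saliency map and
    its ground truth, both valued in [0, 1]."""
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    return float(-(gt * np.log(pred) + (1 - gt) * np.log(1 - pred)).mean())

gt = np.array([[1.0, 0.0], [0.0, 1.0]])
perfect = bce(np.array([[1.0, 0.0], [0.0, 1.0]]), gt)  # near-zero loss
poor = bce(np.full((2, 2), 0.5), gt)                   # -log(0.5) per pixel
assert perfect < poor
```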
Further, in step S6, any light field image is detected with the light field image salient object detection model, and a saliency map is output as the detection result through steps S1 to S4. The test set uses the pictures of the DUTLF-FS and HFUT-Lytro datasets outside the training set, together with the LFSD dataset.
The method is compared with two RGB, two RGB-D and eight light field image salient object detection methods: PoolNet [1], PGNet [2], BBSNet [3], SwinNet [4], MoLF [5], DLFS [6], LFNet [7], ERNet [8], SA-Net [9], DLGLRG [10], PANet [11] and MEANet [12]. The results are shown in Table 1.
Table 1. Experimental results
[1] Liu, J.-J.; Hou, Q.; Cheng, M.-M.; Feng, J.; and Jiang, J. 2019. A Simple Pooling-Based Design for Real-Time Salient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3917–3926.
[2] Xie, C.; Xia, C.; Ma, M.; Zhao, Z.; Chen, X.; and Li, J. 2022. Pyramid Grafting Network for One-Stage High Resolution Saliency Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11717–11726.
[3] Fan, D.-P.; Zhai, Y.; Borji, A.; Yang, J.; and Shao, L. 2020. BBS-Net: RGB-D Salient Object Detection with a Bifurcated Backbone Strategy Network. In European Conference on Computer Vision, 275–292. Springer.
[4] Liu, Z.; Tan, Y.; He, Q.; and Xiao, Y. 2022. SwinNet: Swin Transformer Drives Edge-Aware RGB-D and RGB-T Salient Object Detection. IEEE Transactions on Circuits and Systems for Video Technology, 32(7): 4486–4497.
[5] Zhang, M.; Li, J.; Wei, J.; Piao, Y.; and Lu, H. 2019. Memory-Oriented Decoder for Light Field Salient Object Detection. Advances in Neural Information Processing Systems, 32: 1–11.
[6] Piao, Y.; Rong, Z.; Zhang, M.; Li, X.; and Lu, H. 2019. Deep Light-Field-Driven Saliency Detection from a Single View. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 904–911.
[7] Zhang, M.; Ji, W.; Piao, Y.; Li, J.; Zhang, Y.; Xu, S.; and Lu, H. 2020. LFNet: Light Field Fusion Network for Salient Object Detection. IEEE Transactions on Image Processing, 29: 6276–6287.
[8] Piao, Y.; Rong, Z.; Zhang, M.; and Lu, H. 2020. Exploit and Replace: An Asymmetrical Two-Stream Architecture for Versatile Light Field Saliency Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 11865–11873.
[9] Zhang, Y.; Chen, G.; Chen, Q.; Sun, Y.; Xia, Y.; Deforges, O.; Hamidouche, W.; and Zhang, L. 2021. Learning Synergistic Attention for Light Field Salient Object Detection. In Proceedings of the British Machine Vision Conference, 1–14.
[10] Liu, N.; Zhao, W.; Zhang, D.; Han, J.; and Shao, L. 2021. Light Field Saliency Detection with Dual Local Graph Learning and Reciprocative Guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4712–4721.
[11] Piao, Y.; Jiang, Y.; Zhang, M.; Wang, J.; and Lu, H. 2021. PANet: Patch-Aware Network for Light Field Salient Object Detection. IEEE Transactions on Cybernetics, 1–13.
[12] Jiang, Y.; Zhang, W.; Fu, K.; and Zhao, Q. 2022. MEANet: Multi-Modal Edge-Aware Network for Light Field Salient Object Detection. Neurocomputing, 491: 78–90.
As shown in Table 1, the method of the embodiment of the invention achieves the best results on the S-measure, adaptive F-measure, adaptive E-measure and MAE evaluation metrics.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (1)
1. A light field image salient object detection method based on a learnable weight descriptor, comprising the following steps:
S1, extracting an all-focus feature and a focal stack feature from a light field image;
S2, weighting the focal stack features to generate enhanced focal stack features, the weighting being implemented by a Transformer decoder via a learnable weight descriptor;
S3, hierarchically interacting the all-focus features with the enhanced focal stack features to generate multi-modal fusion features;
S4, decoding the multi-modal fusion features to generate a saliency map;
S5, supervising the saliency map with its ground truth, and training on a training set to form a light field image salient object detection model;
S6, detecting any light field image with the light field image salient object detection model, outputting a saliency map as the detection result through steps S1 to S4.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202211047306.3A | 2022-08-29 | 2022-08-29 | Light field image salient object detection method based on learnable weight descriptor

Publications (1)

Publication Number | Publication Date
---|---
CN115546512A (Application) | 2022-12-30

Family ID: 84725259

Country Status (1)

Country | Link
---|---
CN | CN115546512A (en), status Pending

Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN117253054A | 2023-11-20 | 2023-12-19 | 浙江优众新材料科技有限公司 | Light field significance detection method and related equipment thereof
CN117253054B | 2023-11-20 | 2024-02-06 | 浙江优众新材料科技有限公司 | Light field significance detection method and related equipment thereof
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ghosh et al. | Stacked spatio-temporal graph convolutional networks for action segmentation | |
CN111582316B (en) | RGB-D significance target detection method | |
WO2021018163A1 (en) | Neural network search method and apparatus | |
CN113076957A (en) | RGB-D image saliency target detection method based on cross-modal feature fusion | |
CN110543890A (en) | Deep neural network image matching method based on characteristic pyramid | |
CN109766918B (en) | Salient object detection method based on multilevel context information fusion | |
CN116758130A (en) | Monocular depth prediction method based on multipath feature extraction and multi-scale feature fusion | |
CN116580192A (en) | RGB-D semantic segmentation method and system based on self-adaptive context awareness network | |
Wang et al. | TF-SOD: a novel transformer framework for salient object detection | |
CN115546512A (en) | Light field image salient object detection method based on learnable weight descriptor | |
CN116229222A (en) | Light field saliency target detection method and device based on implicit graph learning | |
CN116485860A (en) | Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features | |
Zhang et al. | Spatial-information guided adaptive context-aware network for efficient RGB-D semantic segmentation | |
CN116884074A (en) | Lightweight face recognition method based on mixed attention mechanism | |
CN117078539A (en) | CNN-transducer-based local global interactive image restoration method | |
CN113298154B (en) | RGB-D image salient object detection method | |
CN112927250B (en) | Edge detection system and method based on multi-granularity attention hierarchical network | |
CN114821438A (en) | Video human behavior identification method and system based on multipath excitation | |
CN114613011A (en) | Human body 3D (three-dimensional) bone behavior identification method based on graph attention convolutional neural network | |
CN111047571B (en) | Image salient target detection method with self-adaptive selection training process | |
CN114639166A (en) | Examination room abnormal behavior recognition method based on motion recognition | |
CN110765864A (en) | Image pedestrian re-identification system and method based on resolution irrelevant features | |
Wang et al. | Gait Recognition based on lightweight CNNs | |
Mao et al. | Traffic Scene Object Detection Algorithm Based on Improved SSD | |
Wanjun et al. | Yoga action recognition based on STF-ResNet |
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination