CN114898120A - 360-degree image salient target detection method based on convolutional neural network - Google Patents

360-degree image salient target detection method based on convolutional neural network

Info

Publication number
CN114898120A
CN114898120A
Authority
CN
China
Prior art keywords
image
features
projection
feature
equidistant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210586991.0A
Other languages
Chinese (zh)
Other versions
CN114898120B (en)
Inventor
周晓飞
罗晨浩
张继勇
李世锋
周振
何帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202210586991.0A
Publication of CN114898120A
Application granted
Publication of CN114898120B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses a 360-degree image salient target detection method based on a convolutional neural network, which comprises the following steps: S1, image conversion; S2, building a feature pyramid network; S3, applying four feature aggregation modules, in each of which a feature conversion submodule converts the cube-projection features into equirectangular features and combines them with the original equirectangular image features, after which an atrous spatial pyramid pooling (ASPP) submodule refines the result, yielding multi-level aggregated features; and S4, concatenating the multi-level aggregated features and feeding them to an attention integration module, which infers spatial and channel attention to adaptively select reliable spatial and channel information, fuses it with the original features to obtain the final features, and completes the salient object detection. The method uses the image mapping relationship to construct corresponding cube-projection images from equirectangular 360-degree images and takes both image types as input, thereby alleviating the severe sphere-to-plane projection distortion that arises when only equirectangular images are used.

Description

360-degree image salient target detection method based on convolutional neural network
Technical Field
The invention relates to the technical field of computer vision, in particular to a 360-degree image salient target detection method based on a convolutional neural network.
Background
A 360-degree image, i.e., a 360-degree panoramic image, is obtained by shooting an existing scene from all directions with capture equipment and post-processing the result on a computer; it is a three-dimensional virtual scene display technology. As a new display form it has wide application, such as all-around presentation of tourist attractions and hotels, all-around analysis of road conditions for autonomous driving, and VR film and entertainment, none of which can develop without 360-degree imaging technology. Detecting salient objects in 360-degree images helps to quickly lock onto pedestrians and target buildings in a scene and is therefore of considerable research significance in different fields.
The detection and segmentation of salient objects in natural scenes, commonly referred to as salient object detection, aims to capture the visually most attractive objects in an image and can be applied in a wide range of vision tasks such as image and video segmentation, image understanding, semantic segmentation and image object emphasis. In recent years, with the continuous development of convolutional neural networks, conventional image salient object detection models have achieved high performance in limited-field-of-view scenes. A 360-degree panoramic image, however, is a novel image representation; the two common ways of presenting its global object information as a two-dimensional image are the equirectangular (equidistant) projection and the cube projection.
The equirectangular projection is the most common way of storing a 360-degree panoramic image as a standard 2D image: it displays the full-range image information of the real 3D world on a single two-dimensional plane, but the sphere-to-plane projection distortion corrupts the real semantic information. Although many researchers have proposed algorithms not based on convolutional networks to handle this corrupted information, most existing convolutional-neural-network-based salient object detection models still cannot accurately highlight salient objects from the distorted semantics, because convolutional neural networks are sensitive to regular grid data and insensitive to distorted data.
Compared with the equirectangular projection, the cube projection cuts the 360-degree panoramic image into the six faces of a cube and presents the global information as six images (up, down, left, right, front and back). Although salient object detection methods using such data introduce only little geometric distortion, edge details are often poorly rendered because of the discontinuity at the junctions between the faces of the cube images.
It can be seen that although both the equirectangular projection and the cube projection can present the global object information as two-dimensional images, sphere-to-plane projection distortion is inevitably introduced, so that directly employing conventional object detection models is unlikely to accurately highlight the salient objects in these images.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a 360-degree image salient object detection method based on a convolutional neural network: corresponding cube-projection images are constructed from equirectangular 360-degree images through the image mapping relationship, and both kinds of image are used as input, which alleviates the severe sphere-to-plane projection distortion caused by using equirectangular 360-degree images alone.
In order to solve the technical problems, the technical scheme of the invention is as follows:
A 360-degree image salient object detection method based on a convolutional neural network comprises the following steps:
S1, image conversion:
S1-1, creating a data set of equirectangular 360-degree images;
S1-2, establishing an image conversion module;
S1-3, after reading an equirectangular 360-degree image in the data set, generating the corresponding cube-projection images with the image conversion module;
S2, constructing a feature pyramid network and performing feature extraction on the equirectangular 360-degree image and the converted cube-projection images to obtain equirectangular 360-degree image features and cube-projection features;
S3, applying four identical feature aggregation modules, in each of which a feature conversion submodule converts the cube-projection features into equirectangular features and combines them with the equirectangular 360-degree image features, after which an atrous spatial pyramid pooling (ASPP) submodule refines the combined features, yielding multi-level aggregated features;
and S4, concatenating the multi-level aggregated features and feeding them to an attention integration module, which infers spatial and channel attention to adaptively select reliable spatial and channel information, fuses it with the original features to obtain the final features, and completes the salient object detection.
Preferably, in step S1-2, the corresponding cube-projection images are generated from the equirectangular 360-degree image by using the mapping relationship between the equirectangular projection and the cube projection.
Preferably, the mapping relationship between the equirectangular projection and the cube projection is expressed as:
q_i = R_{f_i} · p_i
φ_{f_i} = arctan2(q_x, q_z)
θ_{f_i} = arcsin(q_y / ||q_i||)
wherein θ_{f_i} and φ_{f_i} denote the latitude and longitude under the equirectangular projection, (q_x, q_y, q_z) are the x, y and z components of the coordinate q_i, R_{f_i} denotes the rotation matrix of a given imaging plane f_i, p_i = (x, y, z)^T is a point on the known imaging plane f_i, and x, y, z are the three-dimensional coordinates of p_i.
preferably, the image data input by the feature pyramid network comprises an equidistant 360-degree image and a cubic projection image, and the equidistant 360-degree image and the cubic projection image corresponding to the equidistant 360-degree image form an image sample.
Preferably, the feature pyramid network is constructed as follows: an FPN is adopted as the backbone network, with its bottom-up path built on ResNet-50.
Preferably, in step S2, the feature extraction proceeds as follows:
the feature pyramid network extracts features from the seven input images of each image sample, namely the equirectangular projection image and the six face images of the cube projection (up, down, left, right, front and back), to obtain the equirectangular image features and the cube-projection features;
within each independent FPN feature extraction module, the ResNet backbone serves as the feed-forward part: each stage downsamples with stride 2, the features of stages 2-5 participate in prediction, and the outputs of conv2-conv5 (the last residual block of each stage) are taken as FPN features with downsampling factors of 4, 8, 16 and 32 with respect to the input image; in the top-down path each smaller feature map is upsampled to the size of the feature map one level below and fused with the corresponding lateral features, and the per-level feature results F1-F4 are output layer by layer.
Preferably, in step S3, a set of four groups of features is output by the four identical feature aggregation modules.
Preferably, the feature conversion submodule converts the six cube-projection features into equirectangular-projection features by using the mapping relationship between the equirectangular image features and the cube-projection features, and combines them with the features extracted from the original equirectangular image to obtain the final mixed features.
Preferably, the atrous spatial pyramid pooling submodule performs its optimization as follows: the given input is sampled in parallel with atrous convolutions of different sampling rates, the results are concatenated so that the number of channels is expanded, and a 1 × 1 convolution then reduces the number of channels to the expected value, which is equivalent to capturing the image context at multiple scales.
The invention has the following characteristics and beneficial effects:
the image mapping relation is used for constructing a corresponding cubic projection image based on the equidistant 360-degree image, and the problem of poor distortion of spherical surface-to-plane projection caused by single equal-rectangular image input is solved by using the double-type image as input.
The feature pyramid network extracts features from the image at each scale to generate a multi-scale feature representation, fusing low-resolution feature maps with strong semantic information and high-resolution feature maps with weaker semantic but rich spatial information, while adding little computation.
The spatial and channel attention mechanism adaptively selects spatial and channel information, so that the final features are more reliable and a more accurate saliency map is generated.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a block diagram of an embodiment of the present invention;
FIG. 2 is a block diagram of step S2 in the embodiment of the present invention;
FIG. 3 is a block diagram of step S3 in the embodiment of the present invention;
FIG. 4 is a diagram of the ASPP submodule of step S3 in the embodiment of the present invention;
FIG. 5 is a block diagram of step S4 in the embodiment of the present invention;
FIG. 6 is a diagram of the attention mechanism submodule of step S4 in the embodiment of the present invention;
FIG. 7 is a diagram showing the results of the embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention. Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art through specific situations.
The invention provides a 360-degree image salient object detection method based on a convolutional neural network, which comprises the following steps:
S1, image conversion:
S1-1, creating a data set of equirectangular 360-degree images.
It should be noted that the data set adopted in this embodiment is the public 360-SOD data set, which contains 500 high-resolution equirectangular 360-degree images and their corresponding saliency maps, and the salient objects in the images are mostly people. In this embodiment 400 of the images are used as the training set and 100 as the test set for training, testing and evaluating the model. To keep the input data consistent, the input equirectangular 360-degree images are resized to 1024 × 512 and the cube-projection images to 256 × 256.
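For illustration, a minimal PyTorch-style data loading sketch is given below. It assumes a hypothetical directory layout (images/ and masks/ under a root folder) and a hypothetical callable cube_converter that performs the equirectangular-to-cube conversion of step S1-2 (a possible implementation is sketched after that step); the 1024 × 512 and 256 × 256 sizes follow this embodiment.

    import os
    import numpy as np
    import torch
    from PIL import Image
    from torch.utils.data import Dataset
    import torchvision.transforms as T

    class SOD360Dataset(Dataset):
        # Loads one equirectangular 360-degree image, its six cube faces and its
        # saliency mask. The directory layout and the cube_converter callable are
        # assumptions for illustration, not part of the patent.
        def __init__(self, root, names, cube_converter):
            self.root = root
            self.names = names              # list of image file names (e.g. the 400 training images)
            self.e2c = cube_converter       # callable: HxWx3 array -> list of six face arrays
            self.to_tensor = T.ToTensor()

        def __len__(self):
            return len(self.names)

        def __getitem__(self, idx):
            name = self.names[idx]
            erp = Image.open(os.path.join(self.root, "images", name)).convert("RGB")
            mask = Image.open(os.path.join(self.root, "masks", name)).convert("L")
            erp = erp.resize((1024, 512))   # equirectangular input size of this embodiment
            mask = mask.resize((1024, 512))
            faces = self.e2c(np.array(erp), face_size=256)               # six 256 x 256 cube faces
            faces_t = torch.stack([self.to_tensor(f) for f in faces])    # (6, 3, 256, 256)
            return self.to_tensor(erp), faces_t, self.to_tensor(mask)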
S1-2, establishing an image conversion module, which generates the corresponding cube-projection images from the equirectangular 360-degree image by using the mapping relationship between the equirectangular projection and the cube projection.
The mapping relationship between the equirectangular projection and the cube projection is expressed as:
q_i = R_{f_i} · p_i
φ_{f_i} = arctan2(q_x, q_z)
θ_{f_i} = arcsin(q_y / ||q_i||)
wherein θ_{f_i} and φ_{f_i} denote the latitude and longitude under the equirectangular projection and (q_x, q_y, q_z) are the x, y and z components of the coordinate q_i.
It will be appreciated that in the projected representation of an equirectangular 360-degree image, the cube projection is typically represented as 6 faces, each face being a square of side length w; the 6 faces are the up, down, front, back, left and right faces. Each face can be seen as an image taken independently by a camera with focal length w/2 (a 90-degree field of view), and the projection centres of the 6 cameras coincide in one point, the centre of the cube. If the origin of the world coordinate system is placed at the cube centre, the external parameters of the 6 cameras are given only by the rotation matrices R_{f_i}, with no translation vector. Given an imaging plane f_i in the camera system, a point p_i on it with three-dimensional coordinates x, y, z is written as p_i = (x, y, z)^T, and its corresponding point q_i on the sphere is obtained from the mapping relationship above.
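As a concrete illustration of this mapping, the sketch below samples the six cube faces from an equirectangular image with NumPy. The particular face rotation matrices and axis conventions are assumptions chosen for the example (the patent only requires that the six camera frames share the cube centre and differ by rotations R_{f_i}); the latitude and longitude formulas are the ones given above.

    import numpy as np

    def face_rotations():
        # One possible convention for the rotation matrices R_fi of the six faces
        # (front, right, back, left, up, down); other orderings work equally well.
        def rot_y(a):
            c, s = np.cos(a), np.sin(a)
            return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
        def rot_x(a):
            c, s = np.cos(a), np.sin(a)
            return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
        return [rot_y(0), rot_y(np.pi / 2), rot_y(np.pi), rot_y(-np.pi / 2),
                rot_x(-np.pi / 2), rot_x(np.pi / 2)]

    def e2c(erp, face_size=256):
        # Sample six cube faces from an equirectangular image (H x W x 3 uint8 array).
        H, W, _ = erp.shape
        w = face_size
        # Pixel grid on one face: p_i = (x, y, w/2), i.e. focal length w/2 (90-degree FOV).
        xs, ys = np.meshgrid(np.arange(w) - w / 2 + 0.5, np.arange(w) - w / 2 + 0.5)
        p = np.stack([xs, ys, np.full_like(xs, w / 2)], axis=-1)
        faces = []
        for R in face_rotations():
            q = p @ R.T                                              # q_i = R_fi · p_i
            lon = np.arctan2(q[..., 0], q[..., 2])                   # phi_fi
            lat = np.arcsin(q[..., 1] / np.linalg.norm(q, axis=-1))  # theta_fi
            # Map latitude/longitude to equirectangular pixel indices (nearest neighbour).
            u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
            v = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
            faces.append(erp[v, u])
        return faces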
S1-3, after reading an equirectangular 360-degree image from the data set, the corresponding cube-projection images are generated with the image conversion module.
S2, constructing a feature pyramid network and performing feature extraction on the equirectangular 360-degree image and the converted cube-projection images to obtain the equirectangular 360-degree image features and the cube-projection features.
Specifically, as shown in FIG. 2, the feature pyramid network is constructed as follows: an FPN is adopted as the backbone network, with its bottom-up path built on ResNet-50.
The ResNet-50-based feature pyramid network acquires features of the image at different levels, and its weights are shared across the inputs.
The image data input to the feature pyramid network comprise an equirectangular 360-degree image and cube-projection images, where an equirectangular 360-degree image together with its corresponding cube-projection images forms one image sample. The feature pyramid network extracts features from the seven input images of each image sample, namely the equirectangular projection image and the six face images of the cube projection (up, down, left, right, front and back), to obtain the equirectangular image features and the cube-projection features.
It should be noted that, because the model is trained with mixed dual-type data, a single sample contains one equirectangular projection image and six cube-projection images, and the module performs feature extraction on all seven images, so that a set of seven groups of features is finally output.
It should also be noted that the feature pyramid network constructed in this embodiment is used for feature extraction and can readily be obtained by those skilled in the art; as shown in FIG. 2, it comprises a ResNet-50 backbone and four output levels with strides of 4, 8, 16 and 32 with respect to the input.
Further, the feature extraction proceeds as follows:
within each independent FPN feature extraction module, the ResNet backbone serves as the feed-forward part: each stage downsamples with stride 2, the features of stages 2-5 participate in prediction, and the outputs of conv2-conv5 (the last residual block of each stage) are taken as FPN features with downsampling factors of 4, 8, 16 and 32 with respect to the input image; in the top-down path each smaller feature map is upsampled to the size of the feature map one level below and fused with the corresponding lateral features, and the per-level feature results F1-F4 are output layer by layer.
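The following sketch shows one way the shared ResNet-50 FPN backbone of step S2 could be instantiated with torchvision (it assumes a torchvision version whose resnet_fpn_backbone helper still accepts the pretrained argument); the tensor sizes follow the 1024 × 512 equirectangular and 256 × 256 cube inputs of this embodiment, and the wiring is an illustrative assumption rather than the exact network of the patent.

    import torch
    from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

    # Bottom-up path built on ResNet-50; lateral and top-down connections built by the FPN helper.
    backbone = resnet_fpn_backbone('resnet50', pretrained=True)

    erp = torch.randn(1, 3, 512, 1024)       # one equirectangular image (H = 512, W = 1024)
    faces = torch.randn(6, 3, 256, 256)      # its six cube faces

    # Weight sharing: the same backbone processes all seven images of a sample.
    erp_feats = backbone(erp)                # dict of feature maps at strides 4, 8, 16, 32 (keys '0'-'3', plus 'pool')
    face_feats = backbone(faces)
    F1 = erp_feats['0']                      # stride-4 feature map, 256 channels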
S3, as shown in FIG. 3, four identical feature aggregation modules are used to output a set of four groups of features; each feature aggregation module converts the cube-projection features into equirectangular features with a feature conversion submodule (C2E feature conversion module), combines them with the features of the equirectangular 360-degree image, and then uses an atrous spatial pyramid pooling submodule (ASPP submodule) to refine the combined features.
the conversion method of the characteristic conversion submodule comprises the following steps: and converting the 6 cube projection features into isometric projection features by utilizing the mapping relation between the isometric image features and the cube projection features.
It should be noted that the mapping relationship between the cube-projection features and the equirectangular-projection features is expressed as:
q_i = R_{f_i} · p_i
φ_{f_i} = arctan2(q_x, q_z)
θ_{f_i} = arcsin(q_y / ||q_i||)
wherein θ_{f_i} and φ_{f_i} denote the latitude and longitude under the equirectangular projection and (q_x, q_y, q_z) are the x, y and z components of the coordinate q_i.
It should be noted that in this embodiment the feature conversion is performed by the C2E feature conversion module, which is a conventional technical means and is therefore not described in further detail; see FIG. 3.
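A minimal sketch of one way such a C2E feature conversion could be realised with torch.nn.functional.grid_sample is given below; it inverts the mapping q_i = R_{f_i} · p_i for every equirectangular pixel, assumes the same face rotation convention as the earlier e2c() sketch, and handles face seams in a deliberately simple way, so it is an illustration rather than the module actually used in the patent.

    import math
    import torch
    import torch.nn.functional as F

    def c2e_features(face_feats, out_h, out_w, rotations):
        # face_feats: (6, C, w, w) cube-face features; returns (1, C, out_h, out_w)
        # equirectangular features by sampling the face that each viewing direction hits.
        device, dtype = face_feats.device, face_feats.dtype
        lon = (torch.arange(out_w, device=device, dtype=dtype) + 0.5) / out_w * 2 * math.pi - math.pi
        lat = (torch.arange(out_h, device=device, dtype=dtype) + 0.5) / out_h * math.pi - math.pi / 2
        lat, lon = torch.meshgrid(lat, lon, indexing='ij')
        # Unit direction q per pixel, consistent with phi = atan2(x, z) and theta = asin(y).
        q = torch.stack([torch.cos(lat) * torch.sin(lon),
                         torch.sin(lat),
                         torch.cos(lat) * torch.cos(lon)], dim=-1)          # (H, W, 3)
        out = torch.zeros(1, face_feats.shape[1], out_h, out_w, device=device, dtype=dtype)
        for i, R in enumerate(rotations):
            R = torch.as_tensor(R, dtype=dtype, device=device)
            p = q @ R                                    # rotate back into the face-i camera frame (R^T q)
            z = p[..., 2].clamp(min=1e-6)
            u, v = p[..., 0] / z, p[..., 1] / z          # perspective divide, in [-1, 1] inside the 90-degree FOV
            inside = (p[..., 2] > 0) & (u.abs() <= 1) & (v.abs() <= 1)
            grid = torch.stack([u, v], dim=-1).unsqueeze(0)                 # grid_sample expects (x, y) in [-1, 1]
            sampled = F.grid_sample(face_feats[i:i + 1], grid, align_corners=False)
            out += sampled * inside.unsqueeze(0).unsqueeze(0)               # seam pixels may be hit by two faces
        return out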
Further, as shown in FIG. 4, the atrous spatial pyramid pooling submodule performs its optimization as follows: the given input is sampled in parallel with atrous convolutions of different sampling rates, the results are concatenated so that the number of channels is expanded, and a 1 × 1 convolution then reduces the number of channels to the expected value, which is equivalent to capturing the image context at multiple scales.
It should be noted that in this embodiment the feature optimization is performed by the atrous spatial pyramid pooling submodule (ASPP submodule), which is a conventional technical means; as shown in FIG. 4, it comprises a 1 × 1 convolution layer, 3 × 3 atrous convolution layers with sampling rates of 6, 12 and 18, a pooling layer and an upsampling layer.
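A compact PyTorch sketch of such an ASPP submodule is shown below, using the sampling rates 6, 12 and 18 named above; the channel counts are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ASPP(nn.Module):
        # Parallel atrous convolutions with different rates plus image-level pooling;
        # the concatenation widens the channels and the final 1x1 convolution shrinks
        # them back to out_ch, capturing context at several scales.
        def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
            super().__init__()
            self.branches = nn.ModuleList(
                [nn.Conv2d(in_ch, out_ch, 1)] +
                [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
            self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
            self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

        def forward(self, x):
            h, w = x.shape[2:]
            feats = [branch(x) for branch in self.branches]
            feats.append(F.interpolate(self.pool(x), size=(h, w),
                                       mode='bilinear', align_corners=False))
            return self.project(torch.cat(feats, dim=1))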
S4, as shown in FIG. 5, the multi-level aggregated features are concatenated and fed to the attention integration module, which infers spatial and channel attention to adaptively select reliable spatial and channel information, fuses it with the original features to obtain the final features, and completes the salient object detection.
It should be noted that in this embodiment the feature fusion is performed by the attention integration module, which is a conventional technical means; as shown in FIG. 5, it comprises three 1 × 1 convolution layers, one 3 × 3 convolution layer, a spatial attention module and a channel attention module. The spatial attention module and the channel attention module are conventional in the art and are therefore not described in further detail in this embodiment.
As shown in FIG. 6, the spatial attention mechanism in this network first reduces each feature map along its channel dimension, concatenates the resulting single-channel maps, and then uses a convolution layer to learn an overall spatial attention map, which is fed back to the four feature levels for integration. The channel attention mechanism applies max pooling and mean pooling simultaneously to the overall four-level feature map, obtains a transformation result through a convolution layer, and finally applies it to all channels to obtain the attention value of each channel.
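The sketch below illustrates spatial and channel attention modules of the kind just described; it assumes the four levels of aggregated features have already been resized to a common resolution, and the exact layer sizes are assumptions for illustration.

    import torch
    import torch.nn as nn

    class SpatialAttention(nn.Module):
        # Reduce each level along its channel axis, concatenate the single-channel maps,
        # learn one overall spatial attention map and feed it back to every level.
        def __init__(self, num_levels=4, kernel_size=7):
            super().__init__()
            self.conv = nn.Conv2d(num_levels, 1, kernel_size, padding=kernel_size // 2)

        def forward(self, feats):                      # list of four (B, C, H, W) tensors
            reduced = [f.mean(dim=1, keepdim=True) for f in feats]
            attn = torch.sigmoid(self.conv(torch.cat(reduced, dim=1)))
            return [f * attn for f in feats]

    class ChannelAttention(nn.Module):
        # Max-pool and mean-pool the aggregated feature map, pass both results through a
        # shared bottleneck and use the sum as a per-channel weight.
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.mlp = nn.Sequential(nn.Conv2d(channels, channels // reduction, 1),
                                     nn.ReLU(inplace=True),
                                     nn.Conv2d(channels // reduction, channels, 1))

        def forward(self, x):                          # (B, C, H, W)
            avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
            mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
            return x * torch.sigmoid(avg + mx)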
In this embodiment the network model is built with the PyTorch framework; the sum of the cross-entropy loss and the mean-absolute-error loss is used as the loss function; the weights of the feature extraction module are initialized from a ResNet-50 model pre-trained on ImageNet, while the newly added convolution layers are initialized with the normal-distribution (Kaiming He) initialization. The model is trained end to end with the stochastic gradient descent (SGD) algorithm: the training batch size is set to 4, the momentum to 0.9, the weight decay to 0.0005, the initial learning rate to 0.002, and training runs for 40 epochs. The model generates a salient object prediction map for a 360-degree image; the prediction map is a grey-scale map with pixel values between 0 and 1, where 1 indicates the region of a salient object and 0 indicates the background.
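The training configuration above translates directly into the following sketch; model and loader stand for the hypothetical network and data loader of this embodiment, the cross-entropy term is interpreted as binary cross-entropy on the saliency map, and the network is assumed to end in a sigmoid so that predictions lie in [0, 1].

    import torch
    import torch.nn as nn

    def init_new_layer(m):
        # Kaiming (He) normal initialisation for the newly added convolution layers.
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, nonlinearity='relu')

    def saliency_loss(pred, target):
        # Sum of the cross-entropy (binary) and mean-absolute-error terms.
        bce = nn.functional.binary_cross_entropy(pred, target)
        mae = torch.abs(pred - target).mean()
        return bce + mae

    def train(model, loader, epochs=40):
        opt = torch.optim.SGD(model.parameters(), lr=0.002,
                              momentum=0.9, weight_decay=0.0005)
        model.train()
        for _ in range(epochs):
            for erp, faces, mask in loader:            # batches of 4 samples
                pred = model(erp, faces)               # saliency map in [0, 1]
                loss = saliency_loss(pred, mask)
                opt.zero_grad()
                loss.backward()
                opt.step()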
As can be seen from FIG. 7, this embodiment improves on existing conventional image salient object detection methods so that it can detect salient objects in equirectangular 360-degree images and obtains a better detection result. The network consists of four main modules: a data processing module (the E2C image conversion module) and three network structure modules (the feature pyramid network, the feature aggregation modules and the attention integration module). The image conversion module converts the equirectangular 360-degree image into cube-projection images and thereby builds the dual-type input data used by the network; taking both data types as input avoids the severe sphere-to-plane projection distortion caused by using equirectangular images alone. The FPN feature extraction module extracts multi-level features from the various input data with shared weights, the feature aggregation modules integrate and refine the multi-level features, and the attention integration module performs the final reliability-weighted selection and screening to obtain a high-quality saliency image. The result is a grey-scale image with pixel values in [0, 1], where 1 marks the region of a salient object in the 360-degree image and 0 marks the background, successfully realizing the salient object detection task for 360-degree images.
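For completeness, a short inference sketch follows; the model(erp, faces) call signature is an assumption carried over from the training sketch, and the output is the grey-scale saliency map described above.

    import torch

    @torch.no_grad()
    def predict(model, erp, faces):
        # erp: (3, 512, 1024) equirectangular tensor; faces: (6, 3, 256, 256) cube faces.
        model.eval()
        saliency = model(erp.unsqueeze(0), faces.unsqueeze(0))
        return saliency.squeeze()        # grey-scale map: 1 = salient region, 0 = background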
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments and their components without departing from the principles and spirit of the invention, and such variants still fall within the scope of the invention.

Claims (9)

1. A 360-degree image salient object detection method based on a convolutional neural network, characterized by comprising the following steps:
S1, image conversion:
S1-1, creating a data set of equirectangular 360-degree images;
S1-2, establishing an image conversion module;
S1-3, after reading an equirectangular 360-degree image in the data set, generating the corresponding cube-projection images with the image conversion module;
S2, constructing a feature pyramid network and performing feature extraction on the equirectangular 360-degree image and the converted cube-projection images to obtain equirectangular 360-degree image features and cube-projection features;
S3, applying four identical feature aggregation modules, in each of which a feature conversion submodule converts the cube-projection features into equirectangular features and combines them with the equirectangular 360-degree image features, after which an atrous spatial pyramid pooling submodule refines the combined features, yielding multi-level aggregated features;
and S4, concatenating the multi-level aggregated features and feeding them to an attention integration module, which infers spatial and channel attention to adaptively select reliable spatial and channel information, fuses it with the original features to obtain the final features, and completes the salient object detection.
2. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 1, characterized in that in step S1-2 the corresponding cube-projection images are generated from the equirectangular 360-degree image by using the mapping relationship between the equirectangular projection and the cube projection.
3. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 2, characterized in that the mapping relationship between the equirectangular projection and the cube projection is expressed as:
q_i = R_{f_i} · p_i
φ_{f_i} = arctan2(q_x, q_z)
θ_{f_i} = arcsin(q_y / ||q_i||)
wherein θ_{f_i} and φ_{f_i} denote the latitude and longitude under the equirectangular projection, (q_x, q_y, q_z) are the x, y and z components of the coordinate q_i, R_{f_i} denotes the rotation matrix of a given imaging plane f_i, p_i = (x, y, z)^T is a point on the known imaging plane f_i, and x, y, z are the three-dimensional coordinates of p_i.
4. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 1, characterized in that the image data input to the feature pyramid network comprise an equirectangular 360-degree image and cube-projection images, where an equirectangular 360-degree image together with its corresponding cube-projection images forms one image sample.
5. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 4, characterized in that the feature pyramid network is constructed as follows: an FPN is adopted as the backbone network, with its bottom-up path built on ResNet-50.
6. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 5, characterized in that in step S2 the feature extraction proceeds as follows:
the feature pyramid network extracts features from the seven input images of each image sample, namely the equirectangular projection image and the six face images of the cube projection (up, down, left, right, front and back), to obtain the equirectangular image features and the cube-projection features;
within each independent FPN feature extraction module, the ResNet backbone serves as the feed-forward part: each stage downsamples with stride 2, the features of stages 2-5 participate in prediction, and the outputs of conv2-conv5 (the last residual block of each stage) are taken as FPN features with downsampling factors of 4, 8, 16 and 32 with respect to the input image; in the top-down path each smaller feature map is upsampled to the size of the feature map one level below and fused with the corresponding lateral features, and the per-level feature results F1-F4 are output layer by layer.
7. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 1, characterized in that in step S3 a set of four groups of features is output by the four identical feature aggregation modules.
8. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 6, characterized in that the feature conversion submodule converts the six cube-projection features into equirectangular-projection features by using the mapping relationship between the cube-projection features and the equirectangular image features.
9. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 8, characterized in that the atrous spatial pyramid pooling submodule performs its optimization as follows: the given input is sampled in parallel with atrous convolutions of different sampling rates, the results are concatenated so that the number of channels is expanded, and a 1 × 1 convolution then reduces the number of channels to the expected value, which is equivalent to capturing the image context at multiple scales.
CN202210586991.0A 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network Active CN114898120B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210586991.0A CN114898120B (en) 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210586991.0A CN114898120B (en) 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN114898120A 2022-08-12
CN114898120B (en) 2023-04-07

Family

ID=82725996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210586991.0A Active CN114898120B (en) 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN114898120B (en)

Citations (6)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827193A (en) * 2019-10-21 2020-02-21 国家广播电视总局广播电视规划院 Panoramic video saliency detection method based on multi-channel features
CN111178163A (en) * 2019-12-12 2020-05-19 宁波大学 Cubic projection format-based stereo panoramic image salient region prediction method
CN112381813A (en) * 2020-11-25 2021-02-19 华南理工大学 Panorama visual saliency detection method based on graph convolution neural network
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN113536977A (en) * 2021-06-28 2021-10-22 杭州电子科技大学 Saliency target detection method facing 360-degree panoramic image
CN114359680A (en) * 2021-12-17 2022-04-15 中国人民解放军海军工程大学 Panoramic vision water surface target detection method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MENGKE HUANG et al.: "FANet: Features Adaptation Network for 360° Omnidirectional Salient Object Detection", IEEE Signal Processing Letters *
CHEN Bin et al.: "Farmland Obstacle Detection in Panoramic Images Based on Improved YOLO v3-tiny", Transactions of the Chinese Society for Agricultural Machinery *

Also Published As

Publication number Publication date
CN114898120B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN112894832B (en) Three-dimensional modeling method, three-dimensional modeling device, electronic equipment and storage medium
CN107945282B (en) Rapid multi-view three-dimensional synthesis and display method and device based on countermeasure network
CN114004941A (en) Indoor scene three-dimensional reconstruction system and method based on nerve radiation field
Zhang et al. A UAV-based panoramic oblique photogrammetry (POP) approach using spherical projection
CN114092780A (en) Three-dimensional target detection method based on point cloud and image data fusion
US20040032407A1 (en) Method and system for simulating stereographic vision
US20220230338A1 (en) Depth image generation method, apparatus, and storage medium and electronic device
TW201618545A (en) Preprocessor for full parallax light field compression
WO2023280038A1 (en) Method for constructing three-dimensional real-scene model, and related apparatus
WO2022151661A1 (en) Three-dimensional reconstruction method and apparatus, device and storage medium
JP2016537901A (en) Light field processing method
US11533431B2 (en) Method and device for generating a panoramic image
CN112330795B (en) Human body three-dimensional reconstruction method and system based on single RGBD image
KR20180053724A (en) Method for encoding bright-field content
CN115035235A (en) Three-dimensional reconstruction method and device
CN116051747A (en) House three-dimensional model reconstruction method, device and medium based on missing point cloud data
CN116778288A (en) Multi-mode fusion target detection system and method
GB2562488A (en) An apparatus, a method and a computer program for video coding and decoding
US10354399B2 (en) Multi-view back-projection to a light-field
CN115527016A (en) Three-dimensional GIS video fusion registration method, system, medium, equipment and terminal
CN117197388A (en) Live-action three-dimensional virtual reality scene construction method and system based on generation of antagonistic neural network and oblique photography
CN114758337A (en) Semantic instance reconstruction method, device, equipment and medium
CN113902802A (en) Visual positioning method and related device, electronic equipment and storage medium
CN114092540A (en) Attention mechanism-based light field depth estimation method and computer readable medium
Neumann et al. Eyes from eyes: analysis of camera design using plenoptic video geometry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant