CN114898120A - 360-degree image salient target detection method based on convolutional neural network - Google Patents

360-degree image salient target detection method based on convolutional neural network

Info

Publication number
CN114898120A
CN114898120A
Authority
CN
China
Prior art keywords
image
features
projection
feature
equidistant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210586991.0A
Other languages
Chinese (zh)
Other versions
CN114898120B (en)
Inventor
周晓飞
罗晨浩
张继勇
李世锋
周振
何帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202210586991.0A
Publication of CN114898120A
Application granted
Publication of CN114898120B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses a 360-degree image salient target detection method based on a convolutional neural network, which comprises the following steps: S1, image conversion; S2, building a feature pyramid network; S3, applying four feature aggregation modules, in each of which a feature conversion submodule converts the cube-projection features into equirectangular features and combines them with the original equirectangular image features, after which an atrous spatial pyramid pooling (ASPP) submodule refines the result, yielding multi-level aggregated features; and S4, concatenating the multi-level aggregated features and feeding them to an attention integration module, which infers spatial and channel attention to adaptively select reliable spatial and channel information, fuses it with the original features to obtain the final features, and completes the salient object detection. The method uses the image mapping relationship to construct corresponding cube-projection images from equirectangular 360-degree images and takes both image types as input, thereby alleviating the severe sphere-to-plane projection distortion that arises when only equirectangular images are used.

Description

360-degree image salient target detection method based on convolutional neural network
Technical Field
The invention relates to the technical field of computer vision, in particular to a 360-degree image salient target detection method based on a convolutional neural network.
Background
A 360-degree image, i.e., a 360-degree panoramic image, is obtained by shooting an existing scene from all directions with capture equipment and post-processing the result on a computer; it is a three-dimensional virtual scene display technology. As a new display form it has wide application, such as all-around presentation of tourist attractions and hotels, all-around analysis of road conditions for autonomous driving, and VR film and entertainment, none of which can develop without 360-degree imaging technology. Detecting salient objects in 360-degree images helps to quickly lock onto pedestrians and target buildings in a scene and is therefore of considerable research significance in different fields.
The detection and segmentation of salient objects in natural scenes, commonly referred to as salient object detection, aims to capture the visually most attractive objects in an image and can be applied in a wide range of vision tasks such as image and video segmentation, image understanding, semantic segmentation and image object emphasis. In recent years, with the continuous development of convolutional neural networks, conventional image salient object detection models have achieved high performance in limited-field-of-view scenes. A 360-degree panoramic image, however, is a novel image representation; the two common ways of presenting its global object information as a two-dimensional image are the equirectangular (equidistant) projection and the cube projection.
The equirectangular projection is the most common way of storing a 360-degree panoramic image as a standard 2D image: it displays the full-range image information of the real 3D world on a single two-dimensional plane, but the sphere-to-plane projection distortion corrupts the real semantic information. Although many researchers have proposed algorithms not based on convolutional networks to handle this corrupted information, most existing convolutional-neural-network-based salient object detection models still cannot accurately highlight salient objects from the distorted semantics, because convolutional neural networks are sensitive to regular grid data and insensitive to distorted data.
Compared with the equirectangular projection, the cube projection cuts the 360-degree panoramic image into the six faces of a cube and presents the global information as six images (up, down, left, right, front and back). Although salient object detection methods using such data introduce only little geometric distortion, edge details are often poorly rendered because of the discontinuity at the junctions between the faces of the cube images.
It can be seen that although both the equirectangular projection and the cube projection can present the global object information as two-dimensional images, sphere-to-plane projection distortion is inevitably introduced, so that directly employing conventional object detection models is unlikely to accurately highlight the salient objects in these images.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a 360-degree image salient object detection method based on a convolutional neural network: corresponding cube-projection images are constructed from equirectangular 360-degree images through the image mapping relationship, and both kinds of image are used as input, which alleviates the severe sphere-to-plane projection distortion caused by using equirectangular 360-degree images alone.
In order to solve the technical problems, the technical scheme of the invention is as follows:
A 360-degree image salient object detection method based on a convolutional neural network comprises the following steps:
S1, image conversion:
S1-1, creating a data set of equirectangular 360-degree images;
S1-2, establishing an image conversion module;
S1-3, after reading an equirectangular 360-degree image in the data set, generating the corresponding cube-projection images with the image conversion module;
S2, constructing a feature pyramid network and performing feature extraction on the equirectangular 360-degree image and the converted cube-projection images to obtain equirectangular 360-degree image features and cube-projection features;
S3, applying four identical feature aggregation modules, in each of which a feature conversion submodule converts the cube-projection features into equirectangular features and combines them with the equirectangular 360-degree image features, after which an atrous spatial pyramid pooling (ASPP) submodule refines the combined features, yielding multi-level aggregated features;
and S4, concatenating the multi-level aggregated features and feeding them to an attention integration module, which infers spatial and channel attention to adaptively select reliable spatial and channel information, fuses it with the original features to obtain the final features, and completes the salient object detection.
Preferably, in step S1-2, the corresponding cube-projection images are generated from the equirectangular 360-degree image by using the mapping relationship between the equirectangular projection and the cube projection.
Preferably, the mapping relationship between the equirectangular projection and the cube projection is expressed as:
q_i = R_{f_i} · p_i
φ_{f_i} = arctan2(q_x, q_z)
θ_{f_i} = arcsin(q_y / ||q_i||)
wherein θ_{f_i} and φ_{f_i} denote the latitude and longitude under the equirectangular projection, (q_x, q_y, q_z) are the x, y and z components of the coordinate q_i, R_{f_i} denotes the rotation matrix of a given imaging plane f_i, p_i = (x, y, z)^T is a point on the known imaging plane f_i, and x, y, z are the three-dimensional coordinates of p_i.
preferably, the image data input by the feature pyramid network comprises an equidistant 360-degree image and a cubic projection image, and the equidistant 360-degree image and the cubic projection image corresponding to the equidistant 360-degree image form an image sample.
Preferably, the feature pyramid network is constructed as follows: an FPN is adopted as the backbone network, with its bottom-up path built on ResNet-50.
Preferably, in step S2, the feature extraction proceeds as follows:
the feature pyramid network extracts features from the seven input images of each image sample, namely the equirectangular projection image and the six face images of the cube projection (up, down, left, right, front and back), to obtain the equirectangular image features and the cube-projection features;
within each independent FPN feature extraction module, the ResNet backbone serves as the feed-forward part: each stage downsamples with stride 2, the features of stages 2-5 participate in prediction, and the outputs of conv2-conv5 (the last residual block of each stage) are taken as FPN features with downsampling factors of 4, 8, 16 and 32 with respect to the input image; in the top-down path each smaller feature map is upsampled to the size of the feature map one level below and fused with the corresponding lateral features, and the per-level feature results F1-F4 are output layer by layer.
Preferably, in step S3, a set of four groups of features is output by the four identical feature aggregation modules.
Preferably, the feature conversion submodule converts the six cube-projection features into equirectangular-projection features by using the mapping relationship between the equirectangular image features and the cube-projection features, and combines them with the features extracted from the original equirectangular image to obtain the final mixed features.
Preferably, the atrous spatial pyramid pooling submodule performs its optimization as follows: the given input is sampled in parallel with atrous convolutions of different sampling rates, the results are concatenated so that the number of channels is expanded, and a 1 × 1 convolution then reduces the number of channels to the expected value, which is equivalent to capturing the image context at multiple scales.
The invention has the following characteristics and beneficial effects:
the image mapping relation is used for constructing a corresponding cubic projection image based on the equidistant 360-degree image, and the problem of poor distortion of spherical surface-to-plane projection caused by single equal-rectangular image input is solved by using the double-type image as input.
The feature pyramid network extracts features from the image at each scale to generate a multi-scale feature representation, fusing low-resolution feature maps with strong semantic information and high-resolution feature maps with weaker semantic but rich spatial information, while adding little computation.
The spatial and channel attention mechanism adaptively selects spatial and channel information, so that the final features are more reliable and a more accurate saliency map is generated.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a block diagram of an embodiment of the present invention;
FIG. 2 is a block diagram of step S2 in the embodiment of the present invention;
FIG. 3 is a block diagram of step S3 in the embodiment of the present invention;
FIG. 4 is a diagram of the ASPP submodule of step S3 in the embodiment of the present invention;
FIG. 5 is a block diagram of step S4 in the embodiment of the present invention;
FIG. 6 is a diagram of the attention mechanism submodule of step S4 in the embodiment of the present invention;
FIG. 7 is a diagram showing the results of the embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention. Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art through specific situations.
The invention provides a 360-degree image salient object detection method based on a convolutional neural network, which comprises the following steps:
S1, image conversion:
S1-1, creating a data set of equirectangular 360-degree images.
It should be noted that the data set adopted in this embodiment is the public 360-SOD data set, which contains 500 high-resolution equirectangular 360-degree images and their corresponding saliency maps, and the salient objects in the images are mostly people. In this embodiment 400 of the images are used as the training set and 100 as the test set for training, testing and evaluating the model. To keep the input data consistent, the input equirectangular 360-degree images are resized to 1024 × 512 and the cube-projection images to 256 × 256.
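For illustration, a minimal PyTorch-style data loading sketch is given below. It assumes a hypothetical directory layout (images/ and masks/ under a root folder) and a hypothetical callable cube_converter that performs the equirectangular-to-cube conversion of step S1-2 (a possible implementation is sketched after that step); the 1024 × 512 and 256 × 256 sizes follow this embodiment.

    import os
    import numpy as np
    import torch
    from PIL import Image
    from torch.utils.data import Dataset
    import torchvision.transforms as T

    class SOD360Dataset(Dataset):
        # Loads one equirectangular 360-degree image, its six cube faces and its
        # saliency mask. The directory layout and the cube_converter callable are
        # assumptions for illustration, not part of the patent.
        def __init__(self, root, names, cube_converter):
            self.root = root
            self.names = names              # list of image file names (e.g. the 400 training images)
            self.e2c = cube_converter       # callable: HxWx3 array -> list of six face arrays
            self.to_tensor = T.ToTensor()

        def __len__(self):
            return len(self.names)

        def __getitem__(self, idx):
            name = self.names[idx]
            erp = Image.open(os.path.join(self.root, "images", name)).convert("RGB")
            mask = Image.open(os.path.join(self.root, "masks", name)).convert("L")
            erp = erp.resize((1024, 512))   # equirectangular input size of this embodiment
            mask = mask.resize((1024, 512))
            faces = self.e2c(np.array(erp), face_size=256)               # six 256 x 256 cube faces
            faces_t = torch.stack([self.to_tensor(f) for f in faces])    # (6, 3, 256, 256)
            return self.to_tensor(erp), faces_t, self.to_tensor(mask)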
S1-2, establishing an image conversion module, which generates the corresponding cube-projection images from the equirectangular 360-degree image by using the mapping relationship between the equirectangular projection and the cube projection.
The mapping relationship between the equirectangular projection and the cube projection is expressed as:
q_i = R_{f_i} · p_i
φ_{f_i} = arctan2(q_x, q_z)
θ_{f_i} = arcsin(q_y / ||q_i||)
wherein θ_{f_i} and φ_{f_i} denote the latitude and longitude under the equirectangular projection and (q_x, q_y, q_z) are the x, y and z components of the coordinate q_i.
It will be appreciated that in the projected representation of an equirectangular 360-degree image, the cube projection is typically represented as 6 faces, each face being a square of side length w; the 6 faces are the up, down, front, back, left and right faces. Each face can be seen as an image taken independently by a camera with focal length w/2 (a 90-degree field of view), and the projection centres of the 6 cameras coincide in one point, the centre of the cube. If the origin of the world coordinate system is placed at the cube centre, the external parameters of the 6 cameras are given only by the rotation matrices R_{f_i}, with no translation vector. Given an imaging plane f_i in the camera system, a point p_i on it with three-dimensional coordinates x, y, z is written as p_i = (x, y, z)^T, and its corresponding point q_i on the sphere is obtained from the mapping relationship above.
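As a concrete illustration of this mapping, the sketch below samples the six cube faces from an equirectangular image with NumPy. The particular face rotation matrices and axis conventions are assumptions chosen for the example (the patent only requires that the six camera frames share the cube centre and differ by rotations R_{f_i}); the latitude and longitude formulas are the ones given above.

    import numpy as np

    def face_rotations():
        # One possible convention for the rotation matrices R_fi of the six faces
        # (front, right, back, left, up, down); other orderings work equally well.
        def rot_y(a):
            c, s = np.cos(a), np.sin(a)
            return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
        def rot_x(a):
            c, s = np.cos(a), np.sin(a)
            return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
        return [rot_y(0), rot_y(np.pi / 2), rot_y(np.pi), rot_y(-np.pi / 2),
                rot_x(-np.pi / 2), rot_x(np.pi / 2)]

    def e2c(erp, face_size=256):
        # Sample six cube faces from an equirectangular image (H x W x 3 uint8 array).
        H, W, _ = erp.shape
        w = face_size
        # Pixel grid on one face: p_i = (x, y, w/2), i.e. focal length w/2 (90-degree FOV).
        xs, ys = np.meshgrid(np.arange(w) - w / 2 + 0.5, np.arange(w) - w / 2 + 0.5)
        p = np.stack([xs, ys, np.full_like(xs, w / 2)], axis=-1)
        faces = []
        for R in face_rotations():
            q = p @ R.T                                              # q_i = R_fi · p_i
            lon = np.arctan2(q[..., 0], q[..., 2])                   # phi_fi
            lat = np.arcsin(q[..., 1] / np.linalg.norm(q, axis=-1))  # theta_fi
            # Map latitude/longitude to equirectangular pixel indices (nearest neighbour).
            u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
            v = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
            faces.append(erp[v, u])
        return faces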
S1-3, after reading an equirectangular 360-degree image from the data set, the corresponding cube-projection images are generated with the image conversion module.
S2, constructing a feature pyramid network and performing feature extraction on the equirectangular 360-degree image and the converted cube-projection images to obtain the equirectangular 360-degree image features and the cube-projection features.
Specifically, as shown in FIG. 2, the feature pyramid network is constructed as follows: an FPN is adopted as the backbone network, with its bottom-up path built on ResNet-50.
The ResNet-50-based feature pyramid network acquires features of the image at different levels, and its weights are shared across the inputs.
The image data input to the feature pyramid network comprise an equirectangular 360-degree image and cube-projection images, where an equirectangular 360-degree image together with its corresponding cube-projection images forms one image sample. The feature pyramid network extracts features from the seven input images of each image sample, namely the equirectangular projection image and the six face images of the cube projection (up, down, left, right, front and back), to obtain the equirectangular image features and the cube-projection features.
It should be noted that, because the model is trained with mixed dual-type data, a single sample contains one equirectangular projection image and six cube-projection images, and the module performs feature extraction on all seven images, so that a set of seven groups of features is finally output.
It should also be noted that the feature pyramid network constructed in this embodiment is used for feature extraction and can readily be obtained by those skilled in the art; as shown in FIG. 2, it comprises a ResNet-50 backbone and four output levels with strides of 4, 8, 16 and 32 with respect to the input.
Further, the feature extraction proceeds as follows:
within each independent FPN feature extraction module, the ResNet backbone serves as the feed-forward part: each stage downsamples with stride 2, the features of stages 2-5 participate in prediction, and the outputs of conv2-conv5 (the last residual block of each stage) are taken as FPN features with downsampling factors of 4, 8, 16 and 32 with respect to the input image; in the top-down path each smaller feature map is upsampled to the size of the feature map one level below and fused with the corresponding lateral features, and the per-level feature results F1-F4 are output layer by layer.
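The following sketch shows one way the shared ResNet-50 FPN backbone of step S2 could be instantiated with torchvision (it assumes a torchvision version whose resnet_fpn_backbone helper still accepts the pretrained argument); the tensor sizes follow the 1024 × 512 equirectangular and 256 × 256 cube inputs of this embodiment, and the wiring is an illustrative assumption rather than the exact network of the patent.

    import torch
    from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

    # Bottom-up path built on ResNet-50; lateral and top-down connections built by the FPN helper.
    backbone = resnet_fpn_backbone('resnet50', pretrained=True)

    erp = torch.randn(1, 3, 512, 1024)       # one equirectangular image (H = 512, W = 1024)
    faces = torch.randn(6, 3, 256, 256)      # its six cube faces

    # Weight sharing: the same backbone processes all seven images of a sample.
    erp_feats = backbone(erp)                # dict of feature maps at strides 4, 8, 16, 32 (keys '0'-'3', plus 'pool')
    face_feats = backbone(faces)
    F1 = erp_feats['0']                      # stride-4 feature map, 256 channels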
S3, as shown in FIG. 3, four identical feature aggregation modules are used to output a set of four groups of features; each feature aggregation module converts the cube-projection features into equirectangular features with a feature conversion submodule (C2E feature conversion module), combines them with the features of the equirectangular 360-degree image, and then uses an atrous spatial pyramid pooling submodule (ASPP submodule) to refine the combined features.
the conversion method of the characteristic conversion submodule comprises the following steps: and converting the 6 cube projection features into isometric projection features by utilizing the mapping relation between the isometric image features and the cube projection features.
It should be noted that the mapping relationship between the cube-projection features and the equirectangular-projection features is expressed as:
q_i = R_{f_i} · p_i
φ_{f_i} = arctan2(q_x, q_z)
θ_{f_i} = arcsin(q_y / ||q_i||)
wherein θ_{f_i} and φ_{f_i} denote the latitude and longitude under the equirectangular projection and (q_x, q_y, q_z) are the x, y and z components of the coordinate q_i.
It should be noted that in this embodiment the feature conversion is performed by the C2E feature conversion module, which is a conventional technical means and is therefore not described in further detail; see FIG. 3.
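A minimal sketch of one way such a C2E feature conversion could be realised with torch.nn.functional.grid_sample is given below; it inverts the mapping q_i = R_{f_i} · p_i for every equirectangular pixel, assumes the same face rotation convention as the earlier e2c() sketch, and handles face seams in a deliberately simple way, so it is an illustration rather than the module actually used in the patent.

    import math
    import torch
    import torch.nn.functional as F

    def c2e_features(face_feats, out_h, out_w, rotations):
        # face_feats: (6, C, w, w) cube-face features; returns (1, C, out_h, out_w)
        # equirectangular features by sampling the face that each viewing direction hits.
        device, dtype = face_feats.device, face_feats.dtype
        lon = (torch.arange(out_w, device=device, dtype=dtype) + 0.5) / out_w * 2 * math.pi - math.pi
        lat = (torch.arange(out_h, device=device, dtype=dtype) + 0.5) / out_h * math.pi - math.pi / 2
        lat, lon = torch.meshgrid(lat, lon, indexing='ij')
        # Unit direction q per pixel, consistent with phi = atan2(x, z) and theta = asin(y).
        q = torch.stack([torch.cos(lat) * torch.sin(lon),
                         torch.sin(lat),
                         torch.cos(lat) * torch.cos(lon)], dim=-1)          # (H, W, 3)
        out = torch.zeros(1, face_feats.shape[1], out_h, out_w, device=device, dtype=dtype)
        for i, R in enumerate(rotations):
            R = torch.as_tensor(R, dtype=dtype, device=device)
            p = q @ R                                    # rotate back into the face-i camera frame (R^T q)
            z = p[..., 2].clamp(min=1e-6)
            u, v = p[..., 0] / z, p[..., 1] / z          # perspective divide, in [-1, 1] inside the 90-degree FOV
            inside = (p[..., 2] > 0) & (u.abs() <= 1) & (v.abs() <= 1)
            grid = torch.stack([u, v], dim=-1).unsqueeze(0)                 # grid_sample expects (x, y) in [-1, 1]
            sampled = F.grid_sample(face_feats[i:i + 1], grid, align_corners=False)
            out += sampled * inside.unsqueeze(0).unsqueeze(0)               # seam pixels may be hit by two faces
        return out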
Further, as shown in FIG. 4, the atrous spatial pyramid pooling submodule performs its optimization as follows: the given input is sampled in parallel with atrous convolutions of different sampling rates, the results are concatenated so that the number of channels is expanded, and a 1 × 1 convolution then reduces the number of channels to the expected value, which is equivalent to capturing the image context at multiple scales.
It should be noted that in this embodiment the feature optimization is performed by the atrous spatial pyramid pooling submodule (ASPP submodule), which is a conventional technical means; as shown in FIG. 4, it comprises a 1 × 1 convolution layer, 3 × 3 atrous convolution layers with sampling rates of 6, 12 and 18, a pooling layer and an upsampling layer.
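A compact PyTorch sketch of such an ASPP submodule is shown below, using the sampling rates 6, 12 and 18 named above; the channel counts are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ASPP(nn.Module):
        # Parallel atrous convolutions with different rates plus image-level pooling;
        # the concatenation widens the channels and the final 1x1 convolution shrinks
        # them back to out_ch, capturing context at several scales.
        def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
            super().__init__()
            self.branches = nn.ModuleList(
                [nn.Conv2d(in_ch, out_ch, 1)] +
                [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
            self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
            self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

        def forward(self, x):
            h, w = x.shape[2:]
            feats = [branch(x) for branch in self.branches]
            feats.append(F.interpolate(self.pool(x), size=(h, w),
                                       mode='bilinear', align_corners=False))
            return self.project(torch.cat(feats, dim=1))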
S4, as shown in FIG. 5, the multi-level aggregated features are concatenated and fed to the attention integration module, which infers spatial and channel attention to adaptively select reliable spatial and channel information, fuses it with the original features to obtain the final features, and completes the salient object detection.
It should be noted that in this embodiment the feature fusion is performed by the attention integration module, which is a conventional technical means; as shown in FIG. 5, it comprises three 1 × 1 convolution layers, one 3 × 3 convolution layer, a spatial attention module and a channel attention module. The spatial attention module and the channel attention module are conventional in the art and are therefore not described in further detail in this embodiment.
As shown in FIG. 6, the spatial attention mechanism in this network first reduces each feature map along its channel dimension, concatenates the resulting single-channel maps, and then uses a convolution layer to learn an overall spatial attention map, which is fed back to the four feature levels for integration. The channel attention mechanism applies max pooling and mean pooling simultaneously to the overall four-level feature map, obtains a transformation result through a convolution layer, and finally applies it to all channels to obtain the attention value of each channel.
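The sketch below illustrates spatial and channel attention modules of the kind just described; it assumes the four levels of aggregated features have already been resized to a common resolution, and the exact layer sizes are assumptions for illustration.

    import torch
    import torch.nn as nn

    class SpatialAttention(nn.Module):
        # Reduce each level along its channel axis, concatenate the single-channel maps,
        # learn one overall spatial attention map and feed it back to every level.
        def __init__(self, num_levels=4, kernel_size=7):
            super().__init__()
            self.conv = nn.Conv2d(num_levels, 1, kernel_size, padding=kernel_size // 2)

        def forward(self, feats):                      # list of four (B, C, H, W) tensors
            reduced = [f.mean(dim=1, keepdim=True) for f in feats]
            attn = torch.sigmoid(self.conv(torch.cat(reduced, dim=1)))
            return [f * attn for f in feats]

    class ChannelAttention(nn.Module):
        # Max-pool and mean-pool the aggregated feature map, pass both results through a
        # shared bottleneck and use the sum as a per-channel weight.
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.mlp = nn.Sequential(nn.Conv2d(channels, channels // reduction, 1),
                                     nn.ReLU(inplace=True),
                                     nn.Conv2d(channels // reduction, channels, 1))

        def forward(self, x):                          # (B, C, H, W)
            avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
            mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
            return x * torch.sigmoid(avg + mx)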
In this embodiment the network model is built with the PyTorch framework; the sum of the cross-entropy loss and the mean-absolute-error loss is used as the loss function; the weights of the feature extraction module are initialized from a ResNet-50 model pre-trained on ImageNet, while the newly added convolution layers are initialized with the normal-distribution (Kaiming He) initialization. The model is trained end to end with the stochastic gradient descent (SGD) algorithm: the training batch size is set to 4, the momentum to 0.9, the weight decay to 0.0005, the initial learning rate to 0.002, and training runs for 40 epochs. The model generates a salient object prediction map for a 360-degree image; the prediction map is a grey-scale map with pixel values between 0 and 1, where 1 indicates the region of a salient object and 0 indicates the background.
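The training configuration above translates directly into the following sketch; model and loader stand for the hypothetical network and data loader of this embodiment, the cross-entropy term is interpreted as binary cross-entropy on the saliency map, and the network is assumed to end in a sigmoid so that predictions lie in [0, 1].

    import torch
    import torch.nn as nn

    def init_new_layer(m):
        # Kaiming (He) normal initialisation for the newly added convolution layers.
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, nonlinearity='relu')

    def saliency_loss(pred, target):
        # Sum of the cross-entropy (binary) and mean-absolute-error terms.
        bce = nn.functional.binary_cross_entropy(pred, target)
        mae = torch.abs(pred - target).mean()
        return bce + mae

    def train(model, loader, epochs=40):
        opt = torch.optim.SGD(model.parameters(), lr=0.002,
                              momentum=0.9, weight_decay=0.0005)
        model.train()
        for _ in range(epochs):
            for erp, faces, mask in loader:            # batches of 4 samples
                pred = model(erp, faces)               # saliency map in [0, 1]
                loss = saliency_loss(pred, mask)
                opt.zero_grad()
                loss.backward()
                opt.step()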
As can be seen from FIG. 7, this embodiment improves on existing conventional image salient object detection methods so that it can detect salient objects in equirectangular 360-degree images and obtains a better detection result. The network consists of four main modules: a data processing module (the E2C image conversion module) and three network structure modules (the feature pyramid network, the feature aggregation modules and the attention integration module). The image conversion module converts the equirectangular 360-degree image into cube-projection images and thereby builds the dual-type input data used by the network; taking both data types as input avoids the severe sphere-to-plane projection distortion caused by using equirectangular images alone. The FPN feature extraction module extracts multi-level features from the various input data with shared weights, the feature aggregation modules integrate and refine the multi-level features, and the attention integration module performs the final reliability-weighted selection and screening to obtain a high-quality saliency image. The result is a grey-scale image with pixel values in [0, 1], where 1 marks the region of a salient object in the 360-degree image and 0 marks the background, successfully realizing the salient object detection task for 360-degree images.
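For completeness, a short inference sketch follows; the model(erp, faces) call signature is an assumption carried over from the training sketch, and the output is the grey-scale saliency map described above.

    import torch

    @torch.no_grad()
    def predict(model, erp, faces):
        # erp: (3, 512, 1024) equirectangular tensor; faces: (6, 3, 256, 256) cube faces.
        model.eval()
        saliency = model(erp.unsqueeze(0), faces.unsqueeze(0))
        return saliency.squeeze()        # grey-scale map: 1 = salient region, 0 = background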
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments and their components without departing from the principles and spirit of the invention, and such variants still fall within the scope of the invention.

Claims (9)

1. A 360-degree image salient object detection method based on a convolutional neural network, characterized by comprising the following steps:
S1, image conversion:
S1-1, creating a data set of equirectangular 360-degree images;
S1-2, establishing an image conversion module;
S1-3, after reading an equirectangular 360-degree image in the data set, generating the corresponding cube-projection images with the image conversion module;
S2, constructing a feature pyramid network and performing feature extraction on the equirectangular 360-degree image and the converted cube-projection images to obtain equirectangular 360-degree image features and cube-projection features;
S3, applying four identical feature aggregation modules, in each of which a feature conversion submodule converts the cube-projection features into equirectangular features and combines them with the equirectangular 360-degree image features, after which an atrous spatial pyramid pooling submodule refines the combined features, yielding multi-level aggregated features;
and S4, concatenating the multi-level aggregated features and feeding them to an attention integration module, which infers spatial and channel attention to adaptively select reliable spatial and channel information, fuses it with the original features to obtain the final features, and completes the salient object detection.
2. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 1, characterized in that in step S1-2 the corresponding cube-projection images are generated from the equirectangular 360-degree image by using the mapping relationship between the equirectangular projection and the cube projection.
3. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 2, characterized in that the mapping relationship between the equirectangular projection and the cube projection is expressed as:
q_i = R_{f_i} · p_i
φ_{f_i} = arctan2(q_x, q_z)
θ_{f_i} = arcsin(q_y / ||q_i||)
wherein θ_{f_i} and φ_{f_i} denote the latitude and longitude under the equirectangular projection, (q_x, q_y, q_z) are the x, y and z components of the coordinate q_i, R_{f_i} denotes the rotation matrix of a given imaging plane f_i, p_i = (x, y, z)^T is a point on the known imaging plane f_i, and x, y, z are the three-dimensional coordinates of p_i.
4. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 1, characterized in that the image data input to the feature pyramid network comprise an equirectangular 360-degree image and cube-projection images, where an equirectangular 360-degree image together with its corresponding cube-projection images forms one image sample.
5. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 4, characterized in that the feature pyramid network is constructed as follows: an FPN is adopted as the backbone network, with its bottom-up path built on ResNet-50.
6. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 5, characterized in that in step S2 the feature extraction proceeds as follows:
the feature pyramid network extracts features from the seven input images of each image sample, namely the equirectangular projection image and the six face images of the cube projection (up, down, left, right, front and back), to obtain the equirectangular image features and the cube-projection features;
within each independent FPN feature extraction module, the ResNet backbone serves as the feed-forward part: each stage downsamples with stride 2, the features of stages 2-5 participate in prediction, and the outputs of conv2-conv5 (the last residual block of each stage) are taken as FPN features with downsampling factors of 4, 8, 16 and 32 with respect to the input image; in the top-down path each smaller feature map is upsampled to the size of the feature map one level below and fused with the corresponding lateral features, and the per-level feature results F1-F4 are output layer by layer.
7. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 1, characterized in that in step S3 a set of four groups of features is output by the four identical feature aggregation modules.
8. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 6, characterized in that the feature conversion submodule converts the six cube-projection features into equirectangular-projection features by using the mapping relationship between the cube-projection features and the equirectangular image features.
9. The convolutional-neural-network-based 360-degree image salient object detection method according to claim 8, characterized in that the atrous spatial pyramid pooling submodule performs its optimization as follows: the given input is sampled in parallel with atrous convolutions of different sampling rates, the results are concatenated so that the number of channels is expanded, and a 1 × 1 convolution then reduces the number of channels to the expected value, which is equivalent to capturing the image context at multiple scales.
CN202210586991.0A 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network Active CN114898120B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210586991.0A CN114898120B (en) 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210586991.0A CN114898120B (en) 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN114898120A 2022-08-12
CN114898120B (en) 2023-04-07

Family

ID=82725996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210586991.0A Active CN114898120B (en) 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN114898120B (en)

Citations (6)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827193A (en) * 2019-10-21 2020-02-21 国家广播电视总局广播电视规划院 Panoramic video saliency detection method based on multi-channel features
CN111178163A (en) * 2019-12-12 2020-05-19 宁波大学 Cubic projection format-based stereo panoramic image salient region prediction method
CN112381813A (en) * 2020-11-25 2021-02-19 华南理工大学 Panorama visual saliency detection method based on graph convolution neural network
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN113536977A (en) * 2021-06-28 2021-10-22 杭州电子科技大学 Saliency target detection method facing 360-degree panoramic image
CN114359680A (en) * 2021-12-17 2022-04-15 中国人民解放军海军工程大学 Panoramic vision water surface target detection method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MENGKE HUANG et al.: "FANet: Features Adaptation Network for 360° Omnidirectional Salient Object Detection", IEEE Signal Processing Letters *
CHEN Bin et al.: "Farmland Obstacle Detection in Panoramic Images Based on Improved YOLO v3-tiny", Transactions of the Chinese Society for Agricultural Machinery *

Also Published As

Publication number Publication date
CN114898120B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN112894832B (en) Three-dimensional modeling method, three-dimensional modeling device, electronic equipment and storage medium
CN107945282B (en) Rapid multi-view three-dimensional synthesis and display method and device based on countermeasure network
CN114004941A (en) Indoor scene three-dimensional reconstruction system and method based on nerve radiation field
Zhang et al. A UAV-based panoramic oblique photogrammetry (POP) approach using spherical projection
CN114092780A (en) Three-dimensional target detection method based on point cloud and image data fusion
US20040032407A1 (en) Method and system for simulating stereographic vision
US20220230338A1 (en) Depth image generation method, apparatus, and storage medium and electronic device
TW201618545A (en) Preprocessor for full parallax light field compression
WO2023280038A1 (en) Method for constructing three-dimensional real-scene model, and related apparatus
WO2022151661A1 (en) Three-dimensional reconstruction method and apparatus, device and storage medium
JP2016537901A (en) Light field processing method
US11533431B2 (en) Method and device for generating a panoramic image
CN112330795B (en) Human body three-dimensional reconstruction method and system based on single RGBD image
KR20180053724A (en) Method for encoding bright-field content
CN115035235A (en) Three-dimensional reconstruction method and device
CN116051747A (en) House three-dimensional model reconstruction method, device and medium based on missing point cloud data
CN116778288A (en) Multi-mode fusion target detection system and method
GB2562488A (en) An apparatus, a method and a computer program for video coding and decoding
US10354399B2 (en) Multi-view back-projection to a light-field
CN115527016A (en) Three-dimensional GIS video fusion registration method, system, medium, equipment and terminal
CN117197388A (en) Live-action three-dimensional virtual reality scene construction method and system based on generation of antagonistic neural network and oblique photography
CN114758337A (en) Semantic instance reconstruction method, device, equipment and medium
CN113902802A (en) Visual positioning method and related device, electronic equipment and storage medium
CN114092540A (en) Attention mechanism-based light field depth estimation method and computer readable medium
Neumann et al. Eyes from eyes: analysis of camera design using plenoptic video geometry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant