WO2018023734A1 - Significance testing method for 3d image - Google Patents

Significance testing method for 3d image

Info

Publication number
WO2018023734A1
WO2018023734A1 PCT/CN2016/093637 CN2016093637W WO2018023734A1 WO 2018023734 A1 WO2018023734 A1 WO 2018023734A1 CN 2016093637 W CN2016093637 W CN 2016093637W WO 2018023734 A1 WO2018023734 A1 WO 2018023734A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
saliency
map
depth
color
Prior art date
Application number
PCT/CN2016/093637
Other languages
French (fr)
Chinese (zh)
Inventor
王旭
张秋丹
江健民
赖志辉
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学 filed Critical 深圳大学
Priority to PCT/CN2016/093637 priority Critical patent/WO2018023734A1/en
Priority to CN201680000652.2A priority patent/CN106462771A/en
Publication of WO2018023734A1 publication Critical patent/WO2018023734A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/56Extraction of image or video features relating to colour

Definitions

  • the invention belongs to the technical field of 3D image processing, and more particularly relates to a method for detecting the saliency of a 3D image.
  • 3D applications are becoming more and more popular in our daily lives. Compared to the traditional 2D visual experience, 3D applications can provide users with a deep perception and immersive viewing experience. However, there are still many open issues in the 3D process that need to be well resolved.
  • the saliency detection of 3D images is a very basic problem. Its main purpose is to find, in a natural scene image, the locations of the regions that attract the human eye. Moreover, it can be applied in various fields, for example to optimize bit allocation in 3D video coding, for spatial pooling in stereoscopic image quality assessment, and for feature extraction in 3D object detection.
  • these traditional saliency detection models cannot accurately predict which regions people attend to when viewing a 3D scene.
  • to improve prediction accuracy, modeling the saliency of a stereoscopic image needs to take its depth information into account.
  • Fang et al. proposed a framework to estimate the saliency of a stereoscopic image using the contrast of features such as color, brightness, texture, and depth.
  • the model still uses the traditional method of manually extracting features to extract the underlying features and depth features when calculating stereo image saliency.
  • Qi et al. proposed a 3D visual saliency detection model, mainly to manually extract the depth features from the generated disparity maps and extract the underlying features from the left and right views.
  • Kim et al. describe a saliency prediction model for stereoscopic video, which combines some discrete underlying features, depth feature distributions, and high-level scene classifications.
  • manually extracted features cannot effectively or accurately capture hierarchical features from raw pixels; manual feature extraction involves many uncertainties and can introduce unpredictable errors.
  • hand-crafting features also requires considerable manpower and specialist knowledge, and the resulting features rarely generalize well. The performance of these models is therefore limited.
  • the object of the present invention is to provide a method for detecting the saliency of a 3D image, which aims to solve the problem that prior-art methods based on manually extracted features cannot effectively extract features from raw pixels and therefore produce large errors.
  • the invention provides a method for detecting the saliency of a 3D image, comprising the following steps:
  • step (1) is specifically:
  • the structure of the convolutional neural network model is five convolutional layers and three fully connected layers; a specific parameter configuration is set for each layer of the network: first, the picture input layer, where the input image size is set to 227*227.
  • taking convolution layer one as an example, the convolution kernel size is 11, there are 96 convolution filters, the convolution stride is 4, and 96 feature maps are output.
  • the ReLUs and max-pooling operations are performed after the convolution layer one.
  • the number of neurons in the fully connected layer 1 and layer 2 is 4096
  • the number of neurons in the fully connected layer 3 is 1000.
  • a saliency map of the depth map and the color map is generated according to a neural network (NN) model; the neural network (NN) model has one output layer and two fully connected hidden layers, its input is a feature vector, and its output is the saliency label of the current region.
  • when the saliency label is 1 the current region is salient, and when the saliency label is 0 the current region is non-salient.
  • the saliency map of the depth image is generated by the formula S_d(x) = Σ_{j=1}^{L} w_d^j · f_d(F_d^{j,i}) for x ∈ R_d^{j,i}, where x is a pixel in the region R_d^{j,i} of the depth image, w_d^j is the weighting factor of the j-th layer of the depth map, L is the total number of layers, S_d(x) is the saliency map of the depth image, j is the layer index of the depth map, R_d^{j,i} is the segmented region with index i in layer j of the depth map, and f_d is a mapping function describing the relationship between the feature vector F_d^{j,i} of the local region R_d^{j,i} and the saliency label of that region.
  • the saliency map of the color image is generated by the analogous formula S_c(x) = Σ_{j=1}^{L} w_c^j · f_c(F_c^{j,i}) for x ∈ R_c^{j,i}, where x is a pixel in the region R_c^{j,i} of the color image, w_c^j is the weighting factor of the j-th layer of the color map, L is the total number of layers, j is the layer index of the color map, R_c^{j,i} is the segmented region with index i in layer j of the color map, and f_c is the corresponding mapping function between the feature vector F_c^{j,i} and the saliency label of the region.
  • the total number L of layers is 15 and the weight w is 0.5.
  • the invention performs multi-scale deep-learning feature extraction on color images and depth images based on a Convolutional Neural Network (CNN) model; the saliency map of the depth image (or color image) is generated by the NN model from the deep feature vector and the saliency label of each region.
  • the NN model is equivalent to the role of the classifier.
  • a linear fusion method is used that combines the depth saliency map and the color saliency map to generate the final saliency map of the 3D image; the method has small error and high precision.
  • FIG. 1 is a schematic diagram of the framework of a method for detecting the significance of a 3D image provided by the present invention
  • FIG. 2 is a flow chart of implementing a saliency detection method for a 3D image according to an embodiment of the present invention
  • FIG. 3 is a diagram showing an example of a comparative simulation of a 3D image saliency detection method and a prior art according to an embodiment of the present invention.
  • the saliency detection method for 3D images provided by the present invention can be applied to fields such as video coding, video compression, image retrieval, image quality assessment, and detection of objects of interest.
  • the way it is applied depends mainly on the field of application.
  • the framework of the visual saliency model based on deep-learning features proposed by the present invention comprises three main steps, namely extraction of deep features, generation of saliency maps, and fusion of saliency maps; the framework is illustrated in FIG. 1.
  • the depth feature vectors of the color image and the depth image are extracted by a convolutional neural network (CNN) model.
  • CNN convolutional neural network
  • the saliency maps of the depth map and the color map are generated by the generated region feature vector and the region's saliency label through a three-layer neural network.
  • the saliency map of the 3D image is generated by a linear fusion of the saliency maps of the color image and the depth image.
  • FIG. 2 is a flowchart of a method for detecting a saliency of a 3D image according to an embodiment of the present invention, which specifically includes:
  • the visual attention mechanism contains a hierarchical selection process from coarse to fine. Therefore, we perform multi-level segmentation of the image before feature extraction. Feature extraction is then performed for each of the segmented regions of each layer.
  • for each layer j, the color map I_c and the depth map I_d are divided into non-overlapping region sets, denoted R_c^{j,m_j} and R_d^{j,n_j} respectively, where m_j and n_j are the region indices of layer j, and the layers are ordered from the coarsest to the finest segmentation.
  • for feature extraction, we use a pre-trained convolutional neural network model to extract the features of the depth map and the color map.
  • the model is trained on the ImageNet dataset and has five convolutional layers and three fully connected layers; it is a neural network model for image classification.
  • the saliency of each local region does not depend only on its own characteristics; it is also influenced by the content of its neighborhood and by its background information (that is, the remainder of the image with the region removed). Therefore, for each segmentation layer j of a depth image, we take each local region of that layer, its adjacent area and its background area, and use the CNN model to extract a feature vector for each of them.
  • because the local regions produced by segmentation are irregular in shape, we use the bounding rectangle of each image region as its border, resize each rectangle to 227x227 pixels and feed it into the CNN model.
  • the final output for each region is a 12288-dimensional feature vector, denoted F_d^{j,i}; for a color image the operation is the same as for the depth image, and the feature vector of its local region is denoted F_c^{j,i}.
  • the output feature vector is only a sparse representation of the current local region; to decide whether the region is salient, a mapping function between the feature vector and the saliency label is needed.
  • the feature vector is the input to the neural network, and its output is the saliency label of the current region: a value of 1 indicates that the region is salient, and a value of 0 indicates that it is not.
  • the NN model is trained separately for the color image and the depth image.
  • the mapping functions between the region feature vectors and saliency labels of the depth image and the color image are denoted f_d and f_c respectively; all pixels in the same region share the same saliency label, which is derived from the ground-truth data.
  • the saliency map of the depth image is generated by (1):
  • the saliency map of the color image is generated by (2):
  • x denotes a pixel in the region R_d^{j,i} of the depth image or R_c^{j,i} of the color image, and w_d^j and w_c^j denote the layer weighting factors of the depth image and the color image, respectively.
  • w is the contribution weight used to balance the depth and color saliency maps.
  • w is adjusted by setting it to a value between 0 and 1 and generating the final saliency map for each setting.
  • the w value is then adjusted by a series of evaluation indicators SIM, EMD and CC to determine the accuracy of the generated 3D image saliency map. This behavior is also known as the method of autoregression.
  • a currently widely used center bias mechanism to enhance the final saliency map.
  • the method provided by the present invention is a 3D visual saliency detection model based on deep learning features.
  • the first advantage of this technique is that, instead of traditional hand-crafted features, a deep convolutional neural network is used to extract the feature information of the color map and the depth map.
  • the benefit is that features extracted by the neural network avoid the inaccuracies introduced by human factors in manual extraction, and manual feature extraction is labor-intensive.
  • the second advantage is that, when computing the 3D image saliency map, depth information is considered together with color information, whereas most traditional saliency models are designed for 2D images; the third advantage is that a neural network model (NN) is used as a regressor:
  • from the extracted image-region features and the regions' saliency labels, the saliency values are estimated to generate the saliency maps of the color map and the depth map.
  • the proposed model is compared with existing models, including Li's multi-scale based model (denoted VSMD),
  • a wavelet-domain-based model (denoted SDLL),
  • Fang's 2D saliency model (denoted SSDF2D),
  • and Fang's 3D saliency model (denoted SSDF3D).
  • the three models VSMD, SDLL and SSDF2D mainly address saliency computation for 2D images,
  • while the SSDF3D model is designed for 3D images.
  • the total number L of layers of the model we proposed during the experiment was 15. w is set to 0.5.
  • the CC, EMD, and SIM scores for our proposed model are 0.5225, 2.1547, and 0.4985, respectively.
  • the CC, EMD and SIM scores for the VSMD model are only 0.3783, 2.8419 and 0.3812.
  • the experimental results show that the performance of the 3D image saliency detection model benefits from the fusion of color saliency maps and depth saliency maps.
  • the present invention proposes a visual saliency detection model of a 3D image based on deep learning features.
  • the CNN model is used to perform multi-scale deep-learning feature extraction on the color images and depth images, respectively.
  • the saliency map of the depth image is generated by the NN model based on the depth feature vector and the saliency label of the region, which is here equivalent to the role of the classifier.
  • we use a linear fusion method that combines a depth saliency map with a color saliency map to generate a saliency map of the final 3D image.
  • a central biasing mechanism to enhance the saliency map.
  • Our proposed model achieves excellent performance on these two publicly available data sets.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed is a saliency detection method for a 3D image, comprising the steps of: (1) respectively extracting deep feature vectors of a color image and a depth image on the basis of a convolutional neural network; (2) respectively generating saliency maps of the depth image and the color image according to a three-layer neural network and the extracted deep feature vectors of the color image and the depth image; (3) linearly fusing the saliency maps of the color image and the depth image to obtain a saliency map of the 3D image. According to the present invention, deep-learning features of a color image and a depth image are extracted over multi-scale regions on the basis of a CNN model; a saliency map of the depth image (or the color image) is generated using a trained NN model on the basis of the deep feature vector and the saliency label of each region, the NN model acting as a classifier in this case; with the depth saliency map and the color saliency map as input, a final saliency map of the 3D image is generated using a linear fusion method. The present detection method has the advantages of small error and high precision.

Description

A saliency detection method for 3D images
Technical field
The invention belongs to the technical field of 3D image processing, and more particularly relates to a saliency detection method for 3D images.
Background
With the continuous development of the consumer electronics industry, 3D applications are becoming more and more popular in our daily lives. Compared with the traditional 2D visual experience, 3D applications can provide users with depth perception and an immersive viewing experience. However, many open problems in 3D processing still need to be properly solved. In 3D research, saliency detection for 3D images is a very basic problem; its main purpose is to find, in a natural scene image, the locations of the regions that attract the human eye. It can be applied in various fields, for example to optimize bit allocation in 3D video coding, for spatial pooling in stereoscopic image quality assessment, and for feature extraction in 3D object detection.
Most existing visual saliency detection models are concerned with 2D images. These models mainly estimate saliency by manually extracting low-level features (such as brightness, color, contrast and texture) from the color image, and they do not consider depth information. For example, Itti et al. proposed a saliency model for rapid scene analysis that combines image features at multiple scales to estimate saliency. Bruce et al. introduced a saliency method based on information maximization, which applies Shannon's self-information theory to saliency estimation. Goferman et al. designed a context-aware saliency detection model intended to detect image regions that are representative of the scene. Yang et al. proposed a top-down visual saliency model that incorporates conditional random fields and a discriminative dictionary. However, these methods essentially perform saliency detection on 2D images.
Therefore, these traditional saliency detection models cannot accurately predict which regions people attend to when viewing a 3D scene. To improve prediction accuracy, some researchers have proposed that modeling the saliency of a stereoscopic image should take its depth information into account. For example, Fang et al. proposed a framework that estimates the saliency of a stereoscopic image from the contrast of features such as color, brightness, texture and depth; when computing stereoscopic saliency, this model still uses traditional manually extracted low-level and depth features. Qi et al. proposed a 3D visual saliency detection model that manually extracts depth features from generated disparity maps and low-level features from the left and right views. Kim et al. described a saliency prediction model for stereoscopic video that combines a set of discrete low-level features, depth feature distributions and high-level scene classification. For these studies, however, manually extracted features cannot effectively and accurately capture hierarchical features from raw pixels; manual feature extraction involves many uncertainties, can introduce unpredictable errors, requires considerable manpower and specialist knowledge, and the resulting features rarely generalize well. The performance of these models is therefore limited.
Summary of the invention
In view of the defects of the prior art, the object of the present invention is to provide a saliency detection method for 3D images, aiming to solve the problem that prior-art methods based on manually extracted features cannot effectively extract features from raw pixels and therefore produce large errors.
The invention provides a saliency detection method for a 3D image, comprising the following steps:
(1) extracting deep feature vectors of a color image and a depth image;
(2) generating saliency maps of the depth map and the color map according to a three-layer neural network and the extracted deep feature vectors of the color image and the depth image;
(3) linearly fusing the saliency maps of the color image and the depth image to obtain a saliency map of the 3D image.
Further, step (1) is specifically:
(1.1) performing image segmentation on the color image and on the depth image associated with the color image, respectively, to obtain multi-level, non-overlapping image regions;
(1.2) extracting the feature vectors of the segmented color image and depth image with a convolutional neural network model.
Further, the structure of the convolutional neural network model is five convolutional layers and three fully connected layers, and a specific parameter configuration is set for each layer of the network. First is the picture input layer, where the input image size is set to 227*227. Taking convolution layer one as an example, the convolution kernel size is 11, there are 96 convolution filters, the convolution stride is 4, and 96 feature maps are output. ReLU and max-pooling operations are performed after convolution layer one. Finally there are three fully connected layers, which serve as the classifier of the neural network. Fully connected layers one and two each have 4096 neurons, and fully connected layer three has 1000 neurons.
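For illustration, the following is a minimal PyTorch sketch of a network with this layer configuration (227*227 input, an 11x11 first-layer kernel with stride 4 and 96 filters, ReLU and max-pooling after the first convolution, and fully connected layers of 4096, 4096 and 1000 neurons). Only those quantities are given in the text; the filter counts of the remaining convolutional layers follow the standard AlexNet configuration and are assumptions made purely for illustration.

    import torch
    import torch.nn as nn

    class FeatureCNN(nn.Module):
        """Five convolutional layers followed by three fully connected layers.

        Only conv1 (11x11 kernel, 96 filters, stride 4), the 227x227 input and
        the fully connected sizes (4096, 4096, 1000) are specified in the text;
        the remaining convolutional layers are assumed AlexNet-style.
        """
        def __init__(self, num_classes: int = 1000):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 96, kernel_size=11, stride=4),     # conv1: 96 filters, stride 4
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),          # ReLU + max-pooling after conv1
                nn.Conv2d(96, 256, kernel_size=5, padding=2),   # conv2 (assumed)
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(256, 384, kernel_size=3, padding=1),  # conv3 (assumed)
                nn.ReLU(inplace=True),
                nn.Conv2d(384, 384, kernel_size=3, padding=1),  # conv4 (assumed)
                nn.ReLU(inplace=True),
                nn.Conv2d(384, 256, kernel_size=3, padding=1),  # conv5 (assumed)
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(256 * 6 * 6, 4096),   # fully connected layer 1: 4096 neurons
                nn.ReLU(inplace=True),
                nn.Linear(4096, 4096),          # fully connected layer 2: 4096 neurons
                nn.ReLU(inplace=True),
                nn.Linear(4096, num_classes),   # fully connected layer 3: 1000 neurons
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.features(x))

    x = torch.randn(1, 3, 227, 227)     # one 227x227 RGB crop, as set for the input layer
    print(FeatureCNN()(x).shape)        # torch.Size([1, 1000])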
Further, in step (2), the saliency maps of the depth map and the color map are generated according to a neural network (NN) model; the neural network (NN) model has one output layer and two fully connected hidden layers, its input is a feature vector, and its output is the saliency label of the current region. A saliency label of 1 indicates that the current region is salient, and a saliency label of 0 indicates that it is not.
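A minimal sketch of such a classifier, assuming the 12288-dimensional region feature vectors described later as input; the hidden-layer width used below is an assumption, since the text fixes only the overall structure and the binary output.

    import torch
    import torch.nn as nn

    class SaliencyNN(nn.Module):
        """Two fully connected hidden layers plus one output layer.

        The hidden width (1024) is an assumption for illustration; the text
        fixes only the overall structure and the binary (0/1) output.
        """
        def __init__(self, in_dim: int = 12288, hidden: int = 1024):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),   # hidden layer 1
                nn.Linear(hidden, hidden), nn.ReLU(inplace=True),   # hidden layer 2
                nn.Linear(hidden, 1),                               # output layer
            )

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # Sigmoid maps the output to (0, 1); thresholding at 0.5 yields the
            # binary saliency label (1 = salient, 0 = non-salient).
            return torch.sigmoid(self.net(feats))

    feats = torch.randn(4, 12288)                  # feature vectors of four regions
    labels = (SaliencyNN()(feats) > 0.5).float()   # predicted saliency labels
    print(labels.shape)                            # torch.Size([4, 1])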
Further, the saliency map of the depth image is generated by the formula

    S_d(x) = Σ_{j=1}^{L} w_d^j · f_d(F_d^{j,i}),  x ∈ R_d^{j,i}

where x denotes a pixel in the region R_d^{j,i} of the depth image, w_d^j denotes the weighting factor of the j-th layer of the depth map, L denotes the total number of layers, S_d(x) denotes the saliency map of the depth image, j denotes the layer index of the depth map, R_d^{j,i} denotes the segmented region with index i in layer j of the depth map, and f_d denotes a mapping function that describes the relationship between the feature vector F_d^{j,i} of the local region R_d^{j,i} of the depth map and the saliency label of that region.
Further, the saliency map of the color image is generated by the formula

    S_c(x) = Σ_{j=1}^{L} w_c^j · f_c(F_c^{j,i}),  x ∈ R_c^{j,i}

where x denotes a pixel in the region R_c^{j,i} of the color image, w_c^j denotes the weighting factor of the j-th layer of the color map, L denotes the total number of layers, j denotes the layer index of the color map, R_c^{j,i} denotes the segmented region with index i in layer j of the color map, and f_c denotes a mapping function that describes the relationship between the feature vector F_c^{j,i} of the local region R_c^{j,i} of the color map and the saliency label of that region.
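A minimal sketch of how the per-region mapping-function outputs can be accumulated into pixel-wise saliency maps according to the two formulas above; the uniform layer weights used in the toy example are an assumption.

    import numpy as np

    def layered_saliency(region_maps, region_scores, layer_weights):
        """Accumulate per-region scores into a pixel-wise saliency map.

        region_maps[j]   : integer label image of layer j (region index per pixel)
        region_scores[j] : dict mapping a region index i of layer j to the value
                           f(F^{j,i}) returned by the NN model for that region
        layer_weights[j] : weighting factor w^j of layer j
        """
        saliency = np.zeros(region_maps[0].shape, dtype=np.float64)
        for labels, scores, wj in zip(region_maps, region_scores, layer_weights):
            for i, s in scores.items():
                saliency[labels == i] += wj * s   # every pixel x in R^{j,i} receives w^j * f(F^{j,i})
        return saliency

    # Toy example with L = 2 layers on a 4x4 image and uniform layer weights.
    coarse = np.array([[0, 0, 1, 1]] * 4)
    fine = np.kron(np.arange(4).reshape(2, 2), np.ones((2, 2), dtype=int))
    scores = [{0: 0.0, 1: 1.0}, {0: 0.0, 1: 1.0, 2: 0.0, 3: 1.0}]
    print(layered_saliency([coarse, fine], scores, layer_weights=[0.5, 0.5]))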
Further, the saliency map of the 3D image is S = w·S_c + (1-w)·S_d, where S_d is the saliency map of the depth image, S_c is the saliency map of the color image, and w is the contribution weight of the color saliency map in the final visual saliency map of the 3D image.
Further, the total number of layers L is 15 and the weight w is 0.5.
The invention performs multi-scale deep-learning feature extraction on the color image and the depth image based on a Convolutional Neural Network (CNN) model; the saliency map of the depth image (or the color image) is generated by the NN model from the deep feature vectors and the saliency labels of the regions, the NN model here playing the role of a classifier; and a linear fusion method is used that combines the depth saliency map and the color saliency map to generate the final saliency map of the 3D image. The method has small error and high precision.
Brief description of the drawings
FIG. 1 is a schematic diagram of the framework of the saliency detection method for 3D images provided by the present invention;
FIG. 2 is an implementation flowchart of the saliency detection method for 3D images according to an embodiment of the present invention;
FIG. 3 shows example results of a comparative simulation between the saliency detection method for 3D images according to an embodiment of the present invention and the prior art.
Detailed description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are merely illustrative of the invention and are not intended to limit it.
As can be seen from the above, the performance of a model that computes a visual saliency map largely depends on how representative its features are, so for 3D visual saliency research it is important to find representative visual features. Existing saliency detection models for 3D images are essentially based on manually extracted features; however, such approaches find it difficult to distinguish a salient region from its neighborhood to a sufficient degree. In addition, because knowledge of 3D visual perception is still limited, it remains unclear how depth information contributes to the final visual saliency map.
The saliency detection method for 3D images provided by the present invention can be applied to fields such as video coding, video compression, image retrieval, image quality assessment, and detection of objects of interest; the way it is applied depends mainly on the field of application.
The framework of the visual saliency model based on deep-learning features proposed by the present invention comprises three main steps, namely extraction of deep features, generation of saliency maps, and fusion of saliency maps; the framework is illustrated in FIG. 1. First, the deep feature vectors of the color image and the depth image are extracted by a convolutional neural network (CNN) model. Then, the saliency maps of the depth map and the color map are generated from the region feature vectors and the regions' saliency labels through a three-layer neural network. Finally, the saliency map of the 3D image is generated by linearly fusing the saliency maps of the color image and the depth image. A high-level sketch of this pipeline is given below.
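Putting the three steps together, the following minimal Python sketch illustrates how the pipeline could be organized; the parameters segment_multilevel, extract_features and predict_region_saliency are placeholders for the components described in the following sections, and the uniform layer weighting is an assumption.

    import numpy as np

    def saliency_3d(color_image, depth_image, segment_multilevel,
                    extract_features, predict_region_saliency,
                    w=0.5, num_levels=15):
        """End-to-end sketch of the three-step framework.

        segment_multilevel, extract_features and predict_region_saliency stand
        in for the multi-level segmentation, the CNN feature extractor and the
        NN classifier described below; uniform layer weights are assumed.
        """
        maps = {}
        for name, img in (("color", color_image), ("depth", depth_image)):
            sal = np.zeros(img.shape[:2])
            for labels in segment_multilevel(img, num_levels):     # step 1a: multi-level segmentation
                for i in np.unique(labels):
                    feat = extract_features(img, labels == i)      # step 1b: deep feature per region
                    sal[labels == i] += predict_region_saliency(feat) / num_levels  # step 2
            maps[name] = sal
        return w * maps["color"] + (1 - w) * maps["depth"]         # step 3: linear fusion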
FIG. 2 shows the flow of the saliency detection method for 3D images according to an embodiment of the present invention, which specifically includes:
(1) Deep-learning feature extraction
Based on theoretical knowledge of the human visual system, the visual attention mechanism involves a coarse-to-fine hierarchical selection process. Therefore, before feature extraction we first segment the image at multiple levels, and then extract features for every segmented region of every level.
A. Multi-level image segmentation
In our work we focus on the depth-map-based 3D image format, in which every color image is associated with a depth map. For each 3D image, we decompose the color map and its associated depth map into multi-level, non-overlapping image regions. For convenience, we assume that the total number of levels is L. For each layer j, the non-overlapping region sets into which the color map I_c and the depth map I_d are divided are denoted R_c^{j,m_j} and R_d^{j,n_j} respectively, where m_j and n_j are the region indices of layer j, and the layers are ordered from the coarsest to the finest segmentation.
B. CNN-based feature extraction
Because depth acquisition technology is still limited, the amount of publicly available benchmark data in the field of 3D saliency detection is not large. It is therefore difficult to rely on these available datasets to train an accurate CNN model from scratch, i.e., a network trained on the color maps, depth maps and saliency maps of 3D images to output 3D saliency maps. For feature extraction we instead use a pre-trained convolutional neural network to extract the features of the depth map and the color map. The model is trained on the ImageNet dataset, has five convolutional layers and three fully connected layers, and is a neural network model for image classification.
As is well known, the saliency of each local region does not depend only on its own characteristics; it is also influenced by the content of its neighborhood and by its background information (that is, the remainder of the image with the region removed). Therefore, for each segmentation layer j of a depth image, we take each local region R_d^{j,i} of that layer, its adjacent area and its background area, and use the CNN model to extract a feature vector for each of them. Because the local regions produced by segmentation are irregular in shape, we use the bounding rectangle of each image region as its border, resize each rectangle to 227x227 pixels and feed it into the CNN model. The final output for each region is a 12288-dimensional feature vector, denoted F_d^{j,i}. For a color image the operation is the same as for the depth image, and the feature vector of its local region R_c^{j,i} is denoted F_c^{j,i} (an illustrative sketch of this step is given further below).
(2) Generation of the saliency maps
The output feature vector is only a sparse representation of the current local region. To determine whether the current region is salient or not, we need a mapping function between the feature vector and the saliency label. We train a neural network (NN) model with one output layer and two fully connected hidden layers; the feature vector is the input of this network, and its output is the saliency label of the current region. A label of 1 indicates that the current region is salient, and 0 indicates that it is not. The NN model is trained separately for the color image and the depth image. The mapping functions between the region feature vectors and saliency labels of the depth image and the color image are denoted f_d and f_c respectively. All pixels in the same region share the same saliency label, which is derived from the ground-truth data. Finally, the saliency map of the depth image is generated by (1):

    S_d(x) = Σ_{j=1}^{L} w_d^j · f_d(F_d^{j,i}),  x ∈ R_d^{j,i}    ......(1)

The saliency map of the color image is generated by (2):

    S_c(x) = Σ_{j=1}^{L} w_c^j · f_c(F_c^{j,i}),  x ∈ R_c^{j,i}    ......(2)

where x denotes a pixel in the region R_d^{j,i} of the depth image or R_c^{j,i} of the color image, and w_d^j and w_c^j denote the layer weighting factors of the depth image and the color image, respectively.
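As a concrete illustration of the feature-extraction step of subsection B, the following sketch crops the bounding rectangle of a region, an enlarged neighborhood and the whole image (used as a simple stand-in for the background context), resizes each crop to 227x227 and concatenates three 4096-dimensional descriptors into one 12288-dimensional vector. The 50% enlargement of the neighborhood and the whole-image background proxy are assumptions, and fake_cnn below merely stands in for the pre-trained CNN descriptor.

    import torch
    import torch.nn.functional as F

    def region_crops(image, region_mask):
        """Return 227x227 crops of a region, its neighborhood and the background.

        image       : float tensor of shape (3, H, W)
        region_mask : bool tensor of shape (H, W), True inside the region
        """
        ys, xs = torch.nonzero(region_mask, as_tuple=True)
        y0, y1 = int(ys.min()), int(ys.max()) + 1
        x0, x1 = int(xs.min()), int(xs.max()) + 1
        h, w = image.shape[1:]

        def crop_resize(t, b, l, r):
            crop = image[:, t:b, l:r].unsqueeze(0)
            return F.interpolate(crop, size=(227, 227), mode="bilinear", align_corners=False)

        dy, dx = (y1 - y0) // 2, (x1 - x0) // 2
        region = crop_resize(y0, y1, x0, x1)                        # bounding rectangle of the region
        neighborhood = crop_resize(max(0, y0 - dy), min(h, y1 + dy),
                                   max(0, x0 - dx), min(w, x1 + dx))  # enlarged box (assumed ~50% per side)
        background = crop_resize(0, h, 0, w)                        # whole image as background proxy
        return region, neighborhood, background

    def region_feature(image, region_mask, cnn_descriptor):
        """Concatenate three 4096-d descriptors into one 12288-d region feature."""
        return torch.cat([cnn_descriptor(c).flatten() for c in region_crops(image, region_mask)])

    # Toy usage: a random network standing in for the pre-trained CNN descriptor.
    fake_cnn = torch.nn.Sequential(torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
                                   torch.nn.Linear(3, 4096))
    img = torch.rand(3, 240, 320)
    mask = torch.zeros(240, 320, dtype=torch.bool)
    mask[60:120, 80:200] = True
    print(region_feature(img, mask, fake_cnn).shape)   # torch.Size([12288])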
(3) Fusion and enhancement of the saliency maps
In order to obtain an accurate visual saliency map of the 3D image, it is necessary to fuse the depth saliency map and the color saliency map. After the saliency map generation step we have obtained the saliency maps of the depth map and the color map, denoted S_d and S_c respectively, and the final saliency map of the 3D image is generated by a linear fusion method, computed as follows:
    S = w·S_c + (1-w)·S_d    ......(3)
Here w is the contribution weight used to balance the depth and color saliency maps. It is adjusted by setting w to values between 0 and 1, generating the final saliency map for each setting, and judging the accuracy of the resulting 3D saliency map with a series of evaluation metrics (SIM, EMD and CC); this procedure is also referred to as an autoregression-style tuning. In addition, to further improve the performance of the model, we adopt a widely used center-bias mechanism to enhance the final saliency map, as sketched below.
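For illustration, a minimal sketch of the linear fusion of formula (3) followed by a simple center-bias enhancement; the Gaussian form and width of the center prior are assumptions, since the text only refers to a widely used center-bias mechanism.

    import numpy as np

    def fuse_and_center_bias(s_color, s_depth, w=0.5, sigma_ratio=0.25):
        """S = w * Sc + (1 - w) * Sd, followed by a Gaussian center prior."""
        s = w * s_color + (1.0 - w) * s_depth                 # formula (3)
        h, wd = s.shape
        ys, xs = np.mgrid[0:h, 0:wd]
        cy, cx = (h - 1) / 2.0, (wd - 1) / 2.0
        sy, sx = sigma_ratio * h, sigma_ratio * wd
        center = np.exp(-(((ys - cy) ** 2) / (2 * sy ** 2) + ((xs - cx) ** 2) / (2 * sx ** 2)))
        s = s * center                                        # center-bias enhancement (Gaussian form assumed)
        return (s - s.min()) / (s.max() - s.min() + 1e-12)    # normalize to [0, 1]

    s_c = np.random.rand(120, 160)   # color saliency map Sc
    s_d = np.random.rand(120, 160)   # depth saliency map Sd
    print(fuse_and_center_bias(s_c, s_d, w=0.5).shape)        # (120, 160)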
The method provided by the present invention is a 3D visual saliency detection model based on deep-learning features. Its first advantage is that, instead of traditional hand-crafted features, a deep convolutional neural network is used to extract the feature information of the color map and the depth map; features extracted by the neural network avoid the inaccuracies introduced by human factors in manual extraction, and manual feature extraction is labor-intensive. The second advantage is that, when computing the 3D saliency map, depth information is considered together with color information, whereas most traditional saliency models are designed for 2D images. The third advantage is that a neural network model (NN) is used as a regressor: from the extracted image-region features and the regions' saliency labels, the saliency values are estimated to generate the saliency maps of the color map and the depth map.
In embodiments of the present invention there are many possible image segmentation methods, such as region growing and pixel clustering, and many convolutional neural network models can be used for feature extraction, such as GoogleNet.
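As one possible illustration of the multi-level segmentation step, the sketch below uses scikit-image's Felzenszwalb graph-based segmentation at decreasing scale values to obtain L = 15 label maps from coarse to fine; the choice of algorithm and scale schedule is an assumption, since the text leaves the segmentation method open.

    import numpy as np
    from skimage.segmentation import felzenszwalb

    def multilevel_segmentation(image, num_levels=15):
        """Segment an image into num_levels label maps, from coarsest to finest.

        Felzenszwalb segmentation is used purely for illustration; larger
        'scale' values give coarser segmentations, so the scales decrease
        over the levels.
        """
        scales = np.geomspace(500.0, 10.0, num_levels)
        return [felzenszwalb(image, scale=float(s), sigma=0.8, min_size=20) for s in scales]

    image = np.random.rand(120, 160, 3)                 # stand-in for a color or depth map
    levels = multilevel_segmentation(image, num_levels=15)
    print(len(levels), levels[0].max() + 1, levels[-1].max() + 1)   # 15 levels; regions at coarsest/finest level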
The proposed model is compared with existing models: Li's multi-scale-based model (denoted VSMD), a wavelet-domain-based model (denoted SDLL), Fang's 2D saliency model (denoted SSDF2D) and Fang's 3D saliency model (denoted SSDF3D) are used as baseline models. The three models VSMD, SDLL and SSDF2D mainly address saliency computation for 2D images, while the SSDF3D model is designed for 3D images. In the experiments, the total number of layers L of our proposed model is 15 and w is set to 0.5.
To validate the performance of the 3D visual saliency detection model, we tested these models on two widely used public datasets, the NUS3D-saliency dataset and the NCTU-3DFixation dataset. Three evaluation criteria were used in the experiments: the Pearson Correlation Coefficient (CC), the Earth Mover's Distance (EMD) and the Similarity score (SIM). A good model should have high CC and SIM scores but a low EMD score. As shown in Table 1, the proposed model achieves better performance than the other 2D saliency models on both the NCTU and NUS datasets. For example, the CC, EMD and SIM scores of our proposed model are 0.5225, 2.1547 and 0.4985 respectively, whereas the CC, EMD and SIM scores of the VSMD model are only 0.3783, 2.8419 and 0.3812. These experimental results show that the performance of the 3D image saliency detection model benefits from fusing the color saliency map and the depth saliency map.
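For reference, minimal implementations of two of the evaluation criteria used here, CC and SIM, computed between a predicted saliency map and a ground-truth fixation density map; EMD is omitted because it requires an optimal-transport solver.

    import numpy as np

    def cc(pred, gt):
        """Pearson linear correlation coefficient between two saliency maps."""
        p = (pred - pred.mean()) / (pred.std() + 1e-12)
        g = (gt - gt.mean()) / (gt.std() + 1e-12)
        return float((p * g).mean())

    def sim(pred, gt):
        """Similarity score: sum of per-pixel minima of the two normalized maps."""
        p = pred / (pred.sum() + 1e-12)
        g = gt / (gt.sum() + 1e-12)
        return float(np.minimum(p, g).sum())

    pred = np.random.rand(120, 160)   # predicted 3D saliency map
    gt = np.random.rand(120, 160)     # ground-truth fixation density map
    print(cc(pred, gt), sim(pred, gt))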
Table 1. Performance of the models on the two datasets under the CC, EMD and SIM criteria.
The results of the performance comparison between the proposed 3D model and the SSDF3D model on the two datasets are also given in Table 1. It can be seen that on the NUS dataset the CC, EMD and SIM scores of our proposed model are better than those of the SSDF3D model. On the NCTU dataset the CC score of our model is slightly lower than that of the SSDF3D model, but its EMD and SIM scores are still better. For further illustration, some detection samples of the model are shown in the figure, from which it can also be seen that our proposed model achieves the best performance.
As shown in FIG. 3, the four columns from left to right are four sample images from the NUS dataset. From the second row to the last row, the results are given in the order SSDF2D model, SDLL model, VSMD model, SSDF3D model, and our proposed model. It is evident from the figure that the results of our model are visually superior to those of the other models: the salient regions of the original images are clearly detected and relatively sharp, which again shows that our proposed model achieves the best performance.
The present invention proposes a visual saliency detection model for 3D images based on deep-learning features. There are three key elements in our method. First, we use the CNN model to perform multi-scale deep-learning feature extraction on the color images and depth images, respectively. Second, the saliency map of the depth image (or color image) is generated by the NN model from the deep feature vectors and the saliency labels of the regions, the NN model here playing the role of a classifier. Finally, we use a linear fusion method that combines the depth saliency map and the color saliency map to generate the final saliency map of the 3D image, and we also adopt a center-bias mechanism to enhance the saliency map. Our proposed model achieves excellent performance on the two publicly available datasets.
It will be readily understood by those skilled in the art that the above is only a preferred embodiment of the present invention and is not intended to limit it; any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (8)

  1. A saliency detection method for a 3D image, characterized by comprising the following steps:
    (1) extracting deep feature vectors of a color image and a depth image;
    (2) generating saliency maps of the depth map and the color map according to a three-layer neural network and the extracted deep feature vectors of the color image and the depth image;
    (3) linearly fusing the saliency maps of the color image and the depth image to obtain a saliency map of the 3D image.
  2. The saliency detection method according to claim 1, characterized in that step (1) is specifically:
    (1.1) performing image segmentation on the color image and on the depth image associated with the color image, respectively, to obtain multi-level, non-overlapping image regions;
    (1.2) extracting the feature vectors of the segmented color image and depth image with a convolutional neural network model.
  3. The saliency detection method according to claim 2, characterized in that the structure of the convolutional neural network model is five convolutional layers and three fully connected layers, and a different network parameter configuration is set for each layer of the network.
  4. The saliency detection method according to claim 1, characterized in that in step (2) the saliency maps of the depth map and the color map are generated according to a neural network model;
    wherein the neural network model has one output layer and two fully connected hidden layers, the input of the neural network model is a feature vector, and the output is the saliency label of the current region; when the saliency label is 1 the current region is salient, and when the saliency label is 0 the current region is non-salient.
  5. The saliency detection method according to claim 4, characterized in that the saliency map of the depth image is generated by the formula
    S_d(x) = Σ_{j=1}^{L} w_d^j · f_d(F_d^{j,i}),  x ∈ R_d^{j,i},
    where x denotes a pixel in the region R_d^{j,i} of the depth image, w_d^j denotes the weighting factor of the j-th layer of the depth map, L denotes the total number of layers, S_d(x) denotes the saliency map of the depth image, j denotes the layer index of the depth map, R_d^{j,i} denotes the segmented region with index i in layer j of the depth map, and f_d denotes a mapping function that describes the relationship between the feature vector F_d^{j,i} of the local region R_d^{j,i} of the depth map and the saliency label of that region.
  6. The saliency detection method according to claim 4, characterized in that the saliency map of the color image is generated by the formula
    S_c(x) = Σ_{j=1}^{L} w_c^j · f_c(F_c^{j,i}),  x ∈ R_c^{j,i},
    where x denotes a pixel in the region R_c^{j,i} of the color image, w_c^j denotes the weighting factor of the j-th layer of the color map, L denotes the total number of layers, j denotes the layer index of the color map, R_c^{j,i} denotes the segmented region with index i in layer j of the color map, and f_c denotes a mapping function that describes the relationship between the feature vector F_c^{j,i} of the local region R_c^{j,i} of the color map and the saliency label of that region.
  7. The saliency detection method according to any one of claims 1 to 6, characterized in that the saliency map of the 3D image is S = w·S_c + (1-w)·S_d, where S_d is the saliency map of the depth image, S_c is the saliency map of the color image, and w is the contribution weight of the color saliency map in the final visual saliency map of the 3D image.
  8. The saliency detection method according to claim 7, characterized in that the total number of layers L is 15 and the weight w is 0.5.
PCT/CN2016/093637 2016-08-05 2016-08-05 Significance testing method for 3d image WO2018023734A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2016/093637 WO2018023734A1 (en) 2016-08-05 2016-08-05 Significance testing method for 3d image
CN201680000652.2A CN106462771A (en) 2016-08-05 2016-08-05 3D image significance detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/093637 WO2018023734A1 (en) 2016-08-05 2016-08-05 Significance testing method for 3d image

Publications (1)

Publication Number Publication Date
WO2018023734A1 true WO2018023734A1 (en) 2018-02-08

Family

ID=58215885

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/093637 WO2018023734A1 (en) 2016-08-05 2016-08-05 Significance testing method for 3d image

Country Status (2)

Country Link
CN (1) CN106462771A (en)
WO (1) WO2018023734A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097115A (en) * 2019-04-28 2019-08-06 南开大学 A kind of saliency object detecting method based on attention metastasis
CN111275642A (en) * 2020-01-16 2020-06-12 西安交通大学 Low-illumination image enhancement method based on significant foreground content
WO2020119624A1 (en) * 2018-12-12 2020-06-18 中国科学院深圳先进技术研究院 Class-sensitive edge detection method based on deep learning
CN111583171A (en) * 2020-02-19 2020-08-25 西安工程大学 Insulator defect detection method integrating foreground compact characteristic and multi-environment information
CN111914850A (en) * 2019-05-07 2020-11-10 百度在线网络技术(北京)有限公司 Picture feature extraction method, device, server and medium
CN112488122A (en) * 2020-11-25 2021-03-12 南京航空航天大学 Panoramic image visual saliency prediction method based on convolutional neural network
CN112990226A (en) * 2019-12-16 2021-06-18 中国科学院沈阳计算技术研究所有限公司 Salient object detection method based on machine learning
US11729407B2 (en) 2018-10-29 2023-08-15 University Of Washington Saliency-based video compression systems and methods

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106993186B (en) * 2017-04-13 2019-04-30 宁波大学 A kind of stereo-picture conspicuousness detection method
CN107423747B (en) * 2017-04-13 2019-09-20 中国人民解放军国防科学技术大学 A kind of conspicuousness object detection method based on depth convolutional network
CN107403430B (en) * 2017-06-15 2020-08-07 中山大学 RGBD image semantic segmentation method
CN107240107B (en) * 2017-06-30 2019-08-09 福州大学 A kind of first appraisal procedure of conspicuousness detection based on image retrieval
CN109257592B (en) * 2017-07-12 2020-09-01 天津大学 Stereoscopic video quality objective evaluation method based on deep learning
CN107886533B (en) * 2017-10-26 2021-05-04 深圳大学 Method, device and equipment for detecting visual saliency of three-dimensional image and storage medium
CN109960979A (en) * 2017-12-25 2019-07-02 大连楼兰科技股份有限公司 Vehicle checking method based on image layered technology
CN108345892B (en) * 2018-01-03 2022-02-22 深圳大学 Method, device and equipment for detecting significance of stereo image and storage medium
CN108090468B (en) * 2018-01-05 2019-05-03 百度在线网络技术(北京)有限公司 Method and apparatus for detecting face
CN108154147A (en) * 2018-01-15 2018-06-12 中国人民解放军陆军装甲兵学院 The region of interest area detecting method of view-based access control model attention model
CN108460348B (en) * 2018-02-12 2022-04-22 杭州电子科技大学 Road target detection method based on three-dimensional model
CN108846416A (en) * 2018-05-23 2018-11-20 北京市新技术应用研究所 Extraction and processing method and system for specific images
WO2020077604A1 (en) * 2018-10-19 2020-04-23 深圳大学 Image semantic segmentation method, computer device, and storage medium
CN110175986B (en) * 2019-04-23 2021-01-08 浙江科技学院 Stereo image visual saliency detection method based on convolutional neural network
CN110223295B (en) * 2019-06-21 2022-05-03 安徽大学 Significance prediction method and device based on deep neural network color perception
CN111611834A (en) * 2019-12-23 2020-09-01 珠海大横琴科技发展有限公司 Ship identification method and device based on SAR
CN111242138B (en) * 2020-01-11 2022-04-01 杭州电子科技大学 RGBD significance detection method based on multi-scale feature fusion
CN111524090A (en) * 2020-01-13 2020-08-11 镇江优瞳智能科技有限公司 Depth prediction image-based RGB-D significance detection method
CN113436245B (en) * 2021-08-26 2021-12-03 武汉市聚芯微电子有限责任公司 Image processing method, model training method, related device and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834933B (en) * 2014-02-10 2019-02-12 华为技术有限公司 Method and device for detecting salient regions of an image
CN104318569B (en) * 2014-10-27 2017-02-22 北京工业大学 Spatial salient region extraction method based on a depth variation model
CN105404888B (en) * 2015-11-16 2019-02-05 浙江大学 Salient object detection method combining color and depth information
CN105701508B (en) * 2016-01-12 2017-12-15 西安交通大学 Global-local optimization model and saliency detection algorithm based on multi-stage convolutional neural networks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7397851B2 (en) * 2001-05-10 2008-07-08 Roman Kendyl A Separate plane compression
CN102158712A (en) * 2011-03-22 2011-08-17 宁波大学 Multi-viewpoint video signal coding method based on vision
CN104103033A (en) * 2014-08-05 2014-10-15 四川九成信息技术有限公司 Image real-time processing method
CN104850836A (en) * 2015-05-15 2015-08-19 浙江大学 Automatic insect image identification method based on a deep convolutional neural network

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11729407B2 (en) 2018-10-29 2023-08-15 University Of Washington Saliency-based video compression systems and methods
WO2020119624A1 (en) * 2018-12-12 2020-06-18 中国科学院深圳先进技术研究院 Class-sensitive edge detection method based on deep learning
CN110097115A (en) * 2019-04-28 2019-08-06 南开大学 Salient object detection method based on an attention transfer mechanism
CN111914850A (en) * 2019-05-07 2020-11-10 百度在线网络技术(北京)有限公司 Picture feature extraction method, device, server and medium
CN111914850B (en) * 2019-05-07 2023-09-19 百度在线网络技术(北京)有限公司 Picture feature extraction method, device, server and medium
CN112990226A (en) * 2019-12-16 2021-06-18 中国科学院沈阳计算技术研究所有限公司 Salient object detection method based on machine learning
CN111275642A (en) * 2020-01-16 2020-06-12 西安交通大学 Low-illumination image enhancement method based on significant foreground content
CN111275642B (en) * 2020-01-16 2022-05-20 西安交通大学 Low-illumination image enhancement method based on significant foreground content
CN111583171A (en) * 2020-02-19 2020-08-25 西安工程大学 Insulator defect detection method integrating foreground compact characteristic and multi-environment information
CN111583171B (en) * 2020-02-19 2023-04-07 西安工程大学 Insulator defect detection method integrating foreground compact characteristic and multi-environment information
CN112488122A (en) * 2020-11-25 2021-03-12 南京航空航天大学 Panoramic image visual saliency prediction method based on convolutional neural network
CN112488122B (en) * 2020-11-25 2024-04-16 南京航空航天大学 Panoramic image visual saliency prediction method based on convolutional neural network

Also Published As

Publication number Publication date
CN106462771A (en) 2017-02-22

Similar Documents

Publication Publication Date Title
WO2018023734A1 (en) Significance testing method for 3d image
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN107622104B (en) Character image identification and marking method and system
AU2014368997B2 (en) System and method for identifying faces in unconstrained media
Yin et al. Hot region selection based on selective search and modified fuzzy C-means in remote sensing images
Wu et al. Research on image text recognition based on Canny edge detection algorithm and k-means algorithm
CN101520894B (en) Method for extracting salient objects based on regional saliency
CN106126585B (en) UAV image retrieval method combining quality grading with perceptual hash features
CN113762138B (en) Identification method, device, computer equipment and storage medium for fake face pictures
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN108389189B (en) Three-dimensional image quality evaluation method based on dictionary learning
CN111563408B (en) High-resolution image landslide automatic detection method with multi-level perception characteristics and progressive self-learning
CN106203448B (en) Scene classification method based on nonlinear scale space theory
Zhang et al. Deep learning features inspired saliency detection of 3D images
Xiao et al. Multiresolution-Based Rough Fuzzy Possibilistic C-Means Clustering Method for Land Cover Change Detection
CN102542590B (en) High-resolution SAR (Synthetic Aperture Radar) image marking method based on supervised topic model
Zemin et al. Image classification optimization algorithm based on SVM
CN106570124B (en) Remote sensing image semantic retrieval method and system based on object-level association rules
CN112598043B (en) Collaborative saliency detection method based on weakly supervised learning
Li et al. A review of advances in image inpainting research
Chen et al. Mmml: Multi-manifold metric learning for few-shot remote sensing image scene classification
Wang et al. Surface and underwater human pose recognition based on temporal 3D point cloud deep learning
Zhu et al. [Retracted] Basketball Object Extraction Method Based on Image Segmentation Algorithm
Ning et al. Construction of multi-channel fusion salient object detection network based on gating mechanism and pooling network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16911329

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 31.05.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 16911329

Country of ref document: EP

Kind code of ref document: A1