CN106157319B - Saliency detection method based on convolutional neural networks with region-level and pixel-level fusion - Google Patents
Saliency detection method based on convolutional neural networks with region-level and pixel-level fusion Download PDF Info
- Publication number
- CN106157319B CN201610604732.0A
- Authority
- CN
- China
- Prior art keywords
- pixel
- saliency
- level
- cnn
- region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a saliency detection method based on convolutional neural networks with region-level and pixel-level fusion. The object of study is still images, whose content may be arbitrary; the goal is to find the targets in an image that attract human visual attention and to assign them different saliency values. The invention mainly proposes an adaptive region generation technique and designs two CNN network structures, used for pixel-level saliency prediction and for saliency fusion, respectively. Both CNN models take an image as input, use the ground truth of the image as the supervisory signal for training, and finally output a saliency map of the same size as the input image. The invention can effectively perform region-level saliency estimation and pixel-level saliency prediction, obtaining two saliency maps, and finally fuses the two saliency maps together with the original image through the fusion CNN to obtain the final saliency map.
Description
Technical field
The present invention relates to an image processing method based on deep learning, and in particular to a saliency detection method based on convolutional neural networks with region-level and pixel-level fusion.
Background technology
With the development and rise of deep learning, saliency detection techniques based on deep learning have also continued to develop. Saliency detection can be divided into two major classes: bottom-up, data-driven models and top-down, task-driven models. Bottom-up saliency detection finds the eye-catching targets in a given arbitrary image, where the targets may be things of any category. Top-down saliency detection methods, in contrast, usually find targets of a given category in a given picture and assign them different saliency values. At present, bottom-up saliency detection methods are the most studied.
Existing bottom-up saliency detection methods can be divided into two classes: methods based on hand-designed features and methods based on convolutional neural networks. Methods based on hand-designed features usually use surface information in the image (such as color, texture, etc.) for feature extraction, and the extracted hand-crafted features cannot capture the deep characteristics and multi-scale information of salient targets, so good performance cannot be obtained. Recently, with the rise of deep learning, some researchers have begun to use convolutional neural networks for salient object detection. Most existing salient object detection methods based on convolutional neural networks (CNN) first divide the image into multiple regions, and then predict a saliency value for each region with a trained CNN model. However, these methods cannot obtain accurate pixel-level saliency predictions.
Invention content
To overcome the above problems, the present invention proposes a new CNN-based saliency detection method, namely a saliency detection method based on convolutional neural networks with region-level and pixel-level fusion. The method comprises three stages: region-level saliency estimation, pixel-level saliency prediction, and saliency fusion, with one CNN model for each stage. The method can obtain accurate pixel-level saliency maps, and can thus more effectively promote the development of saliency-based applications.
The purpose of the present invention is achieved through the following technical solutions:
The present invention provides a saliency detection method based on convolutional neural networks with region-level and pixel-level fusion. The object of study is still images, whose content may be arbitrary; the goal is to find the targets in an image that attract human visual attention and to assign them different saliency values. The invention mainly proposes an adaptive region generation technique and designs two CNN network structures, used for pixel-level saliency prediction and for saliency fusion, respectively. Both CNN models take an image as input, use the ground truth of the image as the supervisory signal for training the network models, and finally output a saliency map of the same size as the input image. The invention can effectively perform region-level saliency estimation and pixel-level saliency prediction, obtaining two saliency maps, and finally fuses the two saliency maps together with the original image through the fusion CNN to obtain the final saliency map. The overall system block diagram is shown in Fig. 1.
The specific implementation steps of the present invention are as follows:
One, region-level saliency estimation
First step: segment the input image I using the adaptive region generation technique
(1) Perform superpixel segmentation on the input image I using the SLIC algorithm to obtain n superpixels;
(2) Extract a simple feature vector from each superpixel to characterize it;
(3) Cluster the superpixels into different regions using a graph-based agglomerative clustering algorithm;
Second step: perform region saliency estimation using the Clarifai network model
(1) Randomly select m superpixels around the center line of each region;
(2) Construct m windows with the centers of the m superpixels as window centers, where these windows contain the whole image;
(3) Pass the m constructed window images through the CNN model to obtain m saliency values;
(4) Compute the mean of the m saliency values and use it as the saliency value of the region;
Two, pixel-level saliency prediction
(1) Use VGGNet as the pre-trained model, remove the last module of VGGNet, apply deconvolution to the outputs of the fourth and fifth modules, and concatenate them along the feature channel dimension for multi-scale feature learning; then apply a 1×1 convolution kernel to the concatenated feature map to obtain a probability map;
(2) In the pixel-level CNN training stage, compute the error between the probability map and the ground truth map using a cross-entropy loss function, and back-propagate the error to update the pixel-level CNN model parameters;
(3) After the pixel-level CNN is trained, directly input the image I into the pixel-level CNN model to predict its corresponding pixel-level saliency map;
Three, saliency fusion
(1) Build the fusion CNN network structure: the CNN network structure contains a concatenation layer, three convolutional layers, and a loss layer;
(2) Concatenate the input image I and the two saliency maps from steps one and two into a 5-channel image, then feed it into the three convolutional layers;
(3) In the fusion CNN training stage, compute the error between the output of the last convolutional layer and the ground truth saliency map using the cross-entropy loss function in the loss layer, and back-propagate the error to update the fusion CNN model parameters;
(4) At test time, directly input the image I into the trained fusion CNN model; the output of the last convolutional layer of the model is the finally predicted saliency map.
The invention has the following advantages:
1. The present invention proposes a new CNN-based saliency detection method that fully exploits the advantages of region-level saliency estimation and pixel-level saliency prediction, and achieves good saliency detection performance.
2. The present invention proposes an adaptive region generation technique that can generate different numbers of regions for different images while preserving object edges well.
3. The present invention designs a CNN network structure that can effectively mine the multi-scale information in an image; besides pixel-level saliency prediction, it can also be used for other pixel-classification tasks such as image segmentation.
4. The present invention proposes a new CNN-based saliency fusion strategy that not only makes full use of the complementary information between the saliency maps, but also uses the rich information in the original image, thereby greatly improving saliency detection performance.
Description of the drawings
Fig. 1 is the overall system block diagram of the present invention;
Fig. 2 shows examples of adaptive region generation results: a - original images, b - ground truth, c - superpixel segmentation results, d - region generation results;
Fig. 3 shows examples of region-level saliency estimation results: a - original images, b - ground truth, c - region-level results;
Fig. 4 is the pixel-level CNN network structure;
Fig. 5 shows examples of pixel-level saliency prediction results: a - original images, b - ground truth, c - pixel-level results;
Fig. 6 is the fusion CNN network structure;
Fig. 7 shows saliency detection results of the present invention: a - original images, b - ground truth, c - fusion results, d - pixel-level results, e - region-level results.
Specific implementation mode
The technical scheme of the present invention is further described below in conjunction with the accompanying drawings; however, it is not limited thereto. Any modification or equivalent replacement of the technical scheme of the invention that does not depart from its spirit and scope shall be covered by the protection scope of the present invention.
The present invention provides a saliency detection method based on convolutional neural networks with region-level and pixel-level fusion. The specific implementation steps are as follows:
One, region-level saliency estimation
In region-level saliency estimation, the first step is to generate a number of regions from the input image. The simplest approach is to use superpixels as regions for saliency estimation, but this makes it very difficult to determine the number of superpixels to segment. If the number of superpixels is too small, regions belonging to the same salient target may be under-segmented. If the number of superpixels is too large, regions belonging to the salient target or the background may be over-segmented. Whether under-segmented or over-segmented, the saliency values of the salient target or the background may become inconsistent. Therefore, different images, owing to their different characteristics, should be divided into different numbers of superpixels. To solve these problems, the present invention proposes an adaptive region generation technique for image segmentation. Given an input image I, the procedure of the region generation technique is as follows:
(1) Perform superpixel segmentation on I using the SLIC algorithm to obtain n superpixels. Considering both effectiveness and efficiency, in the present invention n=300.
(2) Extract a simple feature vector from each superpixel (containing the mean color in the Lab color space and the mean spatial position coordinates) to characterize it.
(3) Cluster the superpixels into different regions using a graph-based agglomerative clustering algorithm.
After the above process, superpixels that are similar in color and adjacent in the image I are usually clustered into the same region. For different images, the number of regions obtained after clustering also differs, and it is far smaller than the number of superpixels n. Fig. 2 gives three examples of results produced by the adaptive region generation technique.
After the regions are generated, the next step is region saliency estimation. The present invention uses the Clarifai network model (the CNN model that achieved top performance in the ImageNet 2013 image classification task) to perform region saliency estimation. Specifically, m superpixels are first randomly selected around the center line of each region, and then m windows are constructed with the centers of these m superpixels as window centers, where these windows contain the whole image. Superpixels around the region center line are selected so that (1) the centers of the constructed windows are as far as possible from the region boundary, and (2) the contents of windows from different regions are as different as possible. In the present invention, when a region contains more than 5 superpixels, m=5 is set; otherwise, m is set to the number of superpixels. Thus, for each region, m window images are constructed and passed through the CNN model, yielding m saliency values; their mean is computed and used as the saliency value of the region, making the region's saliency value more robust to noise. Fig. 3 gives three examples of region-level saliency estimation results.
Two, pixel-level saliency prediction
Although region-level saliency estimation can obtain consistent saliency maps with well-preserved edges, it cannot obtain saliency maps of pixel-level accuracy. For this, the present invention proposes a CNN network structure (denoted pixel-level CNN) for pixel-level saliency prediction. The pixel-level CNN takes the original image as input and outputs a saliency map of the same size as the original image. To obtain accurate saliency predictions, the CNN structure should be deep and possess multi-scale stages with different strides, so that multi-scale features with strong discriminative power can be learned for the image pixels. When the scale of the training set is small, effectively training such a network from scratch is a very difficult task. To overcome this problem, a good approach is to use a model already trained on a large-scale dataset (for example, the network models VGGNet and GoogleNet that were extremely successful on ImageNet) as a pre-trained model, and then fine-tune it on the small dataset of the target task, so that a model with strong learning ability can be trained.
The present invention builds the pixel-level CNN model by modifying the VGGNet model. VGGNet consists of six modules (blocks); the first five modules consist of convolutional layers (denoted conv) and pooling layers (denoted pooling), as shown in Fig. 4. The last module consists of a pooling layer and two fully connected layers. The present invention removes the last module of VGGNet. To exploit the multi-scale information of the image, the present invention fuses the outputs of the fourth and fifth modules to realize multi-scale feature learning. Since the output sizes of these two modules differ from each other and are much smaller than the size of the original image, in order for the pixel-level CNN model to automatically learn multi-scale features and perform pixel-level saliency prediction, the present invention first applies a deconvolution operation (denoted deconv) to the outputs of the last two modules so that their sizes match the original image, and concatenates them along the feature channel dimension (denoted concat). Then a 1×1 convolution kernel is applied to the concatenated feature map to obtain a probability map, in which larger values mean more salient. At test time, this probability map is in fact the saliency map of the input image. During training, a cross-entropy loss function (denoted loss) is used to compute the error between the probability map and the ground truth map, and the error is back-propagated to update the model parameters. At this point the entire pixel-level CNN network structure is complete, as shown in Fig. 4. In the model training stage, standard stochastic gradient descent is used to minimize the loss function. After the model is trained, an image is directly input into the model to predict its corresponding pixel-level saliency map. Fig. 5 gives three examples of pixel-level saliency prediction results.
Three, saliency fusion
For a given image, two saliency maps can be effectively obtained by the above process: the region-level saliency map and the pixel-level saliency map. Since they are computed by CNN models that exploit different information in the image, they are complementary. If they can be fused effectively, saliency detection performance will be further improved. The present invention designs a simple CNN network structure (denoted fusion CNN) and learns a nonlinear transformation to fully mine the complementary information between the region-level saliency map and the pixel-level saliency map, so as to achieve high performance. This CNN network structure contains a concatenation layer (concat), three convolutional layers (conv), and a loss layer (loss), as shown in Fig. 6. The original image and its two saliency maps are first concatenated into a 5-channel image, which is then fed into the subsequent three convolutional layers (see Fig. 6 for the concrete configuration). At test time, the output of the last convolutional layer is the finally predicted saliency map. During training, the cross-entropy loss function in the loss layer is used to compute the error between the output of the last convolutional layer and the ground truth saliency map. Note that in the saliency fusion method proposed by the present invention, besides the two obtained saliency maps, the original image is also used. This is because introducing the rich information of the original image can correct certain mistakes that cannot be corrected by fusing the saliency maps alone.
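Building the fusion network's 5-channel input is a plain channel concatenation; a minimal numpy sketch (shapes assumed H×W×3 for the image and H×W for each saliency map):

```python
import numpy as np

def fusion_input(image_rgb, region_map, pixel_map):
    """Build the fusion CNN input: concatenate the 3-channel image with
    the region-level and pixel-level saliency maps into one 5-channel
    tensor, as fed to the concat layer of the fusion network."""
    assert image_rgb.shape[:2] == region_map.shape == pixel_map.shape
    return np.concatenate(
        [image_rgb, region_map[..., None], pixel_map[..., None]], axis=-1)
```

Feeding the image alongside the two maps is what lets the fusion convolutions condition their correction on appearance, per the rationale above.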
The fusion CNN can be trained individually, or jointly learned with the preceding CNN networks for a final performance boost. In region-level saliency estimation, multiple regions must first be generated from the input image, and the region-level CNN then estimates a saliency value for each region. The pixel-level CNN and the fusion CNN, in contrast, directly take an image as input and directly output a saliency map, and are therefore end-to-end processes. It is thus difficult to merge the three CNN network structures into one unified network for end-to-end joint learning. To simplify this process, the three CNNs are first trained individually; then the pixel-level CNN and the fusion CNN are further jointly learned on the basis of the pre-training, as shown in Fig. 1. At test time, an image is input into the framework of Fig. 1, and the output of the fusion CNN is the finally predicted saliency map. Fig. 7 gives four example results of the saliency detection method proposed by the present invention; as shown in Fig. 7, the detected results are very close to the ground truth, which illustrates the effectiveness of the invention.
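At inference time, the framework of Fig. 1 amounts to composing the three stages. The sketch below uses stand-in callables for all three models, and the averaging fusion in the test is a toy substitute for the learned fusion CNN:

```python
def predict_saliency(image, region_model, pixel_model, fusion_model):
    """Inference through the full framework of Fig. 1: each branch model
    produces a saliency map for the image, and the fusion model combines
    the two maps together with the original image. All three callables
    are hypothetical stand-ins for the trained CNNs."""
    region_map = region_model(image)   # region-level saliency estimation
    pixel_map = pixel_model(image)     # pixel-level saliency prediction
    return fusion_model(image, region_map, pixel_map)  # saliency fusion
```

Because only this composition is fixed, each stage can be swapped or retrained independently, which is what makes the staged (individual, then partial joint) training scheme workable.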
Claims (2)
1. A saliency detection method based on convolutional neural networks with region-level and pixel-level fusion, characterized in that the steps of the method are as follows:
Step 1: region-level saliency estimation
First step: segment the input image I using the adaptive region generation technique
(1) Perform superpixel segmentation on the input image I using the SLIC algorithm to obtain n superpixels;
(2) Extract a simple feature vector from each superpixel to characterize it;
(3) Cluster the superpixels into different regions using a graph-based agglomerative clustering algorithm;
Second step: perform region saliency estimation using the Clarifai network model
(1) Randomly select m superpixels around the center line of each region;
(2) Construct m windows with the centers of the m superpixels as window centers, where these windows contain the whole image;
(3) Pass the m constructed window images through the CNN model to obtain m saliency values;
(4) Compute the mean of the m saliency values and use it as the saliency value of the region;
Step 2: pixel-level saliency prediction
(1) Use VGGNet as the pre-trained model, remove the last module of VGGNet, apply deconvolution to the outputs of the fourth and fifth modules, and concatenate them along the feature channel dimension for multi-scale feature learning; then apply a 1×1 convolution kernel to the concatenated feature map to obtain a probability map;
(2) In the pixel-level CNN training stage, compute the error between the probability map and the ground truth map using a cross-entropy loss function, and back-propagate the error to update the pixel-level CNN model parameters;
(3) After the pixel-level CNN is trained, directly input the image I into the pixel-level CNN model to predict its corresponding pixel-level saliency map;
Step 3: saliency fusion
(1) Build the fusion CNN network structure: the CNN network structure contains a concatenation layer, three convolutional layers, and a loss layer;
(2) Concatenate the input image I and the two saliency maps from steps 1 and 2 into a 5-channel image, then feed it into the three convolutional layers;
(3) In the fusion CNN training stage, compute the error between the output of the last convolutional layer and the ground truth saliency map using the cross-entropy loss function in the loss layer, and back-propagate the error to update the fusion CNN model parameters;
(4) At test time, directly input the image I into the trained fusion CNN model; the output of the last convolutional layer of the model is the finally predicted saliency map.
2. The saliency detection method based on convolutional neural networks with region-level and pixel-level fusion according to claim 1, characterized in that n=300.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610604732.0A CN106157319B (en) | 2016-07-28 | 2016-07-28 | Saliency detection method based on convolutional neural networks with region-level and pixel-level fusion
Publications (2)
Publication Number | Publication Date |
---|---|
CN106157319A CN106157319A (en) | 2016-11-23 |
CN106157319B true CN106157319B (en) | 2018-11-02 |
Family
ID=58060262
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610604732.0A Active CN106157319B (en) | 2016-07-28 | 2016-07-28 | Saliency detection method based on convolutional neural networks with region-level and pixel-level fusion
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106157319B (en) |
Families Citing this family (50)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709532B (en) * | 2017-01-25 | 2020-03-10 | 京东方科技集团股份有限公司 | Image processing method and device |
CN106911930A (en) * | 2017-03-03 | 2017-06-30 | 深圳市唯特视科技有限公司 | It is a kind of that the method for perceiving video reconstruction is compressed based on recursive convolution neutral net |
CN106934397B (en) * | 2017-03-13 | 2020-09-01 | 北京市商汤科技开发有限公司 | Image processing method and device and electronic equipment |
CN107016409A (en) * | 2017-03-20 | 2017-08-04 | 华中科技大学 | A kind of image classification method and system based on salient region of image |
CN107169954B (en) * | 2017-04-18 | 2020-06-19 | 华南理工大学 | Image significance detection method based on parallel convolutional neural network |
CN107194933A (en) * | 2017-04-24 | 2017-09-22 | 天津大学 | With reference to convolutional neural networks and the brain tumor dividing method and device of fuzzy reasoning |
CN109118459B (en) * | 2017-06-23 | 2022-07-19 | 南开大学 | Image salient object detection method and device |
CN107369160B (en) * | 2017-06-28 | 2020-04-03 | 苏州比格威医疗科技有限公司 | Choroid neogenesis blood vessel segmentation algorithm in OCT image |
CN107437246B (en) * | 2017-07-05 | 2020-08-18 | 浙江大学 | Common significance detection method based on end-to-end full-convolution neural network |
JP6907774B2 (en) * | 2017-07-14 | 2021-07-21 | オムロン株式会社 | Object detectors, object detection methods, and programs |
CN107368831B (en) * | 2017-07-19 | 2019-08-02 | 中国人民解放军国防科学技术大学 | English words and digit recognition method in a kind of natural scene image |
CN107506792B (en) * | 2017-08-16 | 2020-09-29 | 广西荷福智能科技有限公司 | Semi-supervised salient object detection method |
US11222255B2 (en) * | 2017-08-17 | 2022-01-11 | Samsung Electronics Co., Ltd. | Neuromorphic processing apparatus |
CN107730546B (en) * | 2017-08-25 | 2020-11-03 | 华北电力大学(保定) | Image depth feature determination method and system |
CN107784308B (en) * | 2017-10-09 | 2020-04-03 | 哈尔滨工业大学 | Saliency target detection method based on chain type multi-scale full-convolution network |
CN107766810B (en) * | 2017-10-10 | 2021-05-14 | 湖南省测绘科技研究所 | Cloud and shadow detection method |
CN107886533B (en) * | 2017-10-26 | 2021-05-04 | 深圳大学 | Method, device and equipment for detecting visual saliency of three-dimensional image and storage medium |
CN107945204B (en) * | 2017-10-27 | 2021-06-25 | 西安电子科技大学 | Pixel-level image matting method based on generation countermeasure network |
CN107767383B (en) * | 2017-11-01 | 2021-05-11 | 太原理工大学 | Road image segmentation method based on superpixels |
CN107945109B (en) * | 2017-11-06 | 2020-07-28 | 清华大学 | Image splicing method and device based on convolutional network |
CN107967474A (en) * | 2017-11-24 | 2018-04-27 | 上海海事大学 | A kind of sea-surface target conspicuousness detection method based on convolutional neural networks |
CN108345850B (en) * | 2018-01-23 | 2021-06-01 | 哈尔滨工业大学 | Scene text detection method based on region classification of stroke feature transformation and deep learning of superpixel |
CN108389182B (en) * | 2018-01-24 | 2020-07-17 | 北京卓视智通科技有限责任公司 | Image quality detection method and device based on deep neural network |
CN110111295B (en) * | 2018-02-01 | 2021-06-11 | 北京中科奥森数据科技有限公司 | Image collaborative saliency detection method and device |
CN108961220B (en) * | 2018-06-14 | 2022-07-12 | 上海大学 | Image collaborative saliency detection method based on multilayer convolution feature fusion |
CN109086777B (en) * | 2018-07-09 | 2021-09-28 | 南京师范大学 | Saliency map refining method based on global pixel characteristics |
CN108960261B (en) * | 2018-07-25 | 2021-09-24 | 扬州万方电子技术有限责任公司 | Salient object detection method based on attention mechanism |
CN109409222B (en) * | 2018-09-20 | 2020-10-30 | 中国地质大学(武汉) | Multi-view facial expression recognition method based on mobile terminal |
CN109389056B (en) * | 2018-09-21 | 2020-05-26 | 北京航空航天大学 | Space-based multi-view-angle collaborative track surrounding environment detection method |
RU2703270C1 (en) | 2018-10-31 | 2019-10-16 | Общество с ограниченной ответственностью "Аби Продакшн" | Optical character recognition using specialized confidence functions, implemented on the basis of neural networks |
CN109409435B (en) * | 2018-11-01 | 2022-07-15 | 上海大学 | Depth perception significance detection method based on convolutional neural network |
CN109934241B (en) * | 2019-03-28 | 2022-12-09 | 南开大学 | Image multi-scale information extraction method capable of being integrated into neural network architecture |
CN110084221B (en) * | 2019-05-08 | 2023-02-03 | 南京云智控产业技术研究院有限公司 | Serialized human face key point detection method with relay supervision based on deep learning |
CN110166850B (en) * | 2019-05-30 | 2020-11-06 | 上海交通大学 | Method and system for predicting panoramic video watching position by multiple CNN networks |
CN110222704B (en) * | 2019-06-12 | 2022-04-01 | 北京邮电大学 | Weak supervision target detection method and device |
CN110390327B (en) * | 2019-06-25 | 2022-06-28 | 北京百度网讯科技有限公司 | Foreground extraction method and device, computer equipment and storage medium |
CN112149459B (en) * | 2019-06-27 | 2023-07-25 | 哈尔滨工业大学(深圳) | Video saliency object detection model and system based on cross attention mechanism |
CN110472639B (en) * | 2019-08-05 | 2023-04-18 | 山东工商学院 | Target extraction method based on saliency prior information |
CN111583173B (en) * | 2020-03-20 | 2023-12-01 | 北京交通大学 | RGB-D image saliency target detection method |
CN111311532B (en) * | 2020-03-26 | 2022-11-11 | 深圳市商汤科技有限公司 | Image processing method and device, electronic device and storage medium |
CN111598841B (en) * | 2020-04-23 | 2022-04-15 | 南开大学 | Instance saliency detection method based on a regularized densely connected feature pyramid |
CN111260653B (en) * | 2020-04-27 | 2020-08-25 | 腾讯科技(深圳)有限公司 | Image segmentation method and device, storage medium and electronic equipment |
CN111696021B (en) * | 2020-06-10 | 2023-03-28 | 中国人民武装警察部队工程大学 | Image self-adaptive steganalysis system and method based on significance detection |
CN111915613B (en) * | 2020-08-11 | 2023-06-13 | 华侨大学 | Image instance segmentation method, device, equipment and storage medium |
CN112465700B (en) * | 2020-11-26 | 2022-04-26 | 北京航空航天大学 | Image splicing positioning device and method based on depth clustering |
CN112541912B (en) * | 2020-12-23 | 2024-03-12 | 中国矿业大学 | Rapid detection method and device for salient targets in mine sudden disaster scene |
CN113034365B (en) * | 2021-03-19 | 2023-09-22 | 西安电子科技大学 | Multi-picture splicing method and system based on super pixels |
CN114004775B (en) * | 2021-11-30 | 2023-07-04 | 四川大学 | Infrared and visible light image fusion method combining potential low-rank representation and convolutional neural network |
CN114255351B (en) * | 2022-02-28 | 2022-05-27 | 魔门塔(苏州)科技有限公司 | Image processing method, device, medium, equipment and driving system |
CN115965844B (en) * | 2023-01-04 | 2023-08-18 | 哈尔滨工业大学 | Multi-focus image fusion method based on visual saliency priori knowledge |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102222231A (en) * | 2011-05-26 | 2011-10-19 | 厦门大学 | Visual attention computational model based on guidance of dorsal pathway and processing method thereof |
CN102567731A (en) * | 2011-12-06 | 2012-07-11 | 北京航空航天大学 | Extraction method for region of interest |
- 2016-07-28: Application CN201610604732.0A filed in China (CN); granted as CN106157319B (status: Active)
Non-Patent Citations (4)
Title |
---|
Offline Signature Verification based on ASIFT; Youbao Tang et al.; International Conference on Biometrics (ICB); 2013-12-31; No. 1; pp. 1-6 *
Offline Text-Independent Writer Identification Based on Scale Invariant Feature Transform; Xiangqian Wu et al.; IEEE Transactions on Information Forensics and Security; 2014-03; Vol. 9, No. 3; pp. 526-536 *
Saliency Detection based on Graph-Structural Agglomerative Clustering; Youbao Tang et al.; ACM International Conference on Multimedia (ACM MM); 2015-12-31; No. 1; pp. 1083-1086 *
Saliency Detection with Deep Convolutional Neural Networks; Li Yueyun et al.; Journal of Image and Graphics; 2016-01-31; Vol. 21, No. 1; pp. 53-59 *
Also Published As
Publication number | Publication date |
---|---|
CN106157319A (en) | 2016-11-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106157319B (en) | Saliency detection method based on convolutional neural networks with region- and pixel-level fusion | |
CN110111335B (en) | Urban traffic scene semantic segmentation method and system based on adaptive adversarial learning | |
CN107203781B (en) | End-to-end weakly supervised target detection method based on saliency guidance | |
CN110852316B (en) | Image tampering detection and localization method using a densely structured convolutional network | |
CN103984959B (en) | Image classification method based on data- and task-driven learning | |
CN107480726A (en) | Scene semantic segmentation method based on fully convolutional networks and long short-term memory units | |
CN109711413A (en) | Image semantic segmentation method based on deep learning | |
CN105139395B (en) | SAR image segmentation method based on wavelet-pooling convolutional neural networks | |
CN109902806A (en) | Object bounding-box determination method for noisy images based on convolutional neural networks | |
CN106920243A (en) | Sequence image segmentation method for ceramic material parts based on improved fully convolutional neural networks | |
CN107403430A (en) | RGB-D image semantic segmentation method | |
CN107680106A (en) | Salient object detection method based on Faster R-CNN | |
CN109002755B (en) | Age estimation model construction method and estimation method based on face images | |
CN109583340A (en) | Video object detection method based on deep learning | |
CN107066916B (en) | Scene semantic segmentation method based on deconvolutional neural networks | |
CN106981080A (en) | Scene depth estimation method for night-time unmanned vehicles based on infrared images and radar data | |
CN105912999A (en) | Human behavior recognition method based on depth information | |
CN104966286A (en) | 3D video saliency detection method | |
CN107247952B (en) | Visual saliency detection method using a recurrent convolutional neural network with deep supervision | |
CN110334589A (en) | Action recognition method using a high-temporal-resolution 3D neural network based on dilated convolutions | |
CN105184772A (en) | Adaptive color image segmentation method based on superpixels | |
CN109858487A (en) | Weakly supervised semantic segmentation method based on the watershed algorithm and image category labels | |
CN110472634A (en) | Change detection method based on a multi-scale deep feature difference fusion network | |
CN112488229A (en) | Domain-adaptive unsupervised target detection method based on feature separation and alignment | |
CN111882620A (en) | Road drivable area segmentation method based on multi-scale information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||