CN106157319A - Saliency detection method based on region-level and pixel-level fusion with convolutional neural networks - Google Patents
Saliency detection method based on region-level and pixel-level fusion with convolutional neural networks
- Publication number
- CN106157319A (application CN201610604732.0A)
- Authority
- CN
- China
- Prior art keywords
- pixel
- saliency
- level
- cnn
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/23 — Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Clustering techniques
- G06N3/04 — Physics; Computing; Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology
- G06V10/40 — Physics; Computing; Image or video recognition or understanding; Extraction of image or video features
- G06T2207/20084 — Physics; Computing; Image data processing; Indexing scheme for image analysis or enhancement; Special algorithmic details; Artificial neural networks [ANN]
- G06T2207/20221 — Physics; Computing; Image data processing; Special algorithmic details; Image fusion; Image merging
Abstract
The invention discloses a saliency detection method based on region-level and pixel-level fusion with convolutional neural networks (CNNs). The object of study is a still image whose content may be arbitrary; the goal is to find the targets in the image that attract human visual attention and to assign them different saliency values. The invention mainly proposes an adaptive region generation technique and designs two CNN architectures, used respectively for pixel-level saliency prediction and for saliency fusion. Both CNN models take an image as input, use the ground-truth result of the image as the supervisory signal for training, and finally output a saliency map of the same size as the input image. The invention can effectively perform region-level saliency estimation and pixel-level saliency prediction, yielding two saliency maps; finally, the saliency-fusion CNN merges the two saliency maps with the original image to obtain the final saliency map.
Description
Technical field
The present invention relates to an image processing method based on deep learning, and specifically to a saliency detection method based on region-level and pixel-level fusion with convolutional neural networks.
Background art
With the development and rise of deep learning, saliency detection techniques based on deep learning have also developed. Saliency detection can be divided into two broad classes: bottom-up, data-driven models and top-down, task-driven models. Bottom-up saliency detection takes an arbitrary input image and finds the targets in it that attract attention; these targets may belong to any category. Top-down saliency detection methods, by contrast, generally find targets of a given category in a given picture and assign them different saliency values. At present, bottom-up saliency detection methods are the more widely studied.
Existing bottom-up saliency detection methods fall into two classes: methods based on hand-designed features and methods based on convolutional neural networks. Because hand-designed methods generally extract features from surface information in the image (such as color and texture), the resulting manual features cannot capture the deep characteristics and multi-scale information of salient targets and therefore cannot achieve good performance. Recently, with the rise of deep learning, some researchers have begun to use convolutional neural networks for salient object detection. Most existing CNN-based salient object detection methods first divide the image into multiple regions and then predict a saliency value for each region with a trained CNN model. However, these methods cannot obtain accurate pixel-level saliency predictions.
Summary of the invention
To overcome the above problems, the present invention proposes a new CNN-based saliency detection method, namely a saliency detection method based on region-level and pixel-level fusion with convolutional neural networks. The method comprises three stages: region-level saliency estimation, pixel-level saliency prediction, and saliency fusion, each stage involving one CNN model. The method obtains an accurate pixel-level saliency map and can therefore more effectively advance saliency-based applications.
The object of the invention is achieved through the following technical solution:
The invention provides a saliency detection method based on region-level and pixel-level fusion with convolutional neural networks. The object of study is a still image whose content may be arbitrary; the goal is to find the targets in the image that attract human visual attention and to assign them different saliency values. The invention mainly proposes an adaptive region generation technique and designs two CNN architectures, used respectively for pixel-level saliency prediction and for saliency fusion. Both CNN models take an image as input, use the ground-truth result of the image as the supervisory signal for training, and finally output a saliency map of the same size as the input image. The invention can effectively perform region-level saliency estimation and pixel-level saliency prediction to obtain two saliency maps, and finally uses the saliency-fusion CNN to merge the two saliency maps with the original image into the final saliency map; the overall system block diagram is shown in Fig. 1.
The specific implementation steps of the present invention are as follows:
One. Region-level saliency estimation
Step 1: segment the input image I with the adaptive region generation technique
(1) apply the SLIC algorithm to perform superpixel segmentation of the input image I, obtaining n superpixels;
(2) extract a simple feature vector from each superpixel to characterize it;
(3) cluster the superpixels with a graph-based agglomerative clustering algorithm to obtain the regions;
Step 2: estimate region saliency with the Clarifai network model
(1) randomly select m superpixels around the centerline of each region;
(2) construct m windows whose centers are the centers of the m superpixels, these windows containing the entire image;
(3) pass the m constructed windows through the CNN model to obtain m saliency values;
(4) take the mean of the m saliency values as the saliency value of the region;
Two. Pixel-level saliency prediction
(1) use VGGNet as the pre-trained model, remove its last module, apply a deconvolution operation to the outputs of the fourth and fifth modules, and concatenate them along the feature-channel dimension for multi-scale feature learning; then apply a convolution with a 1×1 kernel to the concatenated feature map to obtain a probability map;
(2) during the pixel-level CNN training stage, compute the error between the probability map and the ground-truth map with a cross-entropy loss function, and back-propagate the error to update the pixel-level CNN parameters;
(3) after the pixel-level CNN is trained, feed the input image I directly into it to predict the corresponding pixel-level saliency map;
Three. Saliency fusion
(1) build the fusion CNN: the network comprises one concatenation layer, three convolutional layers, and one loss layer;
(2) concatenate the input image I with the two saliency maps from steps One and Two into a 5-channel image, which is fed to the three convolutional layers;
(3) during the fusion CNN training stage, use the cross-entropy loss function in the loss layer to compute the error between the output of the last convolutional layer and the true saliency map, and back-propagate the error to update the fusion CNN parameters;
(4) at test time, feed the input image I directly into the trained fusion CNN; the output of its last convolutional layer is the final predicted saliency map.
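The three stages above compose into one pipeline. A minimal control-flow sketch (the three model arguments are hypothetical stand-ins for the trained region-level, pixel-level, and fusion CNNs; no real network is involved):

```python
import numpy as np

def detect_saliency(image, region_cnn, pixel_cnn, fusion_cnn):
    """High-level flow of the three-stage method: region-level saliency
    estimation, pixel-level saliency prediction, then CNN-based fusion
    of both maps together with the original image."""
    region_map = region_cnn(image)                    # stage one
    pixel_map = pixel_cnn(image)                      # stage two
    return fusion_cnn(image, region_map, pixel_map)   # stage three

# Toy stand-ins: each "model" returns a map the size of the input image.
img = np.zeros((8, 8, 3))
final = detect_saliency(
    img,
    region_cnn=lambda im: np.full(im.shape[:2], 0.5),
    pixel_cnn=lambda im: np.full(im.shape[:2], 0.7),
    fusion_cnn=lambda im, r, p: (r + p) / 2)          # averaging stand-in
print(final.shape)  # (8, 8)
```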
The present invention has the following advantages:
1. The present invention proposes a new CNN-based saliency detection method that fully exploits the respective advantages of region-level saliency estimation and pixel-level saliency prediction, and achieves good saliency detection performance.
2. The present invention proposes an adaptive region generation technique that generates a different number of regions for different images while preserving object edges well.
3. The present invention designs a CNN architecture that effectively mines the multi-scale information in an image; besides pixel-level saliency prediction, it can also be applied to pixel-classification tasks such as image segmentation.
4. The present invention proposes a new CNN-based saliency fusion strategy that exploits not only the complementary information between the saliency maps but also the rich information in the original image, thereby greatly improving saliency detection performance.
Brief description of the drawings
Fig. 1 is the overall system block diagram of the present invention;
Fig. 2 shows examples of adaptive region generation: a - original image, b - ground truth, c - superpixel segmentation result, d - region generation result;
Fig. 3 shows examples of region-level saliency estimation: a - original image, b - ground truth, c - region-level result;
Fig. 4 shows the pixel-level CNN architecture;
Fig. 5 shows examples of pixel-level saliency prediction: a - original image, b - ground truth, c - pixel-level result;
Fig. 6 shows the fusion CNN architecture;
Fig. 7 shows saliency detection results of the present invention: a - original image, b - ground truth, c - fusion result, d - pixel-level result, e - region-level result.
Detailed description of the invention
The technical solution of the present invention is further described below with reference to the accompanying drawings, but it is not limited thereto; any modification of or equivalent substitution for the technical solution of the present invention that does not depart from its spirit and scope shall be covered by the protection scope of the present invention.
The invention provides a saliency detection method based on region-level and pixel-level fusion with convolutional neural networks; the specific implementation steps are as follows:
One. Region-level saliency estimation
The first step of region-level saliency estimation is to generate a number of regions from the input image. The simplest approach is to use the superpixels themselves as regions for saliency estimation, but it then becomes very difficult to choose the number of superpixels. If there are too few, regions belonging to the same salient target may be under-segmented; if there are too many, regions belonging to the salient target or the background may be over-segmented. Either under- or over-segmentation may make the saliency values of the salient target or the background inconsistent. Different images, owing to their different characteristics, should therefore be divided into different numbers of superpixels. To solve this problem, the present invention proposes an adaptive region generation technique for image segmentation. Given an input image I, the procedure is as follows:
(1) Apply the SLIC algorithm to segment I into n superpixels. Balancing effectiveness and efficiency, the present invention uses n = 300.
(2) Extract a simple feature vector from each superpixel (containing the mean color in the Lab color space and the mean spatial coordinates) to characterize it.
(3) Cluster the superpixels with a graph-based agglomerative clustering algorithm to obtain the regions.
After this procedure, superpixels in image I that are similar in color and adjacent in space are generally clustered into the same region. The number of regions after the final clustering differs from image to image and is far smaller than the superpixel number n. Fig. 2 gives three examples of regions obtained by the adaptive region generation technique.
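The merging step can be illustrated with a small sketch. The SLIC segmentation itself is assumed already done (e.g., by an off-the-shelf implementation); the graph-based agglomerative clustering is approximated here by a union-find merge of adjacent superpixels whose feature distance falls below a threshold, which captures the idea but is not the patent's exact algorithm:

```python
import numpy as np

def cluster_superpixels(features, adjacency, threshold):
    """Greedy agglomerative clustering of superpixels on an adjacency
    graph: adjacent superpixels whose feature distance is below
    `threshold` are merged into one region (union-find)."""
    n = len(features)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Sort edges by feature distance and merge the cheapest first.
    edges = sorted(
        ((np.linalg.norm(features[i] - features[j]), i, j)
         for i, j in adjacency),
        key=lambda e: e[0])
    for dist, i, j in edges:
        if dist < threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj
    return [find(i) for i in range(n)]

# Toy example: 4 superpixels, features = (L, a, b, x, y) means.
feats = np.array([[10, 0, 0, 0.1, 0.1],
                  [11, 0, 0, 0.2, 0.1],   # similar to superpixel 0
                  [80, 5, 5, 0.8, 0.8],
                  [82, 5, 5, 0.9, 0.8]])  # similar to superpixel 2
adj = [(0, 1), (1, 2), (2, 3)]
labels = cluster_superpixels(feats, adj, threshold=5.0)
print(labels)  # superpixels 0/1 share one region, 2/3 another
```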
Once the regions are generated, the next step is region saliency estimation. The present invention uses the Clarifai network model (the CNN model that achieved top performance in the ImageNet 2013 image classification task) for region saliency estimation. Specifically, m superpixels are first randomly selected around the centerline of each region; then m windows are built with the centers of these m superpixels as window centers, these windows containing the entire image. Superpixels around the region centerline are chosen so that (1) the centers of the constructed windows are as far as possible from the region boundary, and (2) the contents of windows from different regions differ as much as possible. In the present invention, m = 5 when a region contains more than 5 superpixels; otherwise m is set to the number of superpixels. Thus, for each region, m windows are constructed; after passing through the CNN model, m saliency values are obtained, and their mean is taken as the saliency value of the region, making that value more robust to noise. Fig. 3 gives three examples of region-level saliency estimation.
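The window-averaging rule in this step is simple enough to state directly in code. A sketch (the per-window scores are made-up numbers standing in for Clarifai-model outputs):

```python
import numpy as np

def region_saliency(n_superpixels, window_scores):
    """Average the per-window CNN scores of one region into a single
    saliency value. Per the text, m = 5 windows when the region has
    more than 5 superpixels, otherwise m = number of superpixels;
    averaging makes the estimate robust to noise in any one window."""
    m = 5 if n_superpixels > 5 else n_superpixels
    scores = np.asarray(window_scores[:m], dtype=float)
    return float(scores.mean())

# A region with 8 superpixels -> m = 5 window scores are averaged.
val = region_saliency(8, [0.9, 0.8, 0.85, 0.95, 0.7])
print(round(val, 2))  # 0.84
```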
Two. Pixel-level saliency prediction
Although region-level saliency estimation yields saliency maps that are internally consistent and preserve edges well, it cannot reach pixel-level precision. To this end, the present invention proposes a CNN architecture (denoted pixel-level CNN) for pixel-level saliency prediction. The pixel-level CNN takes the original image as input and outputs a saliency map of the same size. To obtain accurate saliency predictions, this CNN must be deep and contain multi-scale stages with different strides, so that it can learn highly discriminative multi-scale features for image pixels. When the training data is limited, training such a network effectively from scratch is very difficult. A good way to overcome this is to take a model trained on a large-scale dataset (such as the highly successful ImageNet models VGGNet and GoogleNet) as a pre-trained model and fine-tune it on the small dataset of the target task, thereby obtaining a model with strong learning capacity.
The present invention builds the pixel-level CNN by modifying the VGGNet model. VGGNet consists of six modules (blocks); the first five consist of convolutional layers (denoted conv) and pooling layers (denoted pooling), as shown in Fig. 4. The last module consists of a pooling layer and two fully connected layers; the present invention removes this last module. To exploit the multi-scale information of the image, the present invention fuses the outputs of the fourth and fifth modules for multi-scale feature learning. Because the outputs of these two modules differ in size and are much smaller than the original image, the present invention first applies a deconvolution operation (denoted deconv) to each output so that its size matches the original image, and then concatenates them (denoted concat) along the feature-channel dimension; this lets the pixel-level CNN learn multi-scale features automatically for pixel-level saliency prediction. A convolution with a 1×1 kernel is then applied to the concatenated feature map to obtain a probability map, in which a larger value means more salient. At test time this probability map is in fact the saliency map of the input image. During training, a cross-entropy loss function (denoted loss) computes the error between this probability map and the ground-truth map, and the error is back-propagated to update the model parameters. With this, the whole pixel-level CNN architecture is built, as shown in Fig. 4. In the training stage, standard stochastic gradient descent is used to minimize the loss function. After training, an image fed directly into the model yields its predicted pixel-level saliency map. Fig. 5 gives three examples of pixel-level saliency prediction.
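The multi-scale head described above (deconv to input size, channel concat, 1×1 conv, per-pixel probability) can be sketched shape-wise in plain NumPy. The learned deconvolution is replaced by nearest-neighbour upsampling and the 1×1 conv weights are random, so this demonstrates only the tensor plumbing, not the trained network:

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbour stand-in for the learned deconvolution that
    restores a (C, H, W) feature map to input resolution."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multiscale_head(block4, block5, w, b):
    """Upsample both VGG block outputs to input size, concatenate along
    the channel axis, then apply a 1x1 convolution (a per-pixel dot
    product with `w`) plus sigmoid to get the per-pixel saliency
    probability map. `w` and `b` would be learned; random here."""
    f4 = upsample(block4, 8)    # stride-8 block back to input size
    f5 = upsample(block5, 16)   # stride-16 block back to input size
    feats = np.concatenate([f4, f5], axis=0)          # (C4+C5, H, W)
    logits = np.tensordot(w, feats, axes=(0, 0)) + b  # 1x1 conv
    return sigmoid(logits)

rng = np.random.default_rng(0)
b4 = rng.normal(size=(4, 4, 4))    # toy (C4, H/8, W/8) feature map
b5 = rng.normal(size=(6, 2, 2))    # toy (C5, H/16, W/16) feature map
w = rng.normal(size=10)            # 1x1 kernel over 4 + 6 channels
prob = multiscale_head(b4, b5, w, 0.0)
print(prob.shape)  # (32, 32): one saliency probability per pixel
```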
Three. Saliency fusion
For a given image, the above procedure yields two saliency maps: the region-level saliency map and the pixel-level saliency map. Because they are computed by CNN models that exploit different information in the image, they are complementary; fusing them effectively is bound to further improve saliency detection performance.
The present invention designs a simple CNN architecture (denoted fusion CNN) to learn a nonlinear transformation that fully mines the complementary information between the region-level and pixel-level saliency maps, thereby improving performance. The fusion CNN contains one concatenation layer (concat), three convolutional layers (conv), and one loss layer (loss), as shown in Fig. 6. The original image and its two saliency maps are first concatenated into a 5-channel image, which is then fed to the three subsequent convolutional layers (see Fig. 6 for the concrete configuration). At test time, the output of the last convolutional layer is the final predicted saliency map. During training, the cross-entropy loss function in the loss layer computes the error between the output of the last convolutional layer and the true saliency map. Note that the proposed fusion method uses not only the two saliency maps but also the original image, because the rich information the original image introduces can correct some mistakes that cannot be corrected by fusing the saliency maps alone.
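The 5-channel input construction and the cross-entropy loss used by the fusion CNN can be sketched as follows (the convolutional layers themselves are omitted; the channel stacking and the loss formula are the point):

```python
import numpy as np

def fuse_inputs(image, region_map, pixel_map):
    """Stack the RGB image (3, H, W) with the region-level and
    pixel-level saliency maps (H, W each) into the 5-channel tensor
    fed to the fusion CNN's three convolutional layers."""
    return np.concatenate(
        [image, region_map[None], pixel_map[None]], axis=0)

def cross_entropy(pred, target, eps=1e-7):
    """Per-pixel binary cross-entropy between a predicted saliency map
    and the ground-truth map, as in the fusion CNN's loss layer."""
    p = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(p)
                          + (1 - target) * np.log(1 - p)))

img = np.zeros((3, 8, 8))
reg = np.full((8, 8), 0.6)   # toy region-level saliency map
pix = np.full((8, 8), 0.8)   # toy pixel-level saliency map
x = fuse_inputs(img, reg, pix)
print(x.shape)  # (5, 8, 8)
print(round(cross_entropy(pix, np.ones((8, 8))), 4))  # -ln(0.8) ~ 0.2231
```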
The fusion CNN can be trained on its own, or jointly with the preceding CNNs for a final performance gain. Region-level saliency estimation must first generate regions from the input image and then estimate a saliency value for each region with the region-level CNN, whereas the pixel-level CNN and the fusion CNN take an image as input and directly output a saliency map, i.e., they are end-to-end. It is therefore difficult to merge the three CNNs into one unified network for end-to-end joint learning. To simplify this, the three CNNs are first trained individually; then the pixel-level CNN and the fusion CNN are further jointly trained on the basis of this pre-training, as shown in Fig. 1. At test time, the image is fed into the framework of Fig. 1, and the output of the fusion CNN is the final predicted saliency map. Fig. 7 gives four detection examples of the proposed saliency detection method; as shown in Fig. 7, the detection results are very close to the ground truth, demonstrating the effectiveness of the invention.
Claims (2)
1. A saliency detection method based on region-level and pixel-level fusion with convolutional neural networks, characterized in that the method steps are as follows:
One. Region-level saliency estimation
Step 1: segment the input image I with the adaptive region generation technique
(1) apply the SLIC algorithm to perform superpixel segmentation of the input image I, obtaining n superpixels;
(2) extract a simple feature vector from each superpixel to characterize it;
(3) cluster the superpixels with a graph-based agglomerative clustering algorithm to obtain the regions;
Step 2: estimate region saliency with the Clarifai network model
(1) randomly select m superpixels around the centerline of each region;
(2) construct m windows whose centers are the centers of the m superpixels, these windows containing the entire image;
(3) pass the m constructed windows through the CNN model to obtain m saliency values;
(4) take the mean of the m saliency values as the saliency value of the region;
Two. Pixel-level saliency prediction
(1) use VGGNet as the pre-trained model, remove its last module, apply a deconvolution operation to the outputs of the fourth and fifth modules, and concatenate them along the feature-channel dimension for multi-scale feature learning; then apply a convolution with a 1×1 kernel to the concatenated feature map to obtain a probability map;
(2) during the pixel-level CNN training stage, compute the error between the probability map and the ground-truth map with a cross-entropy loss function, and back-propagate the error to update the pixel-level CNN parameters;
(3) after the pixel-level CNN is trained, feed the input image I directly into it to predict the corresponding pixel-level saliency map;
Three. Saliency fusion
(1) build the fusion CNN: the network comprises one concatenation layer, three convolutional layers, and one loss layer;
(2) concatenate the input image I with the two saliency maps from steps One and Two into a 5-channel image, which is fed to the three convolutional layers;
(3) during the fusion CNN training stage, use the cross-entropy loss function in the loss layer to compute the error between the output of the last convolutional layer and the true saliency map, and back-propagate the error to update the fusion CNN parameters;
(4) at test time, feed the input image I directly into the trained fusion CNN; the output of its last convolutional layer is the final predicted saliency map.
2. The saliency detection method based on region-level and pixel-level fusion with convolutional neural networks according to claim 1, characterized in that n = 300.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610604732.0A (CN106157319B, granted) | 2016-07-28 | 2016-07-28 | Saliency detection method based on region-level and pixel-level fusion with convolutional neural networks
Publications (2)
Publication Number | Publication Date |
---|---|
CN106157319A (application publication) | 2016-11-23
CN106157319B (granted publication) | 2018-11-02
Family
ID=58060262
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610604732.0A (CN106157319B, active) | Saliency detection method based on region-level and pixel-level fusion with convolutional neural networks | 2016-07-28 | 2016-07-28
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106157319B (en) |
Cited By (50)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709532A (en) * | 2017-01-25 | 2017-05-24 | 京东方科技集团股份有限公司 | Image processing method and device |
CN106911930A (en) * | 2017-03-03 | 2017-06-30 | 深圳市唯特视科技有限公司 | It is a kind of that the method for perceiving video reconstruction is compressed based on recursive convolution neutral net |
CN106934397A (en) * | 2017-03-13 | 2017-07-07 | 北京市商汤科技开发有限公司 | Image processing method, device and electronic equipment |
CN107016409A (en) * | 2017-03-20 | 2017-08-04 | 华中科技大学 | A kind of image classification method and system based on salient region of image |
CN107169954A (en) * | 2017-04-18 | 2017-09-15 | 华南理工大学 | A kind of image significance detection method based on parallel-convolution neutral net |
CN107194933A (en) * | 2017-04-24 | 2017-09-22 | 天津大学 | With reference to convolutional neural networks and the brain tumor dividing method and device of fuzzy reasoning |
CN107368831A (en) * | 2017-07-19 | 2017-11-21 | 中国人民解放军国防科学技术大学 | English words and digit recognition method in a kind of natural scene image |
CN107369160A (en) * | 2017-06-28 | 2017-11-21 | 苏州比格威医疗科技有限公司 | A kind of OCT image median nexus film new vessels partitioning algorithm |
CN107437246A (en) * | 2017-07-05 | 2017-12-05 | 浙江大学 | A kind of common conspicuousness detection method based on end-to-end full convolutional neural networks |
CN107506792A (en) * | 2017-08-16 | 2017-12-22 | 上海荷福人工智能科技(集团)有限公司 | A kind of semi-supervised notable method for checking object |
CN107730546A (en) * | 2017-08-25 | 2018-02-23 | 华北电力大学(保定) | A kind of picture depth feature determines method and system |
CN107766810A (en) * | 2017-10-10 | 2018-03-06 | 湖南省测绘科技研究所 | A kind of cloud, shadow detection method |
CN107767383A (en) * | 2017-11-01 | 2018-03-06 | 太原理工大学 | A kind of Road image segmentation method based on super-pixel |
CN107784308A (en) * | 2017-10-09 | 2018-03-09 | 哈尔滨工业大学 | Conspicuousness object detection method based on the multiple dimensioned full convolutional network of chain type |
CN107886533A (en) * | 2017-10-26 | 2018-04-06 | 深圳大学 | Vision significance detection method, device, equipment and the storage medium of stereo-picture |
CN107945109A (en) * | 2017-11-06 | 2018-04-20 | 清华大学 | Image split-joint method and device based on convolutional network |
CN107945204A (en) * | 2017-10-27 | 2018-04-20 | 西安电子科技大学 | A kind of Pixel-level portrait based on generation confrontation network scratches drawing method |
CN107967474A (en) * | 2017-11-24 | 2018-04-27 | 上海海事大学 | A kind of sea-surface target conspicuousness detection method based on convolutional neural networks |
CN108345850A (en) * | 2018-01-23 | 2018-07-31 | 哈尔滨工业大学 | The scene text detection method of the territorial classification of stroke feature transformation and deep learning based on super-pixel |
CN108389182A (en) * | 2018-01-24 | 2018-08-10 | 北京卓视智通科技有限责任公司 | A kind of picture quality detection method and device based on deep neural network |
CN108960261A (en) * | 2018-07-25 | 2018-12-07 | 扬州万方电子技术有限责任公司 | A kind of obvious object detection method based on attention mechanism |
CN108961220A (en) * | 2018-06-14 | 2018-12-07 | 上海大学 | A kind of image collaboration conspicuousness detection method based on multilayer convolution Fusion Features |
CN109086777A (en) * | 2018-07-09 | 2018-12-25 | 南京师范大学 | A kind of notable figure fining method based on global pixel characteristic |
WO2018233708A1 (en) * | 2017-06-23 | 2018-12-27 | 华为技术有限公司 | Method and device for detecting salient object in image |
CN109389056A (en) * | 2018-09-21 | 2019-02-26 | 北京航空航天大学 | A kind of track surrounding enviroment detection method of space base multi-angle of view collaboration |
CN109409222A (en) * | 2018-09-20 | 2019-03-01 | 中国地质大学(武汉) | A kind of multi-angle of view facial expression recognizing method based on mobile terminal |
CN109409435A (en) * | 2018-11-01 | 2019-03-01 | 上海大学 | A kind of depth perception conspicuousness detection method based on convolutional neural networks |
CN109934241A (en) * | 2019-03-28 | 2019-06-25 | 南开大学 | It can be integrated into Image Multiscale information extracting method and the application in neural network framework |
CN110084221A (en) * | 2019-05-08 | 2019-08-02 | 南京云智控产业技术研究院有限公司 | A kind of serializing face critical point detection method of the tape relay supervision based on deep learning |
CN110111295A (en) * | 2018-02-01 | 2019-08-09 | 北京中科奥森数据科技有限公司 | A kind of image collaboration conspicuousness detection method and device |
CN110166850A (en) * | 2019-05-30 | 2019-08-23 | 上海交通大学 | The method and system of multiple CNN neural network forecast panoramic video viewing location |
CN110222704A (en) * | 2019-06-12 | 2019-09-10 | 北京邮电大学 | A kind of Weakly supervised object detection method and device |
CN110390327A (en) * | 2019-06-25 | 2019-10-29 | 北京百度网讯科技有限公司 | Foreground extracting method, device, computer equipment and storage medium |
CN110472639A (en) * | 2019-08-05 | 2019-11-19 | 山东工商学院 | A kind of target extraction method based on conspicuousness prior information |
CN110651300A (en) * | 2017-07-14 | 2020-01-03 | 欧姆龙株式会社 | Object detection device, object detection method, and program |
CN110998611A (en) * | 2017-08-17 | 2020-04-10 | 国际商业机器公司 | Neuromorphic processing device |
CN111260653A (en) * | 2020-04-27 | 2020-06-09 | 腾讯科技(深圳)有限公司 | Image segmentation method and device, storage medium and electronic equipment |
CN111311532A (en) * | 2020-03-26 | 2020-06-19 | 深圳市商汤科技有限公司 | Image processing method and device, electronic device and storage medium |
CN111583173A (en) * | 2020-03-20 | 2020-08-25 | 北京交通大学 | RGB-D image saliency target detection method |
CN111598841A (en) * | 2020-04-23 | 2020-08-28 | 南开大学 | Example significance detection method based on regularized dense connection feature pyramid |
CN111696021A (en) * | 2020-06-10 | 2020-09-22 | 中国人民武装警察部队工程大学 | Image self-adaptive steganalysis system and method based on significance detection |
CN111915613A (en) * | 2020-08-11 | 2020-11-10 | 华侨大学 | Image instance segmentation method, device, equipment and storage medium |
CN112149459A (en) * | 2019-06-27 | 2020-12-29 | 哈尔滨工业大学(深圳) | Video salient object detection model and system based on cross attention mechanism |
CN112465700A (en) * | 2020-11-26 | 2021-03-09 | 北京航空航天大学 | Image splicing localization device and method based on deep clustering |
CN112541912A (en) * | 2020-12-23 | 2021-03-23 | 中国矿业大学 | Method and device for rapid detection of salient targets in mine sudden disaster scenes |
CN113034365A (en) * | 2021-03-19 | 2021-06-25 | 西安电子科技大学 | Multi-image stitching method and system based on superpixels |
US11164035B2 (en) | 2018-10-31 | 2021-11-02 | Abbyy Production Llc | Neural-network-based optical character recognition using specialized confidence functions |
CN114004775A (en) * | 2021-11-30 | 2022-02-01 | 四川大学 | Infrared and visible light image fusion method combining latent low-rank representation and convolutional neural networks |
CN114255351A (en) * | 2022-02-28 | 2022-03-29 | 魔门塔(苏州)科技有限公司 | Image processing method, device, medium, equipment and driving system |
CN115965844A (en) * | 2023-01-04 | 2023-04-14 | 哈尔滨工业大学 | Multi-focus image fusion method based on visual saliency prior knowledge |
- 2016-07-28: CN application CN201610604732.0A filed; granted as CN106157319B (status: Active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102222231A (en) * | 2011-05-26 | 2011-10-19 | 厦门大学 | Visual attention computational model based on guidance of dorsal pathway and processing method thereof |
CN102567731A (en) * | 2011-12-06 | 2012-07-11 | 北京航空航天大学 | Extraction method for region of interest |
Non-Patent Citations (4)
Title |
---|
XIANGQIAN WU et al.: "Offline Text-Independent Writer Identification Based on Scale Invariant Feature Transform", 《IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY》 * |
YOUBAO TANG et al.: "Offline Signature Verification based on ASIFT", 《INTERNATIONAL CONFERENCE ON BIOMETRICS (ICB)》 * |
YOUBAO TANG et al.: "Saliency Detection based on Graph-Structural Agglomerative Clustering", 《ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (ACMMM)》 * |
LI Yueyun et al.: "Saliency detection with deep convolutional neural networks", 《Journal of Image and Graphics》 * |
Cited By (90)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709532A (en) * | 2017-01-25 | 2017-05-24 | 京东方科技集团股份有限公司 | Image processing method and device |
US10395167B2 (en) | 2017-01-25 | 2019-08-27 | Boe Technology Group Co., Ltd. | Image processing method and device |
CN106709532B (en) * | 2017-01-25 | 2020-03-10 | 京东方科技集团股份有限公司 | Image processing method and device |
CN106911930A (en) * | 2017-03-03 | 2017-06-30 | 深圳市唯特视科技有限公司 | Compressed-sensing video reconstruction method based on recursive convolutional neural networks |
CN106934397B (en) * | 2017-03-13 | 2020-09-01 | 北京市商汤科技开发有限公司 | Image processing method and device and electronic equipment |
CN106934397A (en) * | 2017-03-13 | 2017-07-07 | 北京市商汤科技开发有限公司 | Image processing method, device and electronic equipment |
US10943145B2 (en) | 2017-03-13 | 2021-03-09 | Beijing Sensetime Technology Development Co., Ltd. | Image processing methods and apparatus, and electronic devices |
WO2018166438A1 (en) * | 2017-03-13 | 2018-09-20 | 北京市商汤科技开发有限公司 | Image processing method and device and electronic device |
CN107016409A (en) * | 2017-03-20 | 2017-08-04 | 华中科技大学 | Image classification method and system based on salient image regions |
CN107169954A (en) * | 2017-04-18 | 2017-09-15 | 华南理工大学 | Image saliency detection method based on parallel convolutional neural networks |
CN107194933A (en) * | 2017-04-24 | 2017-09-22 | 天津大学 | Brain tumor segmentation method and device combining convolutional neural networks and fuzzy reasoning |
WO2018233708A1 (en) * | 2017-06-23 | 2018-12-27 | 华为技术有限公司 | Method and device for detecting salient object in image |
CN109118459A (en) * | 2017-06-23 | 2019-01-01 | 南开大学 | Image salient object detection method and device |
US11430205B2 (en) | 2017-06-23 | 2022-08-30 | Huawei Technologies Co., Ltd. | Method and apparatus for detecting salient object in image |
CN107369160A (en) * | 2017-06-28 | 2017-11-21 | 苏州比格威医疗科技有限公司 | Choroidal neovascularization segmentation algorithm in OCT images |
CN107369160B (en) * | 2017-06-28 | 2020-04-03 | 苏州比格威医疗科技有限公司 | Choroid neogenesis blood vessel segmentation algorithm in OCT image |
CN107437246B (en) * | 2017-07-05 | 2020-08-18 | 浙江大学 | Common significance detection method based on end-to-end full-convolution neural network |
CN107437246A (en) * | 2017-07-05 | 2017-12-05 | 浙江大学 | Co-saliency detection method based on end-to-end fully convolutional neural networks |
CN110651300B (en) * | 2017-07-14 | 2023-09-12 | 欧姆龙株式会社 | Object detection device, object detection method, and program |
CN110651300A (en) * | 2017-07-14 | 2020-01-03 | 欧姆龙株式会社 | Object detection device, object detection method, and program |
CN107368831A (en) * | 2017-07-19 | 2017-11-21 | 中国人民解放军国防科学技术大学 | Recognition method for English words and digits in natural scene images |
CN107368831B (en) * | 2017-07-19 | 2019-08-02 | 中国人民解放军国防科学技术大学 | Recognition method for English words and digits in natural scene images |
CN107506792B (en) * | 2017-08-16 | 2020-09-29 | 广西荷福智能科技有限公司 | Semi-supervised salient object detection method |
CN107506792A (en) * | 2017-08-16 | 2017-12-22 | 上海荷福人工智能科技(集团)有限公司 | Semi-supervised salient object detection method |
CN110998611B (en) * | 2017-08-17 | 2023-12-19 | 三星电子株式会社 | Neuromorphic processing device |
CN110998611A (en) * | 2017-08-17 | 2020-04-10 | 国际商业机器公司 | Neuromorphic processing device |
CN107730546A (en) * | 2017-08-25 | 2018-02-23 | 华北电力大学(保定) | Image depth feature determination method and system |
CN107730546B (en) * | 2017-08-25 | 2020-11-03 | 华北电力大学(保定) | Image depth feature determination method and system |
CN107784308B (en) * | 2017-10-09 | 2020-04-03 | 哈尔滨工业大学 | Saliency target detection method based on chain type multi-scale full-convolution network |
CN107784308A (en) * | 2017-10-09 | 2018-03-09 | 哈尔滨工业大学 | Salient object detection method based on chained multi-scale fully convolutional networks |
CN107766810A (en) * | 2017-10-10 | 2018-03-06 | 湖南省测绘科技研究所 | Cloud and shadow detection method |
CN107766810B (en) * | 2017-10-10 | 2021-05-14 | 湖南省测绘科技研究所 | Cloud and shadow detection method |
CN107886533A (en) * | 2017-10-26 | 2018-04-06 | 深圳大学 | Visual saliency detection method, device, equipment and storage medium for stereoscopic images |
CN107945204B (en) * | 2017-10-27 | 2021-06-25 | 西安电子科技大学 | Pixel-level image matting method based on generation countermeasure network |
CN107945204A (en) * | 2017-10-27 | 2018-04-20 | 西安电子科技大学 | Pixel-level portrait matting method based on generative adversarial networks |
CN107767383A (en) * | 2017-11-01 | 2018-03-06 | 太原理工大学 | Road image segmentation method based on superpixels |
CN107767383B (en) * | 2017-11-01 | 2021-05-11 | 太原理工大学 | Road image segmentation method based on superpixels |
CN107945109B (en) * | 2017-11-06 | 2020-07-28 | 清华大学 | Image splicing method and device based on convolutional network |
CN107945109A (en) * | 2017-11-06 | 2018-04-20 | 清华大学 | Image stitching method and device based on convolutional networks |
CN107967474A (en) * | 2017-11-24 | 2018-04-27 | 上海海事大学 | Sea-surface target saliency detection method based on convolutional neural networks |
CN108345850B (en) * | 2018-01-23 | 2021-06-01 | 哈尔滨工业大学 | Scene text detection method based on region classification of stroke feature transformation and deep learning of superpixel |
CN108345850A (en) * | 2018-01-23 | 2018-07-31 | 哈尔滨工业大学 | Scene text detection method based on superpixel region classification with stroke feature transform and deep learning |
CN108389182B (en) * | 2018-01-24 | 2020-07-17 | 北京卓视智通科技有限责任公司 | Image quality detection method and device based on deep neural network |
CN108389182A (en) * | 2018-01-24 | 2018-08-10 | 北京卓视智通科技有限责任公司 | Image quality detection method and device based on deep neural networks |
CN110111295B (en) * | 2018-02-01 | 2021-06-11 | 北京中科奥森数据科技有限公司 | Image collaborative saliency detection method and device |
CN110111295A (en) * | 2018-02-01 | 2019-08-09 | 北京中科奥森数据科技有限公司 | Image collaborative saliency detection method and device |
CN108961220A (en) * | 2018-06-14 | 2018-12-07 | 上海大学 | Image collaborative saliency detection method based on multilayer convolutional feature fusion |
CN108961220B (en) * | 2018-06-14 | 2022-07-12 | 上海大学 | Image collaborative saliency detection method based on multilayer convolutional feature fusion |
CN109086777A (en) * | 2018-07-09 | 2018-12-25 | 南京师范大学 | Saliency map refinement method based on global pixel features |
CN109086777B (en) * | 2018-07-09 | 2021-09-28 | 南京师范大学 | Saliency map refinement method based on global pixel features |
CN108960261B (en) * | 2018-07-25 | 2021-09-24 | 扬州万方电子技术有限责任公司 | Salient object detection method based on attention mechanism |
CN108960261A (en) * | 2018-07-25 | 2018-12-07 | 扬州万方电子技术有限责任公司 | Salient object detection method based on attention mechanism |
CN109409222A (en) * | 2018-09-20 | 2019-03-01 | 中国地质大学(武汉) | Multi-view facial expression recognition method based on mobile terminals |
CN109389056A (en) * | 2018-09-21 | 2019-02-26 | 北京航空航天大学 | Space-based multi-view collaborative detection method for track surrounding environment |
CN109389056B (en) * | 2018-09-21 | 2020-05-26 | 北京航空航天大学 | Space-based multi-view collaborative detection method for track surrounding environment |
US11164035B2 (en) | 2018-10-31 | 2021-11-02 | Abbyy Production Llc | Neural-network-based optical character recognition using specialized confidence functions |
US11715288B2 (en) | 2018-10-31 | 2023-08-01 | Abbyy Development Inc. | Optical character recognition using specialized confidence functions |
CN109409435A (en) * | 2018-11-01 | 2019-03-01 | 上海大学 | Depth-aware saliency detection method based on convolutional neural networks |
CN109934241B (en) * | 2019-03-28 | 2022-12-09 | 南开大学 | Image multi-scale information extraction method capable of being integrated into neural network architecture |
CN109934241A (en) * | 2019-03-28 | 2019-06-25 | 南开大学 | Image multi-scale information extraction method capable of being integrated into neural network architectures, and application thereof |
CN110084221A (en) * | 2019-05-08 | 2019-08-02 | 南京云智控产业技术研究院有限公司 | Sequential facial keypoint detection method with relay supervision based on deep learning |
CN110166850B (en) * | 2019-05-30 | 2020-11-06 | 上海交通大学 | Method and system for predicting panoramic video viewing position with multiple CNN networks |
CN110166850A (en) * | 2019-05-30 | 2019-08-23 | 上海交通大学 | Method and system for predicting panoramic video viewing position with multiple CNN networks |
CN110222704A (en) * | 2019-06-12 | 2019-09-10 | 北京邮电大学 | Weakly supervised object detection method and device |
CN110390327A (en) * | 2019-06-25 | 2019-10-29 | 北京百度网讯科技有限公司 | Foreground extraction method and device, computer equipment, and storage medium |
CN110390327B (en) * | 2019-06-25 | 2022-06-28 | 北京百度网讯科技有限公司 | Foreground extraction method and device, computer equipment and storage medium |
CN112149459B (en) * | 2019-06-27 | 2023-07-25 | 哈尔滨工业大学(深圳) | Video salient object detection model and system based on cross attention mechanism |
CN112149459A (en) * | 2019-06-27 | 2020-12-29 | 哈尔滨工业大学(深圳) | Video salient object detection model and system based on cross attention mechanism |
CN110472639A (en) * | 2019-08-05 | 2019-11-19 | 山东工商学院 | Target extraction method based on saliency prior information |
CN110472639B (en) * | 2019-08-05 | 2023-04-18 | 山东工商学院 | Target extraction method based on significance prior information |
CN111583173B (en) * | 2020-03-20 | 2023-12-01 | 北京交通大学 | RGB-D image saliency target detection method |
CN111583173A (en) * | 2020-03-20 | 2020-08-25 | 北京交通大学 | RGB-D image saliency target detection method |
CN111311532A (en) * | 2020-03-26 | 2020-06-19 | 深圳市商汤科技有限公司 | Image processing method and device, electronic device and storage medium |
CN111598841A (en) * | 2020-04-23 | 2020-08-28 | 南开大学 | Instance saliency detection method based on regularized densely connected feature pyramid |
CN111598841B (en) * | 2020-04-23 | 2022-04-15 | 南开大学 | Instance saliency detection method based on regularized densely connected feature pyramid |
CN111260653A (en) * | 2020-04-27 | 2020-06-09 | 腾讯科技(深圳)有限公司 | Image segmentation method and device, storage medium and electronic equipment |
CN111696021B (en) * | 2020-06-10 | 2023-03-28 | 中国人民武装警察部队工程大学 | Adaptive image steganalysis system and method based on saliency detection |
CN111696021A (en) * | 2020-06-10 | 2020-09-22 | 中国人民武装警察部队工程大学 | Adaptive image steganalysis system and method based on saliency detection |
CN111915613A (en) * | 2020-08-11 | 2020-11-10 | 华侨大学 | Image instance segmentation method, device, equipment and storage medium |
CN111915613B (en) * | 2020-08-11 | 2023-06-13 | 华侨大学 | Image instance segmentation method, device, equipment and storage medium |
CN112465700A (en) * | 2020-11-26 | 2021-03-09 | 北京航空航天大学 | Image splicing positioning device and method based on depth clustering |
CN112465700B (en) * | 2020-11-26 | 2022-04-26 | 北京航空航天大学 | Image splicing positioning device and method based on depth clustering |
CN112541912B (en) * | 2020-12-23 | 2024-03-12 | 中国矿业大学 | Method and device for rapid detection of salient targets in mine sudden disaster scenes |
CN112541912A (en) * | 2020-12-23 | 2021-03-23 | 中国矿业大学 | Method and device for rapid detection of salient targets in mine sudden disaster scenes |
CN113034365B (en) * | 2021-03-19 | 2023-09-22 | 西安电子科技大学 | Multi-image stitching method and system based on superpixels |
CN113034365A (en) * | 2021-03-19 | 2021-06-25 | 西安电子科技大学 | Multi-image stitching method and system based on superpixels |
CN114004775A (en) * | 2021-11-30 | 2022-02-01 | 四川大学 | Infrared and visible light image fusion method combining latent low-rank representation and convolutional neural networks |
CN114255351A (en) * | 2022-02-28 | 2022-03-29 | 魔门塔(苏州)科技有限公司 | Image processing method, device, medium, equipment and driving system |
CN115965844B (en) * | 2023-01-04 | 2023-08-18 | 哈尔滨工业大学 | Multi-focus image fusion method based on visual saliency prior knowledge |
CN115965844A (en) * | 2023-01-04 | 2023-04-14 | 哈尔滨工业大学 | Multi-focus image fusion method based on visual saliency prior knowledge |
Also Published As
Publication number | Publication date |
---|---|
CN106157319B (en) | 2018-11-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106157319A (en) | Saliency detection method fusing region-level and pixel-level information based on convolutional neural networks | |
CN110321813B (en) | Cross-domain pedestrian re-identification method based on pedestrian segmentation | |
CN107563422B (en) | Polarimetric SAR classification method based on semi-supervised convolutional neural networks | |
CN108090443B (en) | Scene text detection method and system based on deep reinforcement learning | |
CN105551036B (en) | Training method and device for deep learning networks | |
CN110210539B (en) | RGB-T image saliency target detection method based on multi-level depth feature fusion | |
CN105205448B (en) | Text recognition model training method and recognition method based on deep learning | |
CN105139395B (en) | SAR image segmentation method based on wavelet-pooling convolutional neural networks | |
CN109902806A (en) | Noisy-image object bounding box determination method based on convolutional neural networks | |
CN107480726A (en) | Scene semantic segmentation method based on fully convolutional networks and long short-term memory units | |
CN107403430A (en) | RGB-D image semantic segmentation method | |
CN106803071A (en) | Object detection method and device in images | |
CN106920243A (en) | Sequence image segmentation method for ceramic material parts using improved fully convolutional neural networks | |
CN106250931A (en) | High-resolution image scene classification method based on random convolutional neural networks | |
CN107203781A (en) | End-to-end weakly supervised object detection method guided by saliency | |
CN108447080A (en) | Target tracking method, system, and storage medium based on hierarchical data association and convolutional neural networks | |
CN108921879A (en) | Moving target tracking method and system based on region selection with CNN and Kalman filtering | |
CN104182772A (en) | Gesture recognition method based on deep learning | |
CN107247952B (en) | Visual saliency detection method based on deeply supervised recurrent convolutional neural networks | |
CN105787557A (en) | Design method of deep neural network structures for intelligent computer recognition | |
CN108596240B (en) | Image semantic segmentation method based on discriminative feature networks | |
CN107633226A (en) | Human action tracking and recognition method and system | |
CN111968127B (en) | Cancer lesion region identification method and system based on whole-slide pathological images | |
CN110334589A (en) | Action recognition method using high-temporal-resolution 3D neural networks based on dilated convolution | |
CN109858487A (en) | Weakly supervised semantic segmentation method based on watershed algorithm and image category labels | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |