CN107680106A - Saliency object detection method based on Faster R-CNN - Google Patents

Saliency object detection method based on Faster R-CNN (Download PDF)

Info

Publication number
CN107680106A
CN107680106A (application CN201710974083.8A)
Authority
CN
China
Prior art keywords
cnn
faster
saliency
pixel
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710974083.8A
Other languages
Chinese (zh)
Inventor
王超
李静
刘铭坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN201710974083.8A priority Critical patent/CN107680106A/en
Publication of CN107680106A publication Critical patent/CN107680106A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/136 Segmentation; Edge detection involving thresholding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/187 Segmentation; Edge detection involving region growing; involving region merging; involving connected component labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image

Abstract

The invention discloses a saliency object detection method based on Faster R-CNN. The method first performs multi-scale segmentation of the image, then uses Faster R-CNN to outline possible salient objects and build an objectness map; foreground weights are then assigned to the superpixels through foreground connectivity; a saliency optimization framework combining the foreground and background weights yields a smooth saliency map; finally, multi-layer cellular automata fusion produces the final saliency map. The input image is segmented at three scales with a superpixel segmentation algorithm, which groups adjacent similar pixels into image regions of varying size according to low-level features such as color, texture and brightness, effectively reducing the complexity of saliency detection. The segmentation at each scale is treated as one layer of cells, and the superpixel segmentation maps at the different scales are fused with multi-layer cellular automata, which ensures the consistency of the image saliency detection result.

Description

A saliency object detection method based on Faster R-CNN
Technical field
The invention belongs to the field of image saliency detection in computer vision, and in particular relates to a deep-learning saliency detection method for specific object classes.
Background technology
In recent years, computer, Internet and multimedia technology have developed rapidly, and people are exposed to large amounts of image and video information every day in work and life. Because images and videos carry rich and intuitive content, they are an effective channel through which people receive information and one of its important sources, for example in online video review, video chat, machine-part inspection, live streaming and intelligent surveillance. Digital images and videos are growing exponentially, and processing or analyzing them purely by hand has significant limitations. Computer vision, an interdisciplinary field combining intelligent information processing and digital analysis, simulates the human visual system to perceive and process images, obtaining results the same as or similar to manual processing, so that a computer can analyze and understand the true content expressed by an image as a person would. Research shows that the information people attend to tends to concentrate on visually salient targets. When the number of images to be processed is huge, processing only the salient targets in the input images improves processing efficiency without changing the processing method.
The low-level features a vision system can identify include color, edges, texture, and so on. Based on these features, local or global contrast is computed to measure the difference between regions and determine the salient parts of an image. Itti et al., building on the Koch-Ullman (KU) visual attention model, proposed a classical saliency computation model that has become a main baseline for existing algorithms. Following characteristics of human vision, the algorithm computes center-surround contrast of low-level features on a multi-scale image to obtain the corresponding saliency maps, and fuses them into a final saliency map. Harel et al. proposed the graph-based saliency detection algorithm GBVS, which combines biological vision principles with mathematical computation: a Markov chain is introduced into the saliency-map generation process of the Itti model, and its equilibrium distribution, computed purely mathematically, gives the saliency map. Cheng et al. use a Gaussian mixture model to cluster pixels with similar color features into image regions, consider each region's color contrast and spatial distribution, and generate the saliency map with a probabilistic model. Yang et al. divide the image into multi-scale layers, compute the contrast of color and spatial features for each layer, and fuse the per-layer saliency maps into the final one. This approach preserves the consistency and integrity of salient objects, but when a salient object is small it may be merged into the background as if it were background. Saliency detection methods built on different low-level features are often effective only for a particular class of images and cannot handle multi-object images in complex scenes. Low-level features driven by visual stimuli lack an understanding of what makes a target salient and cannot represent salient-object characteristics at a deeper level. Noise objects that resemble the target in low-level features but belong to a different class are often falsely detected as salient. In recent years, methods that automatically learn deep (high-level) features through deep learning have begun to be applied to image saliency detection. Li et al. learn local and global deep features of image superpixel regions with a deep convolutional neural network (MDF) for saliency detection; the detection results are clearly better than traditional methods, but the method runs slowly. Lee et al. extract superpixel-block and edge features, feed them into a convolutional neural network to learn a saliency confidence map, and perform saliency detection with an energy-minimizing conditional random field. It works well for single salient objects, but its feature selection makes it unsuitable for multi-object images. Pan et al. apply gradient-descent-based pixel-level processing to the original image, then extract superpixel-level deep features and compute saliency values, which speeds up the algorithm; but gradient descent handles sparse matrices poorly, making the overall detection inaccurate. Hu et al. obtain local and global features by combining a convolutional neural network with region-validation priors. The detection results are good, but the highly complex model hurts the algorithm's runtime efficiency.
The content of the invention
The present invention is a saliency object detection method based on Faster R-CNN, intended to solve the problem that existing saliency detection models use low-level features and cannot extract the deep semantic features of an image, which leaves the saliency detection results unsatisfactory.
The invention provides a saliency detection method based on deep features, comprising the following steps:
Step 1: multi-scale superpixel segmentation. The input image is segmented at three scales with the SLIC superpixel segmentation algorithm, which groups adjacent similar pixels into image regions of varying size according to low-level features such as color, texture and brightness, effectively reducing the complexity of the saliency computation. SLIC is a gradient-descent algorithm based on color and distance similarity; the number of segments is controllable, and it yields fairly regular, uniformly sized segments. Both the number of segmentation scales and the number of superpixels at each scale affect detection quality and running speed: too many segments increase computational complexity and risk over-segmentation, while too few superpixels reduce the accuracy of the detection. Since different segmentation scales affect the quality and speed of subsequent detection, the invention empirically limits the number of pixels per superpixel unit to between 100 and 500, with the scales incremented by 100 pixels.
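As a rough illustration of the scale setup above, the target pixels-per-superpixel count at each scale can be converted into the segment count that an implementation such as scikit-image's `slic(img, n_segments=...)` expects. The helper name and the choice of three scale values are illustrative, not from the patent:

```python
def slic_scale_plan(height, width, pixel_counts=(100, 200, 300)):
    """For each scale, convert a target pixels-per-superpixel count
    into the number of segments SLIC should produce."""
    total = height * width
    return [max(1, total // pps) for pps in pixel_counts]

# e.g. a 300x400 image at the three coarsest scales
plan = slic_scale_plan(300, 400)  # one n_segments value per scale
```

Each entry of `plan` would then be passed as `n_segments` in a separate SLIC run, giving one segmentation map per scale.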
Step 2: obtain the objectness map. A Faster R-CNN neural network is trained for specific-class object detection, so that the specific-class targets in the image are detected; its properties are then used to extract likely targets, and the objectness scores of the obtained targets are extracted. The objectness map is then generated. Faster R-CNN's detection rate is very high, but in extreme cases it may miss the specific-class target; when Faster R-CNN detects no target, the entire image is treated as the target.
The objectness score of a target tells us how likely its window is to contain an object. From the candidate targets a pixel-level objectness score is obtained and taken as the probability that the target is salient. The pixel-level objectness score is defined in formula (1), where s_i is the objectness score of the i-th candidate target containing pixel p, G_i is a Gaussian window, and x, y are the coordinates of pixel p.
The superpixel-level objectness score is then the sum of the pixel-level objectness scores over the superpixel region; it is defined in formula (2), where p_i is a pixel belonging to superpixel region R. Superpixel regions are obtained with SLIC, and multiple segmentation scales are used for detection. A suitable threshold is then set to binarize the objectness map; the threshold is set to 1.5 times the mean value of the objectness map, i.e. its sum divided by the total number of pixels.
Objectness(R) = Σ_{i∈R} PixObj(p_i)   (2)
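The superpixel aggregation of formula (2) and the 1.5x-mean thresholding described above can be sketched as follows; the function names are invented for illustration, and the exact thresholding rule reflects our reading of the text:

```python
import numpy as np

def superpixel_objectness(pix_obj, labels):
    """Formula (2): sum the pixel-level objectness over each superpixel."""
    n = labels.max() + 1
    return np.bincount(labels.ravel(), weights=pix_obj.ravel(), minlength=n)

def foreground_mask(pix_obj, labels):
    """Coarse foreground estimate: a superpixel is foreground if its mean
    objectness exceeds 1.5x the mean pixel objectness of the whole map."""
    obj = superpixel_objectness(pix_obj, labels)
    size = np.bincount(labels.ravel(), minlength=obj.size)
    mean_per_sp = obj / np.maximum(size, 1)
    thresh = 1.5 * pix_obj.mean()
    return mean_per_sp > thresh

# toy 2x2 objectness map split into two superpixels (top row / bottom row)
pix_obj = np.array([[0.9, 0.8], [0.1, 0.0]])
labels = np.array([[0, 0], [1, 1]])
scores = superpixel_objectness(pix_obj, labels)  # per-superpixel sums
fg = foreground_mask(pix_obj, labels)            # coarse foreground estimate
```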
Step 3: compute foreground connectivity. The thresholded objectness map is a coarse estimate of part of the foreground superpixels. We use the "foreground connectivity" proposed by Srivatsa and Babu, which assigns saliency values to the estimated foreground according to superpixel connectivity. A graph is built with superpixels as nodes; there is an edge between adjacent superpixel nodes, whose weight is defined as the Euclidean distance between the mean Lab colors of the two nodes. The foreground connectivity of superpixel R is defined in formula (3), where d(R, R_k) is the shortest distance from R to R_k, δ(·) is assigned 1 if the objectness map estimates the superpixel as foreground, and N is the number of superpixels.
The more similar the superpixels estimated as foreground, the lower the numerator and the higher the denominator, so the FG value decreases, implying higher connectivity; the reciprocal of FG is therefore taken as the foreground weight.
Step 4: saliency optimization. An existing optimization framework directly combines our foreground weights with the background weights. The cost function to be minimized is shown in formula (4), where t_i is the final value assigned to p_i after cost minimization, w_i^fg is the foreground weight associated with superpixel p_i, and w_i^bg is the background weight associated with superpixel p_i: the higher w_i^fg, the closer p_i tends to 1; the higher w_i^bg, the closer p_i tends to 0; w_ij is a smoothness coefficient.
Step 5: multi-layer cellular automata fusion. The multi-scale saliency maps obtained by the steps above are not fully consistent because the segmentation scales differ. Moreover, because saliency is computed per superpixel, the saliency values of a salient object are blocky and discontinuous, so inconsistencies remain after fusion with ordinary methods. To make the final saliency map as consistent as possible, multi-layer cellular automata are used for consistency optimization.
Brief description of the drawings
Fig. 1 is a flow chart of the saliency detection method based on deep features;
Fig. 2 is a schematic diagram of the CNN structure;
Fig. 3 is the deep-feature extraction architecture diagram of the convolutional neural network;
Fig. 4 shows the output feature maps of convolutional layer C1;
Fig. 5 is the Fast R-CNN framework diagram;
Fig. 6 shows the PR curves and MAE bar charts of different algorithms on the specific-class dataset;
Fig. 7 shows the PR curves and MAE bar charts of different algorithms on the HKU-IS dataset;
Fig. 8 shows the PR curves and MAE bar charts of different algorithms on the MSRA-1000 dataset.
Embodiment
To make the technical means, creative features, objectives and effects of the present invention easy to understand, the invention is further explained below with reference to specific embodiments.
The present invention first performs multi-scale segmentation of the image, then uses Faster R-CNN to outline possible salient targets and build an objectness map; foreground weights are assigned to the superpixels through foreground connectivity; a saliency optimization framework combining the foreground and background weights yields a smooth saliency map; finally, multi-layer cellular automata fusion produces the final saliency map. The method specifically comprises the following steps:
Step 1: multi-scale superpixel segmentation. The input image is segmented at three scales with the SLIC superpixel segmentation algorithm, which groups adjacent similar pixels into image regions of varying size according to low-level features such as color, texture and brightness, effectively reducing the complexity of the saliency computation. SLIC is a gradient-descent algorithm based on color and distance similarity; the number of segments is controllable, and it yields fairly regular, uniformly sized segments. Both the number of segmentation scales and the number of superpixels at each scale affect detection quality and running speed: too many segments increase computational complexity and risk over-segmentation, while too few superpixels reduce the accuracy of the detection. Since different segmentation scales affect the quality and speed of subsequent detection, the invention empirically limits the number of pixels per superpixel unit to between 100 and 500, with the scales incremented by 100 pixels. For the input image, a grayscale map can be used to reduce complexity throughout, or an RGB color image; with an RGB color image there are three inputs, the R, G and B components.
The input image usually needs to be normalized. It is processed by alternating convolutional (C) and down-sampling (S) layers: each convolutional layer is obtained by convolving the previous layer's output with this layer's weights W, and each down-sampling layer by down-sampling; their outputs are called feature maps. The feature maps of the last layer are rasterized (X): each pixel is taken out in turn and lined up into a vector. A multilayer perceptron (N&O) performs the final processing; the classifier trained on the features typically uses Softmax, and logistic regression can be used for binary classification.
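The convolution, down-sampling, rasterization and softmax stages described above can be sketched with a toy NumPy forward pass; this is a minimal illustration with random weights, not the network actually used by the patent:

```python
import numpy as np

def conv2d_valid(x, w):
    """Single-channel 'valid' convolution (cross-correlation, as in most CNNs)."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def maxpool2(x):
    """2x2 down-sampling layer (S)."""
    H, W = x.shape
    return x[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
img = rng.random((8, 8))                           # normalized grayscale input
feat = maxpool2(np.maximum(conv2d_valid(img, rng.random((3, 3))), 0))  # C then S
vec = feat.ravel()                                 # rasterization (X)
probs = softmax(rng.random((2, vec.size)) @ vec)   # 2-class Softmax head
```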
Step 2: obtain the objectness map. Faster R-CNN (the faster region-based convolutional neural network) improves on Fast R-CNN (the fast region-based convolutional neural network), which in turn merges R-CNN (the region-based convolutional neural network) with SPPNet (the spatial pyramid pooling network). All four are convolutional-neural-network object detection methods in image processing; their order of development is R-CNN, SPPNet, Fast R-CNN, Faster R-CNN. The core idea of R-CNN is to replace traditional sliding-window detection with candidate region boxes, extract features from each candidate region with a CNN, and then use an independently trained classifier to predict the confidence that the region contains an object of interest, turning detection into an image classification problem. This solves CNN localization, but because every candidate box must be processed there is a large amount of repeated computation. The R-CNN framework is shown in Fig. 5.
The core idea of SPPNet is to use a spatial pyramid pooling (SPP) layer to remove the network's fixed-size constraint: the spatial pyramid pooling layer pools features and produces an output of fixed size. The fully connected layers of a CNN fix the input image size, while the convolutional part does not need a fixed size, so the spatial pyramid pooling layer is placed after the last convolutional layer and before the first fully connected layer. This strategy is nearly 100 times faster than R-CNN object detection.
Fast R-CNN borrows SPPNet's idea and proposes the region-of-interest (RoI) layer, which can be regarded as an SPPNet with a single pyramid level: RoI pooling maps inputs of different sizes to feature vectors of fixed length, after which a softmax classifier trained on the features performs class recognition and a window-regression algorithm performs localization. Fast R-CNN combines the advantages of SPPNet and R-CNN and greatly reduces the time needed for object detection.
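The RoI layer described above (a single-level SPP) can be illustrated with a small NumPy sketch: boxes of different sizes on the feature map are max-pooled into the same fixed-size grid, hence a fixed-length vector. This is a simplified stand-in for the real layer (single channel, integer bin edges, no spatial-scale handling):

```python
import numpy as np

def roi_pool(feat, box, out=(2, 2)):
    """Single-level SPP / RoI pooling: max-pool an arbitrary box on the
    feature map into a fixed out[0] x out[1] grid."""
    x0, y0, x1, y1 = box
    region = feat[y0:y1, x0:x1]
    h, w = region.shape
    ys = np.linspace(0, h, out[0] + 1).astype(int)   # bin edges along rows
    xs = np.linspace(0, w, out[1] + 1).astype(int)   # bin edges along cols
    pooled = np.empty(out)
    for i in range(out[0]):
        for j in range(out[1]):
            pooled[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return pooled

feat = np.arange(36, dtype=float).reshape(6, 6)
v1 = roi_pool(feat, (0, 0, 4, 4)).ravel()  # 4x4 box -> length-4 vector
v2 = roi_pool(feat, (1, 1, 6, 6)).ravel()  # 5x5 box -> same-length vector
```

Both vectors have the same length regardless of box size, which is exactly what lets the fully connected layers accept arbitrary proposals.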
By training the Faster R-CNN neural network for specific-class object detection, the specific-class targets in the image are detected; its properties are then used to extract likely targets, and the objectness scores of the obtained targets are extracted. The objectness map is then generated. Faster R-CNN's detection rate is very high, but in extreme cases it may miss the specific-class target; when Faster R-CNN detects no target, the entire image is treated as the target.
The objectness score of a target tells us how likely its window is to contain an object. From the candidate targets a pixel-level objectness score is obtained; this score gives the probability that the pixel is part of a target. The pixel-level objectness score is defined in formula (1), where s_i is the objectness score of the i-th candidate target containing pixel p, G_i is a Gaussian window, and x, y are the coordinates of pixel p.
The superpixel-level objectness score is then the sum of the pixel-level objectness scores over the superpixel region; it is defined in formula (2), where p_i is a pixel belonging to superpixel region R. Superpixel regions are obtained with SLIC, and multiple segmentation scales are used for detection. A suitable threshold is then set to binarize the objectness map; the threshold is set to 1.5 times the mean value of the objectness map, i.e. its sum divided by the total number of pixels.
Objectness(R) = Σ_{i∈R} PixObj(p_i)   (2)
Step 3: compute foreground connectivity. The thresholded objectness map is a coarse estimate of part of the foreground superpixels. We use the "foreground connectivity" proposed by Srivatsa and Babu, which assigns saliency values to the estimated foreground according to superpixel connectivity. A graph is built with superpixels as nodes; there is an edge between adjacent superpixel nodes, whose weight is defined as the Euclidean distance between the mean Lab colors of the two nodes. The foreground connectivity of superpixel R is defined in formula (3), where d(R, R_k) is the shortest distance from R to R_k, δ(·) is assigned 1 if the objectness map estimates the superpixel as foreground, and N is the number of superpixels.
The more similar the superpixels estimated as foreground, the lower the numerator and the higher the denominator, so the FG value decreases, implying higher connectivity; the reciprocal of FG is therefore taken as the foreground weight.
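A minimal sketch of the foreground-connectivity idea, under the assumption that FG(R) averages shortest-path (geodesic) color distances from R to the superpixels estimated as foreground and that the foreground weight is its reciprocal; formula (3) itself is not reproduced in this text, so the exact form here is our reading:

```python
import numpy as np

def geodesic_distances(adj_w):
    """All-pairs shortest paths (Floyd-Warshall) on the superpixel graph;
    edge weights are Euclidean distances between mean Lab colors."""
    d = adj_w.copy()
    n = d.shape[0]
    for k in range(n):
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d

def foreground_weight(adj_w, fg_mask):
    """1 / FG(R): a low average distance to the estimated foreground means
    high connectivity, hence a high foreground weight."""
    d = geodesic_distances(adj_w)
    fg = d[:, fg_mask].sum(axis=1) / max(fg_mask.sum(), 1)
    return 1.0 / (fg + 1e-6)

INF = 1e9
# 4 superpixels forming a chain 0-1-2, with 3 loosely attached to 2
adj = np.array([[0.0, 1.0, INF, INF],
                [1.0, 0.0, 1.0, INF],
                [INF, 1.0, 0.0, 5.0],
                [INF, INF, 5.0, 0.0]])
fg_mask = np.array([True, True, False, False])  # objectness marks 0 and 1 as foreground
w_fg = foreground_weight(adj, fg_mask)
```

Superpixels 0 and 1 sit in the estimated foreground and get high weights; the distant superpixel 3 gets a low one.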
Step 4: saliency optimization. An existing optimization framework directly combines our foreground weights with the background weights. The cost function to be minimized is shown in formula (4), where t_i is the final value assigned to p_i after cost minimization, w_i^fg is the foreground weight associated with superpixel p_i, and w_i^bg is the background weight associated with superpixel p_i: the higher w_i^fg, the closer p_i tends to 1; the higher w_i^bg, the closer p_i tends to 0; w_ij is a smoothness coefficient.
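The quadratic cost of formula (4) has a closed-form minimizer: setting its gradient to zero yields a linear system. The sketch below assumes the cost is the usual sum of a background term w_bg[i]*t_i^2, a foreground term w_fg[i]*(t_i-1)^2 and a pairwise smoothness term W[i,j]*(t_i-t_j)^2 counted once per pair; the exact weighting in the patent's formula (4) is not reproduced here:

```python
import numpy as np

def optimize_saliency(w_fg, w_bg, W):
    """Minimize  sum_i w_bg[i]*t_i^2 + sum_i w_fg[i]*(t_i-1)^2
              + sum_{i<j} W[i,j]*(t_i-t_j)^2
    via the linear system from the zero-gradient condition."""
    L = np.diag(W.sum(axis=1)) - W        # graph Laplacian (smoothness)
    A = np.diag(w_fg + w_bg) + L
    return np.linalg.solve(A, w_fg)

# toy chain of 3 superpixels: 0 looks foreground, 2 looks background
w_fg = np.array([1.0, 0.0, 0.0])
w_bg = np.array([0.0, 0.0, 1.0])
W = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0]])
t = optimize_saliency(w_fg, w_bg, W)  # smooth saliency decreasing along the chain
```

The middle superpixel, pulled equally from both sides, settles at 0.5; all values stay in (0, 1), giving the smooth map the text describes.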
Step 5: multi-layer cellular automata fusion. Saliency-map and feature fusion methods currently in use include linear fusion, pixel-wise dot-product fusion, conditional-random-field fusion and cellular-automata fusion. The multi-layer cellular automata (MCA) fusion proposed by Qin et al. works comparatively well: each saliency map produced by a different algorithm is treated as one layer of the cellular automaton, combining the strengths of the different algorithms; an update mechanism repeatedly updates the saliency values, finally giving the fused saliency map.
In a multi-layer cellular automaton, every pixel of a saliency map is a cell. In an N-layer automaton, a cell in any one saliency map has N−1 neighbors, namely the cells at the same position in the other saliency maps. The saliency value of pixel i gives its probability of being foreground F, P(i ∈ F) = S_i, and its probability of being background B, P(i ∈ B) = 1 − S_i. An adaptive threshold is extracted for each map; the binarization threshold of the m-th saliency map is denoted γ_m and used to binarize it. If the saliency value of pixel i satisfies S_i ≥ γ_m, the pixel is marked as foreground, denoted η_i = +1; otherwise η_i = −1, marking pixel i as background.
If pixel i is marked as foreground, the probability that its neighbor j at the same position in another saliency map is also marked foreground is λ = P(η_j = +1 | i ∈ F). Likewise, μ = P(η_j = −1 | i ∈ B) denotes the probability that j is background when pixel i is marked background. Assuming λ and μ are a pair of equal constants, the posterior probability P(i ∈ F | η_j = +1) can be expressed by formula (5):
P(i ∈ F | η_j = +1) ∝ P(i ∈ F) · P(η_j = +1 | i ∈ F) = S_i · λ   (5)
The prior probability ratio is defined as Λ(i ∈ F) and computed by formula (6):
Λ(i ∈ F) = P(i ∈ F) / P(i ∈ B) = S_i / (1 − S_i)   (6)
The posterior probability ratio Λ(i ∈ F | η_j = +1) is then given by formula (7):
Λ(i ∈ F | η_j = +1) = P(i ∈ F | η_j = +1) / P(i ∈ B | η_j = +1) = (S_i · λ) / ((1 − S_i) · (1 − μ))   (7)
For ease of calculation, the logarithm l = ln(Λ) of the above is taken, as shown in formula (8):
l(i ∈ F | η_j = +1) = ln(S_i / (1 − S_i)) + ln(λ / (1 − λ))   (8)
For ease of expressing the update mechanism of the cellular automaton, the prior and posterior probability ratios are written in logarithmic form as in formula (9):
l(S_t^m) = ln(S_t^m / (1 − S_t^m))   (9)
where S_t^m(i) denotes the saliency value of pixel i at time t; the synchronous update mechanism f : S^{M−1} → S is defined by formula (10):
l(S_{t+1}^m) = l(S_t^m) + Σ_{n=1, n≠m}^{M} sign(S_t^n − γ_n · 1) · ln(λ / (1 − λ))   (10)
where S_t^m denotes the saliency values, at time t, of all cells of the m-th saliency map, and 1 is the vector [1, …, 1]^T with N elements. If a pixel's neighbors are judged to be foreground, its own saliency value should increase accordingly, i.e. ln(λ / (1 − λ)) > 0, hence λ > 0.5, with λ set empirically. After N₂ updates, the final saliency map can be obtained by formula (11):
S_final = (1/M) · Σ_{m=1}^{M} S_{N₂}^m   (11)
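The update mechanism of formulas (9)-(10) and the final averaging of (11) can be sketched as below; using the mean of each map as its adaptive threshold γ_m, and λ = 0.6, are illustrative assumptions (the patent only requires λ > 0.5 and does not fix the threshold rule here):

```python
import numpy as np

def logit(s, eps=1e-6):
    """Formula (9): l(S) = ln(S / (1 - S)), clipped for stability."""
    s = np.clip(s, eps, 1 - eps)
    return np.log(s / (1 - s))

def mca_fuse(maps, lam=0.6, steps=10):
    """Multi-layer cellular automata fusion: each map's logit is nudged by
    the sign of (neighbor map - its threshold), per formula (10); the
    updated maps are averaged at the end, per formula (11)."""
    S = [m.astype(float).copy() for m in maps]
    delta = np.log(lam / (1 - lam))          # ln(lambda / (1 - lambda)) > 0
    for _ in range(steps):
        gammas = [m.mean() for m in S]       # adaptive thresholds gamma_m
        new = []
        for m in range(len(S)):
            l = logit(S[m])
            for n in range(len(S)):
                if n != m:
                    l = l + np.sign(S[n] - gammas[n]) * delta
            new.append(1 / (1 + np.exp(-l)))  # back from logit to saliency
        S = new                               # synchronous update
    return sum(S) / len(S)

# two toy 2x2 saliency maps that roughly agree on the left column
a = np.array([[0.9, 0.2], [0.8, 0.1]])
b = np.array([[0.7, 0.3], [0.9, 0.2]])
fused = mca_fuse([a, b])
```

Where the layers agree, the updates reinforce each other, so the fused map is pushed toward a consistent, high-contrast result.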
Because the image is multi-scale segmented at the start of salient-object detection, the segmentation scales are chosen so as to guarantee both performance and quality; five scales were finally selected, with each superpixel unit containing from 100 to 500 pixels, incremented by 100 pixels per scale. Saliency optimization then yields saliency maps at the five scales, which are made consistent using multi-layer cellular automata. The multi-scale saliency maps obtained by the steps above are not fully consistent because the segmentation scales differ. Moreover, because saliency is computed per superpixel, the saliency values of a salient object are blocky and discontinuous, so inconsistencies remain after fusion with ordinary methods. To make the final saliency map as consistent as possible, multi-layer cellular automata are used for consistency fusion and optimization.

Claims (7)

1. A saliency object detection algorithm based on Faster R-CNN, characterized by comprising the following steps: I. performing Faster R-CNN object detection on the input image, a grayscale digital image comprising the gray value of each pixel in matrix form, to obtain candidate target region boxes; II. segmenting the input image at three scales with a superpixel segmentation algorithm, which groups adjacent similar pixels into image regions of varying size according to low-level features such as color, texture and brightness, effectively reducing the complexity of the saliency computation; III. training the Faster R-CNN neural network for specific-class object detection, detecting the specific-class targets in the image, using its properties to extract likely targets together with their objectness scores, and generating the objectness map; IV. treating the segmentation map at each scale as one layer of cells, and fusing the superpixel segmentation maps of the different scales with multi-layer cellular automata, the fused saliency map being the final saliency map.
2. The saliency object detection algorithm for specific classes based on Faster R-CNN according to claim 1, characterized in that the salient objects in the superpixel-segmented image are roughly selected using Faster R-CNN, reducing the complexity of the subsequent saliency detection, and in that the uniformly sized segmentation results obtained by multi-scale superpixel segmentation ensure the consistency of the image saliency detection.
3. The saliency object detection algorithm for specific classes based on Faster R-CNN according to claim 1, characterized in that Faster R-CNN object detection is performed on the input image, a grayscale digital image comprising the gray value of each pixel in matrix form: Fast R-CNN first starts computing candidate region windows while, at the same time, the RPN computes candidate regions from the feature maps output by the convolutional layers; after both computations finish, the RPN passes to Fast R-CNN, according to a user-set expected threshold, the candidate region windows it should select, obtaining the rough targets of the salient regions of the image.
4. A saliency object detection algorithm for specific classes based on Faster R-CNN, characterized in that Faster R-CNN is applied to specific classes while methods such as multi-scale segmentation and cellular automata are applied to optimize the resulting saliency map, with the following features:
1) the algorithm is the first to apply Faster R-CNN to saliency detection for specific classes (aircraft, ships, cats, etc.);
2) the algorithm is the first to apply multi-scale segmentation with Faster R-CNN;
3) the algorithm is the first to obtain an objectness map with Faster R-CNN;
4) the algorithm is the first to apply multi-layer cellular automata to Faster R-CNN.
5. The saliency object detection algorithm for specific classes based on Faster R-CNN according to claim 4, characterized in being the first to apply Faster R-CNN to saliency detection for specific classes (aircraft, ships, cats, etc.). In recent years many saliency algorithms have appeared, including background-prior and contrast-prior detection models, but saliency detection models based on low-level features extract limited features and cannot make full use of the image. This algorithm extracts the deep features of the image through Faster R-CNN.
6. The salient object detection algorithm for specific classes based on Faster R-CNN according to claim 4, characterized in that a multi-scale segmentation method is applied together with Faster R-CNN. Two superpixel segmentation algorithms are in common use: SLIC and watershed segmentation. Multiple multi-scale segmentation images are used; Faster R-CNN then unifies candidate region generation, feature extraction, object classification, and bounding-box refinement, with the region proposal network (RPN) sharing convolutional feature computation with Fast R-CNN. The input image is fed into a classification network commonly pre-trained on ImageNet; Fast R-CNN then begins computing the candidate region windows while, at the same time, the RPN computes candidate regions from the feature map output by the convolutional layers. When both computations finish, the RPN tells Fast R-CNN which candidate region windows to select according to the threshold expected by the user. The trained Faster R-CNN performs object detection on the input image, yielding candidate target region boxes, which are extracted and stored in preparation for the saliency detection of the next step.
7. The salient object detection algorithm for specific classes based on Faster R-CNN according to claim 4, characterized in that Faster R-CNN, trained for specific-class object detection, is used to detect the specific-class targets in the image; its features are then used to extract the possible targets, and finally the optimized target windows are produced, extracting the positions of the possible targets together with their objectness scores. The object-likelihood map is then generated.
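One way to turn the optimized target windows and objectness scores of claim 7 into a pixel-level object-likelihood map is to let each window vote with its score. This is a hedged sketch; `objectness_map` and the max-normalization are assumptions rather than the patent's exact construction.

```python
import numpy as np

def objectness_map(shape, boxes, scores):
    """Accumulate score-weighted detection windows into a pixel-level
    object-likelihood map. boxes: (n, 4) as [x1, y1, x2, y2] in pixels."""
    h, w = shape
    obj = np.zeros((h, w))
    for (x1, y1, x2, y2), s in zip(boxes.astype(int), scores):
        obj[y1:y2, x1:x2] += s        # each window votes with its score
    return obj / (obj.max() + 1e-12)  # normalise to [0, 1]
```

Pixels covered by several confident windows end up near 1, pixels outside every window stay at 0, giving the coarse object-likelihood map that the subsequent saliency refinement starts from.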
CN201710974083.8A 2017-10-13 2017-10-13 A kind of conspicuousness object detection method based on Faster R CNN Pending CN107680106A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710974083.8A CN107680106A (en) 2017-10-13 2017-10-13 A kind of conspicuousness object detection method based on Faster R CNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710974083.8A CN107680106A (en) 2017-10-13 2017-10-13 A kind of conspicuousness object detection method based on Faster R CNN

Publications (1)

Publication Number Publication Date
CN107680106A true CN107680106A (en) 2018-02-09

Family

ID=61140621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710974083.8A Pending CN107680106A (en) 2017-10-13 2017-10-13 A kind of conspicuousness object detection method based on Faster R CNN

Country Status (1)

Country Link
CN (1) CN107680106A (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108648192A (en) * 2018-05-17 2018-10-12 杭州依图医疗技术有限公司 A kind of method and device of detection tubercle
CN108776815A (en) * 2018-06-04 2018-11-09 国网辽宁省电力有限公司信息通信分公司 A kind of conspicuousness object detection method based on depth characteristic
CN109101897A (en) * 2018-07-20 2018-12-28 中国科学院自动化研究所 Object detection method, system and the relevant device of underwater robot
CN109242032A (en) * 2018-09-21 2019-01-18 桂林电子科技大学 A kind of object detection method based on deep learning
CN109598241A (en) * 2018-12-05 2019-04-09 武汉大学 Satellite image marine vessel recognition methods based on Faster R-CNN
CN109902693A (en) * 2019-02-16 2019-06-18 太原理工大学 One kind being based on more attention spatial pyramid characteristic image recognition methods
CN109902806A (en) * 2019-02-26 2019-06-18 清华大学 Method is determined based on the noise image object boundary frame of convolutional neural networks
CN110084247A (en) * 2019-04-17 2019-08-02 上海师范大学 A kind of multiple dimensioned conspicuousness detection method and device based on fuzzy characteristics
CN110211135A (en) * 2019-06-05 2019-09-06 广东工业大学 A kind of diatom image partition method, device and equipment towards complex background interference
CN110473212A (en) * 2019-08-15 2019-11-19 广东工业大学 A kind of Electronic Speculum diatom image partition method and device merging conspicuousness and super-pixel
CN110472639A (en) * 2019-08-05 2019-11-19 山东工商学院 A kind of target extraction method based on conspicuousness prior information
CN110751155A (en) * 2019-10-14 2020-02-04 西北工业大学 Novel target detection method based on Faster R-CNN
CN111292334A (en) * 2018-12-10 2020-06-16 北京地平线机器人技术研发有限公司 Panoramic image segmentation method and device and electronic equipment
CN111353508A (en) * 2019-12-19 2020-06-30 华南理工大学 Saliency detection method and device based on RGB image pseudo-depth information
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN111680702A (en) * 2020-05-28 2020-09-18 杭州电子科技大学 Method for realizing weak supervision image significance detection by using detection frame
CN111860672A (en) * 2020-07-28 2020-10-30 北京邮电大学 Fine-grained image classification method based on block convolutional neural network
CN111914850A (en) * 2019-05-07 2020-11-10 百度在线网络技术(北京)有限公司 Picture feature extraction method, device, server and medium
CN112085020A (en) * 2020-09-08 2020-12-15 北京印刷学院 Visual saliency target detection method and device
CN112434618A (en) * 2020-11-26 2021-03-02 西安电子科技大学 Video target detection method based on sparse foreground prior, storage medium and equipment
CN112541912A (en) * 2020-12-23 2021-03-23 中国矿业大学 Method and device for rapidly detecting saliency target in mine sudden disaster scene
CN112766032A (en) * 2020-11-26 2021-05-07 电子科技大学 SAR image saliency map generation method based on multi-scale and super-pixel segmentation
CN114332572A (en) * 2021-12-15 2022-04-12 南方医科大学 Method for extracting breast lesion ultrasonic image multi-scale fusion characteristic parameters based on saliency map guided hierarchical dense characteristic fusion network
CN116645368A (en) * 2023-07-27 2023-08-25 青岛伟东包装有限公司 Online visual detection method for edge curl of casting film

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106611427A (en) * 2015-10-21 2017-05-03 中国人民解放军理工大学 A video saliency detection method based on candidate area merging
CN106780582A (en) * 2016-12-16 2017-05-31 西安电子科技大学 Based on the image significance detection method that textural characteristics and color characteristic are merged

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106611427A (en) * 2015-10-21 2017-05-03 中国人民解放军理工大学 A video saliency detection method based on candidate area merging
CN106780582A (en) * 2016-12-16 2017-05-31 西安电子科技大学 Based on the image significance detection method that textural characteristics and color characteristic are merged

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SAI SRIVATSA R et al.: "Salient Object Detection via Objectness Measure", arXiv:1506.07363v1 [cs.CV] *
YAO QIN et al.: "Saliency Detection via Cellular Automata", IEEE Conference on Computer Vision and Pattern Recognition *
WANG Gang et al.: "A Multi-Scale Superpixel Saliency Detection Algorithm", Computer Engineering *

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108648192A (en) * 2018-05-17 2018-10-12 杭州依图医疗技术有限公司 A kind of method and device of detection tubercle
CN108648192B (en) * 2018-05-17 2021-08-17 杭州依图医疗技术有限公司 Method and device for detecting nodule
CN108776815A (en) * 2018-06-04 2018-11-09 国网辽宁省电力有限公司信息通信分公司 A kind of conspicuousness object detection method based on depth characteristic
CN109101897A (en) * 2018-07-20 2018-12-28 中国科学院自动化研究所 Object detection method, system and the relevant device of underwater robot
CN109242032A (en) * 2018-09-21 2019-01-18 桂林电子科技大学 A kind of object detection method based on deep learning
CN109242032B (en) * 2018-09-21 2021-11-16 桂林电子科技大学 Target detection method based on deep learning
CN109598241A (en) * 2018-12-05 2019-04-09 武汉大学 Satellite image marine vessel recognition methods based on Faster R-CNN
CN109598241B (en) * 2018-12-05 2022-08-12 武汉大学 Satellite image marine ship identification method based on Faster R-CNN
CN111292334A (en) * 2018-12-10 2020-06-16 北京地平线机器人技术研发有限公司 Panoramic image segmentation method and device and electronic equipment
CN111292334B (en) * 2018-12-10 2023-06-09 北京地平线机器人技术研发有限公司 Panoramic image segmentation method and device and electronic equipment
CN109902693A (en) * 2019-02-16 2019-06-18 太原理工大学 One kind being based on more attention spatial pyramid characteristic image recognition methods
CN109902806A (en) * 2019-02-26 2019-06-18 清华大学 Method is determined based on the noise image object boundary frame of convolutional neural networks
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
JP2021530062A (en) * 2019-03-12 2021-11-04 南京郵電大学Nanjing University Of Posts And Telecommunications In-vehicle video target detection method based on deep learning
JP7120689B2 (en) 2019-03-12 2022-08-17 南京郵電大学 In-Vehicle Video Target Detection Method Based on Deep Learning
CN110084247A (en) * 2019-04-17 2019-08-02 上海师范大学 A kind of multiple dimensioned conspicuousness detection method and device based on fuzzy characteristics
CN111914850B (en) * 2019-05-07 2023-09-19 百度在线网络技术(北京)有限公司 Picture feature extraction method, device, server and medium
CN111914850A (en) * 2019-05-07 2020-11-10 百度在线网络技术(北京)有限公司 Picture feature extraction method, device, server and medium
CN110211135A (en) * 2019-06-05 2019-09-06 广东工业大学 A kind of diatom image partition method, device and equipment towards complex background interference
CN110472639A (en) * 2019-08-05 2019-11-19 山东工商学院 A kind of target extraction method based on conspicuousness prior information
CN110472639B (en) * 2019-08-05 2023-04-18 山东工商学院 Target extraction method based on significance prior information
CN110473212A (en) * 2019-08-15 2019-11-19 广东工业大学 A kind of Electronic Speculum diatom image partition method and device merging conspicuousness and super-pixel
CN110751155A (en) * 2019-10-14 2020-02-04 西北工业大学 Novel target detection method based on Faster R-CNN
CN111353508A (en) * 2019-12-19 2020-06-30 华南理工大学 Saliency detection method and device based on RGB image pseudo-depth information
CN111680702A (en) * 2020-05-28 2020-09-18 杭州电子科技大学 Method for realizing weak supervision image significance detection by using detection frame
CN111860672B (en) * 2020-07-28 2021-03-16 北京邮电大学 Fine-grained image classification method based on block convolutional neural network
CN111860672A (en) * 2020-07-28 2020-10-30 北京邮电大学 Fine-grained image classification method based on block convolutional neural network
CN112085020B (en) * 2020-09-08 2023-08-01 北京印刷学院 Visual saliency target detection method and device
CN112085020A (en) * 2020-09-08 2020-12-15 北京印刷学院 Visual saliency target detection method and device
CN112434618B (en) * 2020-11-26 2023-06-23 西安电子科技大学 Video target detection method, storage medium and device based on sparse foreground priori
CN112766032A (en) * 2020-11-26 2021-05-07 电子科技大学 SAR image saliency map generation method based on multi-scale and super-pixel segmentation
CN112434618A (en) * 2020-11-26 2021-03-02 西安电子科技大学 Video target detection method based on sparse foreground prior, storage medium and equipment
CN112541912A (en) * 2020-12-23 2021-03-23 中国矿业大学 Method and device for rapidly detecting saliency target in mine sudden disaster scene
CN112541912B (en) * 2020-12-23 2024-03-12 中国矿业大学 Rapid detection method and device for salient targets in mine sudden disaster scene
CN114332572A (en) * 2021-12-15 2022-04-12 南方医科大学 Method for extracting breast lesion ultrasonic image multi-scale fusion characteristic parameters based on saliency map guided hierarchical dense characteristic fusion network
CN114332572B (en) * 2021-12-15 2024-03-26 南方医科大学 Method for extracting breast lesion ultrasonic image multi-scale fusion characteristic parameters based on saliency map-guided hierarchical dense characteristic fusion network
CN116645368A (en) * 2023-07-27 2023-08-25 青岛伟东包装有限公司 Online visual detection method for edge curl of casting film
CN116645368B (en) * 2023-07-27 2023-10-03 青岛伟东包装有限公司 Online visual detection method for edge curl of casting film

Similar Documents

Publication Publication Date Title
CN107680106A (en) A kind of conspicuousness object detection method based on Faster R CNN
Shen et al. Detection of stored-grain insects using deep learning
Elhoseny et al. Optimal bilateral filter and convolutional neural network based denoising method of medical image measurements
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN109919108B (en) Remote sensing image rapid target detection method based on deep hash auxiliary network
Mousavian et al. Joint semantic segmentation and depth estimation with deep convolutional networks
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
Byeon et al. Scene labeling with lstm recurrent neural networks
CN112184752A (en) Video target tracking method based on pyramid convolution
CN109002755B (en) Age estimation model construction method and estimation method based on face image
CN109241982A (en) Object detection method based on depth layer convolutional neural networks
CN104392228A (en) Unmanned aerial vehicle image target class detection method based on conditional random field model
CN108171249B (en) RGBD data-based local descriptor learning method
CN107506792B (en) Semi-supervised salient object detection method
CN113436227A (en) Twin network target tracking method based on inverted residual error
CN113743417B (en) Semantic segmentation method and semantic segmentation device
CN111401380A (en) RGB-D image semantic segmentation method based on depth feature enhancement and edge optimization
Mohaghegh et al. Aggregation of rich depth-aware features in a modified stacked generalization model for single image depth estimation
CN115115863A (en) Water surface multi-scale target detection method, device and system and storage medium
Yuan et al. A lightweight network for smoke semantic segmentation
Baha et al. Accurate real-time neural disparity MAP estimation with FPGA
Bagwari et al. A comprehensive review on segmentation techniques for satellite images
CN114743045B (en) Small sample target detection method based on double-branch area suggestion network
CN110853052A (en) Tujia brocade pattern primitive segmentation method based on deep learning
Li et al. Ship segmentation via encoder–decoder network with global attention in high-resolution SAR images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180209