CN109583517A

CN109583517A - An enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection

Info

Publication number: CN109583517A
Application number: CN201811601302.9A
Authority: CN
Inventors: 胡辉 (Hu Hui); 司凤洋 (Si Fengyang)
Original assignee: East China Jiaotong University
Current assignee: East China Jiaotong University
Priority date: 2018-12-26
Filing date: 2018-12-26
Publication date: 2019-04-05

Abstract

The invention belongs to the technical field of image processing. Building on the fully convolutional instance-aware semantic segmentation (FCIS) algorithm, it discloses an enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection, comprising shared feature map extraction, proposal (pre-selection box) extraction, generation of position-sensitive score maps, and classification and regression. In shared feature map extraction, the conv1, conv3 and conv5 feature maps are fused so that the shared feature map retains both high-level semantic information and high-resolution detail. Because the proposal extraction network performs poorly, a dual RPN algorithm is proposed for the proposal extraction stage; its average recall is 7% higher than that of RPN. The mAP of the EFCIS algorithm of the invention is 3.5% higher than that of FCIS, and for small targets it is 2.9% higher. Experiments show that the invention greatly improves the ability to capture small targets.

Description

An enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection

Technical field

The invention belongs to the technical field of image processing, and particularly relates to an enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection.

Background art

Currently, the prior art commonly used in the industry is as follows:

Scene understanding is a core challenge in computer vision, and instance semantic segmentation is a necessary step toward it. In the image domain, instance semantic segmentation is a composite task that integrates image classification, object detection and image segmentation. It is widely used in geographic information systems, autonomous driving, medical image analysis, robotics and other fields.

With the rapid development of deep learning based on convolutional neural networks, more and more subtasks of instance semantic segmentation can be accomplished with convolutional neural networks. In recent years, algorithms based on fully convolutional networks (FCN) have dominated image semantic segmentation. Among the algorithms that use fully convolutional networks for image semantic segmentation, fully convolutional instance-aware semantic segmentation (FCIS) has achieved excellent results. FCIS combines many outstanding achievements of its time, is the first algorithm to perform end-to-end fully convolutional instance semantic segmentation, and holds a clear lead over other instance segmentation algorithms: it won first place in the 2016 MS COCO segmentation challenge, well ahead of the second-place entry.

In summary, the problems of the prior art are:

(1) Because of an inherent property of convolutional neural networks, namely down-sampling, small targets in an image may vanish during shared feature extraction and therefore cannot take part in the later tasks of the algorithm.

(2) FCIS is a typical "two-stage" algorithm. Its first stage must extract a large number of proposals (pre-selection boxes), and the recall of the proposals extracted by the existing RPN network still has shortcomings, which degrades the performance of the later tasks of the algorithm.

Difficulty and significance of solving the above technical problems:

Difficulty:

(1) To counter the loss of small-target feature maps caused by down-sampling in convolutional neural networks, the current approach in industry and academia is to fuse feature maps from shallow, middle and deep layers. However, this sharply increases the number of model parameters, which makes the whole network very difficult to train and highly prone to overfitting, requires far more computing resources, and significantly increases the execution time of the algorithm. Beyond these difficulties, while improving this algorithm we also found that up-sampling or down-sampling loses a large amount of information, and because these operations are not learnable, the loss is fatal to the accuracy of the whole algorithm.

(2) At present, in both industry and academia, "two-stage" algorithms use an RPN network for proposal extraction. However, to reach a higher recall, an RPN must greatly increase the number of proposals, which places a heavy burden on speeding up the later stages of the algorithm; moreover, even with a large number of proposals, targets are still missed. Schemes such as Selective Search and Edge Boxes have been proposed for this purpose, but they still struggle to solve the problem.

Significance:

(1) For problem 1, the invention proposes a feature fusion scheme that fuses conv1, conv3 and conv5, taking into account both the high-resolution detail of shallow features and the high-level semantics of deep features, and compresses the fused features into a unified space with a 1×1 convolution so that the shared feature maps participating in the later tasks are mutually consistent. During fusion, conv1 is down-sampled with a dilated convolution of stride 2 and conv5 is up-sampled with a transposed convolution. These sampling operations are learnable, which distinguishes them from traditional bilinear sampling and interpolation, and they greatly help prevent feature map distortion. The invention thus offers a new perspective on feature map sampling during feature fusion.

(2) For problem 2, the invention trains one RPN network on conv3 and another on conv4, and merges the proposals generated by the two with the NMS algorithm. For this scheme, a semantic enhancement network is added after conv3 to address the low semantic level of the shallow network, so that the proposals extracted from conv3 and conv4 can be merged. The dual RPN is proposed here for the first time, and its implementation solves the problem of the low semantic level of shallow features. The dual RPN also outperforms the currently popular RPN, Selective Search and Edge Boxes.

Summary of the invention

In view of the problems in the prior art, the invention provides an enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection. To address these problems, the invention proposes the enhanced fully convolutional instance-aware semantic segmentation (EFCIS) algorithm, which solves the poor performance of the FCIS algorithm in segmenting small targets.

The invention is realized as follows: an enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection, comprising:

Step 1: shared feature map extraction. The feature maps of different layers are up-sampled or down-sampled to a common resolution and then fused;

Step 2: proposal extraction. Two RPN networks are trained on conv3 and conv4 respectively, and the proposals generated by the two RPN networks are merged using non-maximum suppression (NMS);

Step 3: generation of position-sensitive score maps. A shared feature map is generated with a fully convolutional network (FCN), and a 1×1 convolution with 2k²×(C+1) output channels then generates the position-sensitive score maps, where C+1 denotes C target categories plus 1 background category. The position-sensitive feature map of each ROI is assembled from the corresponding k² position-sensitive score maps;

Step 4: classification and regression. Regression is performed on the ROIs whose position-sensitive score maps were generated in step 3, and the corresponding score maps are classified. For classification, pixel-level score maps are obtained by assembling the ROI's position-sensitive score maps, and for each pixel in the ROI two 1×1 convolutions determine, respectively, whether the pixel lies in this ROI and whether it lies within the boundary of the object.

Further, step 1 specifically includes:

Sharing feature figure extracts, and when input picture the ratio of width to height is constant, short side is adjusted to 600；Using ResNet-101 The characteristic pattern that conv1, conv3 and conv5 layer of basic network Model Fusion extracts sharing feature figure；To the characteristic pattern of different layers Make the resolution ratio of characteristic pattern consistent using different up-sampling or down-sampling；

A dilated convolution with dilation 2, stride 2, kernel size 3 and padding 0 is applied to conv1, reducing the resolution of the conv1 feature map by a factor of 2; the conv3 resolution is kept unchanged;

The conv3 feature map and the down-sampled conv1 feature map are fused, and a 1×1×512 convolution then compresses the feature maps from conv1 and conv3 into a unified space;

For conv5, a deconvolution (Deconv) with dilation 2, stride 2, kernel size 3 and padding 0 up-samples the resolution of the conv5 feature map by a factor of 2;

Finally, a 1×1×512 convolution fuses the feature maps from conv1, conv3 and conv5 to complete the feature fusion.
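The spatial effect of the two sampling layers can be checked with the standard output-size formulas for a dilated convolution and a transposed convolution (the convention documented for PyTorch's `Conv2d`/`ConvTranspose2d`). This is a sketch; the input sizes below are arbitrary examples, not values from the text. With padding 0, neither layer lands exactly on half or double the input size, so in practice the maps would presumably be cropped or padded to a common size before fusion; that alignment step is an assumption not described in the text.

```python
def conv_out(size, kernel, stride, padding, dilation):
    """Spatial output size of a (dilated) convolution."""
    effective = dilation * (kernel - 1) + 1  # span of the dilated kernel
    return (size + 2 * padding - effective) // stride + 1


def deconv_out(size, kernel, stride, padding, dilation, output_padding=0):
    """Spatial output size of a transposed convolution (deconvolution)."""
    return ((size - 1) * stride - 2 * padding
            + dilation * (kernel - 1) + output_padding + 1)


# conv1 branch: dilation 2, stride 2, kernel 3, padding 0 roughly halves the size.
print(conv_out(300, kernel=3, stride=2, padding=0, dilation=2))    # 148
# conv5 branch: the same hyperparameters in a deconvolution roughly double it.
print(deconv_out(38, kernel=3, stride=2, padding=0, dilation=2))   # 79
```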

Further, step 2 specifically includes:

1) A deep RPN network is trained on the conv4 features;

2) A semantic enhancement network consisting of three 3×3×512 convolutional layers is added on top of conv3;

3) A shallow RPN network is trained on the feature map output by step 2);

4) The RPN networks of step 1) and step 3) each generate 300 proposals. The proposals of the two RPN networks are merged with the soft-NMS algorithm, with its IoU parameter set to 0.7, and the 300 proposals with the highest predicted scores are retained.
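The merging step can be sketched as follows. This assumes the linear variant of soft-NMS (a box overlapping the current best box by more than the IoU threshold has its score multiplied by 1 − IoU instead of being discarded) and boxes given as (x1, y1, x2, y2); the text does not say which soft-NMS decay is used, and `merge_dual_rpn` is a hypothetical helper name.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_linear(boxes, scores, iou_thresh=0.7, top_k=300):
    """Linear soft-NMS: repeatedly keep the highest-scoring box and decay the
    scores of boxes that overlap it by more than iou_thresh."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining and len(kept) < top_k:
        best_i = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best_i)
        kept.append((best_box, best_score))
        decayed = []
        for box, score in remaining:
            o = iou(best_box, box)
            if o > iou_thresh:
                score *= 1.0 - o  # soften instead of suppressing outright
            decayed.append((box, score))
        remaining = decayed
    return kept


def merge_dual_rpn(deep, shallow, iou_thresh=0.7, top_k=300):
    """deep / shallow: lists of (box, score) from the deep and shallow RPNs."""
    pooled = deep + shallow
    return soft_nms_linear([b for b, _ in pooled], [s for _, s in pooled],
                           iou_thresh, top_k)
```

Unlike hard NMS, a heavily overlapping proposal from the other RPN is not thrown away; it only drops in the ranking, so it can still survive the top-300 cut if few better proposals exist.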

Further, step 3 specifically includes:

A 1×1 convolution with 2k²×(C+1) output channels generates the position-sensitive score maps, where C+1 denotes the target categories plus one background category, and (k, C) is set to (7, 80). Each score map has 1/16 of the input image resolution (a down-sampling factor of 16), and each ROI is formed by combining the corresponding k² position-sensitive score maps.
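The channel count and the assembling step can be sketched in plain Python. The factor 2 corresponds to two score types per position (inside/outside, as in the original FCIS); the ordering of the k² maps (index i·k + j for bin row i, column j) is a hypothetical convention chosen for illustration, and the ROI is assumed to be given in integer feature-map coordinates.

```python
def num_ps_channels(k, num_classes):
    # 2 score types (inside / outside) x k*k relative positions x (C + 1) categories
    return 2 * k * k * (num_classes + 1)


def assemble_roi(score_maps, roi, k):
    """Assemble one ROI's score map from k*k position-sensitive maps.

    score_maps: list of k*k 2-D maps (lists of rows) for ONE category and ONE
                score type; map i*k + j holds the scores for the bin in row i,
                column j of the ROI (hypothetical ordering).
    roi:        (x0, y0, x1, y1) in feature-map pixels, x1/y1 exclusive.
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    out = [[0.0] * w for _ in range(h)]
    for dy in range(h):
        i = min(dy * k // h, k - 1)          # bin row of this pixel
        for dx in range(w):
            j = min(dx * k // w, k - 1)      # bin column of this pixel
            out[dy][dx] = score_maps[i * k + j][y0 + dy][x0 + dx]
    return out


print(num_ps_channels(7, 80))  # 7938 channels for (k, C) = (7, 80)
```

Each pixel of the assembled ROI map is copied from the score map that is responsible for its relative position inside the ROI, which is what makes the scores translation-variant.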

Further, step 4 specifically includes:

Regression is performed on the generated ROIs and their score maps are classified. If the IoU between an ROI and its corresponding annotation box exceeds 0.5, the ROI is set as a positive sample; otherwise it is marked as a negative sample. A multi-task loss function is used during classification and regression:

$L(k, k^*, m, m^*, t, t^*) = L_{cls}(k, k^*) + L_{mask}(m, m^*) + L_{reg}(t, t^*)$;

where $L_{cls}$ is the softmax loss over positive and negative sample boxes, $L_{reg}$ is the regression loss for positive sample boxes, and $L_{mask}$ is the softmax loss for the masks of positive samples; $k$ and $k^*$ denote the ROI prediction and its ground-truth label, and $m$ and $m^*$ denote the prediction and the annotation of each pixel;
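A per-ROI sketch of the labeling rule and the multi-task loss follows. The text names only the loss terms, so several choices here are assumptions borrowed from the usual Fast R-CNN/FCIS conventions: smooth L1 for $L_{reg}$, averaging $L_{mask}$ over the ROI's pixels, and applying $L_{mask}$ and $L_{reg}$ to positive ROIs only. All function names are hypothetical.

```python
import math


def softmax_xent(logits, label):
    """Softmax cross-entropy for one sample: -log softmax(logits)[label]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def smooth_l1(x):
    # Assumed regression penalty (not specified in the text).
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def assign_label(roi_iou, fg_thresh=0.5):
    """1 (positive) if IoU with the annotation box exceeds 0.5, else 0 (negative)."""
    return 1 if roi_iou > fg_thresh else 0


def multitask_loss(cls_logits, k_star, mask_logits, m_star, t_pred, t_target, positive):
    """L = L_cls + L_mask + L_reg for one ROI.

    mask_logits: per-pixel [outside, inside] logits; m_star: per-pixel 0/1 labels.
    """
    loss = softmax_xent(cls_logits, k_star)
    if positive:
        loss += sum(softmax_xent(p, y) for p, y in zip(mask_logits, m_star)) / len(m_star)
        loss += sum(smooth_l1(a - b) for a, b in zip(t_pred, t_target))
    return loss
```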

For regression, the regression vector of a positive sample box is $t = (t_x, t_y, t_h, t_w)$ and the predicted vector is $t^* = (t_x^*, t_y^*, t_h^*, t_w^*)$, where $t$ is computed as $t_x = (G_x - P_x)/P_w$, $t_y = (G_y - P_y)/P_h$, $t_w = \log(G_w/P_w)$, $t_h = \log(G_h/P_h)$;

where $P = (P_x, P_y, P_h, P_w)$ is the prediction box (proposal), $P_h$ and $P_w$ are its height and width, and $(P_x, P_y)$ are the coordinates of its center point; the annotation box $G$ is described in the same way;
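The encoding above, and its inverse that turns predicted offsets back into a box, can be sketched directly. Tuples follow the text's ordering (x, y, h, w) with (x, y) the box center; the function names are hypothetical.

```python
import math

# Boxes use the ordering of the text: (x, y, h, w), with (x, y) the box centre.


def encode(P, G):
    """Regression target t = (t_x, t_y, t_h, t_w) of annotation box G w.r.t. proposal P."""
    px, py, ph, pw = P
    gx, gy, gh, gw = G
    return ((gx - px) / pw,
            (gy - py) / ph,
            math.log(gh / ph),
            math.log(gw / pw))


def decode(P, t):
    """Inverse of encode: apply offsets t to proposal P to recover a box."""
    px, py, ph, pw = P
    tx, ty, th, tw = t
    return (px + tx * pw,
            py + ty * ph,
            ph * math.exp(th),
            pw * math.exp(tw))


# Usage: a proposal centred at the origin (h=4, w=2) and a box shifted by (1, 2)
# with twice the height gives t = (0.5, 0.5, log 2, 0.0).
print(encode((0.0, 0.0, 4.0, 2.0), (1.0, 2.0, 8.0, 2.0)))
```

Normalizing the center offsets by the proposal size and taking logs of the size ratios makes the targets invariant to the proposal's scale, which is why the same regressor can serve small and large boxes.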

For classification, given an ROI, pixel-level score maps are obtained by assembling the ROI's position-sensitive score maps, and classification is performed with a 1×1×512 convolution.
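One way the two per-pixel scores can yield both a mask and a class decision is the rule used in the original FCIS paper, sketched below: a per-pixel softmax over the "inside" and "outside" scores gives the mask probability, while a per-pixel max followed by averaging over the ROI and a softmax over categories gives the classification. This rule is taken from FCIS and is an assumption here, since the text does not spell out how the two scores are combined.

```python
import math


def pixel_softmax(inside, outside):
    """Probability that a pixel belongs to the object mask."""
    m = max(inside, outside)
    ei, eo = math.exp(inside - m), math.exp(outside - m)
    return ei / (ei + eo)


def roi_outputs(inside_maps, outside_maps):
    """inside_maps / outside_maps: dict category -> flat list of per-pixel scores
    for one ROI (already assembled), background category included.
    Returns (mask probabilities per category, category probabilities)."""
    masks = {c: [pixel_softmax(i, o) for i, o in zip(inside_maps[c], outside_maps[c])]
             for c in inside_maps}
    # Detection: per-pixel max over inside/outside, then average over the ROI.
    pooled = {c: sum(max(i, o) for i, o in zip(inside_maps[c], outside_maps[c]))
                 / len(inside_maps[c])
              for c in inside_maps}
    m = max(pooled.values())
    z = sum(math.exp(v - m) for v in pooled.values())
    probs = {c: math.exp(v - m) / z for c, v in pooled.items()}
    return masks, probs
```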

Another object of the invention is to provide a computer program that implements the enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection.

Another object of the invention is to provide a terminal that carries at least a controller implementing the enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection.

Another object of the invention is to provide a computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to execute the enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection.

Another object of the invention is to provide an autonomous driving vehicle that implements the enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection.

Another object of the invention is to provide a medical image analysis device that implements the enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection.

Another object of the invention is to provide an intelligent robot that implements the enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection.

In summary, the advantages and positive effects of the invention are as follows:

(1) Feature fusion gives the shared feature map both high-level semantic information (for classification) and high-resolution detail (for localization), and the proposal extraction performance of a network tests its classification and localization capabilities at the same time. For this evaluation the invention uses ResNet-50, trained and validated on the PASCAL VOC 2007 dataset with IoU = 0.5. Fig. 5 shows that fusing conv1, conv3 and conv5 works best, that multi-layer networks outperform single-layer networks, and that deep networks outperform shallow networks.

(2) On the PASCAL VOC 2007 dataset, the dual RPN algorithm was compared with the currently popular proposal extraction algorithms Edge Boxes, Selective Search and RPN; the results are shown in Fig. 3.

When the 50, 100 and 200 proposals with the highest predicted scores are taken, the dual RPN proposed by the invention far outperforms the other three proposal extraction networks. In particular, for the top 50 proposals with IoU set to 0.5, the dual RPN algorithm achieves a recall of 94.4%, which is 10.4%, 41.4% and 38.4% higher than RPN, Selective Search and Edge Boxes, respectively.

When IoU is set to 0.5, 0.6 and 0.7, the dual RPN proposed by the invention far outperforms the other three proposal extraction networks. In particular, for the top 1000 proposals by predicted score, the recall of the dual RPN algorithm is 3.2%, 8% and 10.5% higher than RPN; 9.3%, 18% and 21.5% higher than Selective Search; and 7.2%, 10.5% and 12% higher than Edge Boxes, respectively. This shows that the proposed algorithm has outstanding localization performance, which greatly helps capture small targets. The results are shown in Fig. 4.

Overall, taking the top 300 proposals by predicted score with IoU 0.5, the dual RPN algorithm proposed by the invention achieves a recall of 99.7%, 7% higher than the RPN algorithm.
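The recall figures above follow the usual proposal-recall definition, sketched here under the assumption that a ground-truth box counts as recalled when at least one of the top-N proposals overlaps it with IoU at or above the threshold (boxes as (x1, y1, x2, y2)). `recall_at` is a hypothetical helper name.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(proposals, gts, top_n, iou_thresh):
    """proposals: list of (box, score); gts: list of ground-truth boxes.
    Fraction of ground-truth boxes matched by any of the top_n proposals."""
    ranked = sorted(proposals, key=lambda p: p[1], reverse=True)[:top_n]
    top = [box for box, _ in ranked]
    hits = sum(1 for g in gts if any(iou(p, g) >= iou_thresh for p in top))
    return hits / len(gts)
```

Raising the IoU threshold or lowering top_n can only lower recall, which is why the comparisons above are reported per threshold and per proposal budget.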

(3) On the MS COCO dataset, as shown in Table 1, the two currently popular schemes of horizontal flipping and multi-scale training improve the mAP of the EFCIS algorithm by 0.6% and 0.8% respectively, while the data cropping scheme proposed by the invention improves it by 2.5%. For small targets in particular, horizontal flipping and multi-scale training improve mAP by 0.4% and 0.5% respectively, while the data cropping scheme of the invention improves the mAP of EFCIS by 2%.

(4) On the MS COCO dataset, as shown in Table 2, the mAP of the EFCIS algorithm is 3.5% higher than that of FCIS. For small targets in particular, the mAP of EFCIS is 2.9% higher than that of FCIS, which demonstrates that the EFCIS algorithm of the invention has a stronger ability to capture small targets. In addition, for medium and large targets the mAP of EFCIS is 3.6% and 4.1% higher than that of FCIS, respectively. A visual comparison of the segmentations is given in Fig. 7: (a) original image; (b) FCIS segmentation result; (c) EFCIS segmentation result.

Description of the drawings

Fig. 1 is a flow chart of the enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection provided by an embodiment of the invention.

Fig. 2 is a diagram of the feature fusion visualization and the feature sampling process provided by an embodiment of the invention.

Fig. 3 is a line chart comparing recall against IoU on the PASCAL VOC 2007 dataset, provided by an embodiment of the invention.

In the figure: (a) top-50 proposals by predicted score; (b) top-100 proposals by predicted score; (c) top-200 proposals by predicted score.

Fig. 4 is a line chart comparing recall against the number of proposals on the PASCAL VOC 2007 dataset, provided by an embodiment of the invention.

In the figure: (a) IoU = 0.5; (b) IoU = 0.6; (c) IoU = 0.7.

Fig. 5 is a line chart comparing recall against the number of proposals on the PASCAL VOC 2007 dataset for the feature fusion proposed by an embodiment of the invention.

Fig. 6 is a diagram of the image cropping scheme provided by an embodiment of the invention.

Fig. 7 is a comparison of EFCIS and FCIS results provided by an embodiment. In the figure: (a) input image; (b) FCIS segmentation result; (c) EFCIS segmentation result.

Fig. 8 is a schematic diagram of the dual RPN algorithm provided by an embodiment of the invention.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the invention clearer, the invention is further described in detail below with reference to embodiments. It should be understood that the specific embodiments described here only illustrate the invention and are not intended to limit it.

In the invention, RPN stands for region proposal network, a network for proposal extraction first proposed in the Faster R-CNN algorithm.

Recall: also called the recall ratio, the ratio of the number of relevant items retrieved to the total number of relevant items.

ResNet-101: a network proposed by Kaiming He et al. in 2015, which won first place in the ImageNet image classification competition and became widely known.

ROI: short for region of interest, i.e., in object detection algorithms, a region where a target may exist.

FCN: short for Fully Convolutional Network, proposed by Jonathan Long et al. of UC Berkeley in 2015.

NMS: short for Non-Maximum Suppression, put forward in the paper "Efficient Non-Maximum Suppression" and used in object detection to extract the highest-scoring windows.
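For reference, the classic greedy form of NMS can be sketched as follows, assuming boxes as (x1, y1, x2, y2): keep the highest-scoring box, discard every remaining box whose IoU with it exceeds the threshold, and repeat. This hard-suppression variant is shown for comparison with the soft-NMS used in the dual RPN; it is a generic illustration, not code from the invention.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh):
    """Greedy NMS; returns the indices of kept boxes in descending score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```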

The application principle of the invention is described in detail below with reference to specific schemes.

As shown in Fig. 1, the enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection provided by an embodiment of the invention, EFCIS, builds on the FCIS algorithm and is proposed to improve its segmentation performance on small targets. It relies mainly on three techniques (feature fusion, dual RPN and image cropping) to address the loss of small-target features caused by down-sampling in convolutional neural networks and the poor recall of the proposals extracted by the RPN network, thereby improving the small-target segmentation performance of FCIS.

The specific steps are as follows:

Step 1: sharing feature figure extracts, and keeps the ratio of width to height of input picture constant, and short side is adjusted to 600, this Invention extracts sharing feature figure using ResNet-101, and on the basis of ResNet-101, present invention fusion comes from conv1, Conv3 and conv5 layers of characteristic pattern, the present invention are protected for the characteristic pattern of different layers using different up-sampling or down-sampling The resolution ratio of characteristics of syndrome figure is consistent, for conv1, uses dilation for 2, stride 2, kernel 3, padding 0 Expansion convolution so that conv1 layers of characteristic pattern resolution ratio decline 2 times, merge it is down-sampled after conv1 and conv3 spy Sign figure, unifies this feature figure in a space using 1 × 1 × 512 convolution, for conv5 layers, uses dilation for 2, Stride is 2, kernel 3, and the deconvolution that padding is 0 equally samples so that conv5 layers of characteristic pattern up-samples 2 times Above-mentioned integration program, final fusion come from conv1, conv3, conv5 layers of characteristic pattern, to realize the following Fig. 2 of Fusion Features.

Step 2: pre-selection frame extracts, and in order to improve the recall rate that traditional RPN network pre-selection frame extracts, the present invention is led to respectively Conv3 and conv4 two RPN networks of training are crossed, the pre-selection frame generated respectively to it is melted using non-maxima suppression (NMS) It closes, details of operation is as follows:

(1) An RPN network is trained on the conv4 features; the invention calls it the deep RPN network.

(2) On top of conv3, the invention adds a semantic enhancement network consisting of three 3×3×512 convolutional layers (each followed by a ReLU activation), with parameters (kernel, padding, stride, number of filters) set to (3, 1, 1, 512).
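The cost and reach of this semantic enhancement network can be worked out arithmetically. The sketch assumes the conv3 output has 512 channels (as in ResNet-101's conv3 stage) and uses an arbitrary 75-pixel map side as an example; ReLU adds no parameters. Three stacked 3×3 convolutions with stride 1 and padding 1 keep the resolution unchanged and widen the receptive field to 7×7 on the conv3 map.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of one k x k convolution layer (ReLU adds none)."""
    return k * k * c_in * c_out + c_out


def enhancement_stats(num_layers=3, c_in=512, c_out=512, k=3, stride=1, padding=1, size=75):
    """Parameter count, receptive field (in pixels of the stack's input) and output size."""
    params, rf, jump, ch = 0, 1, 1, c_in
    for _ in range(num_layers):
        params += conv_params(ch, c_out, k)
        rf += (k - 1) * jump      # each layer widens the receptive field by (k-1)*jump
        jump *= stride            # spacing of adjacent outputs, in input pixels
        size = (size + 2 * padding - k) // stride + 1
        ch = c_out
    return params, rf, size


print(enhancement_stats())  # (7079424, 7, 75)
```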

(3) An RPN network is trained on the feature map output by (2); the invention calls it the shallow RPN network.

(4) The RPN networks of (1) and (3) each generate 300 proposals. The invention merges the proposals from the two RPN networks with the soft-NMS algorithm, with the IoU parameter of soft-NMS set to 0.7, so that the 300 proposals with the highest predicted scores are all retained.

Step 3: generation of position-sensitive score maps. To resolve the translation invariance of the shared feature map generated in step 1, the invention uses the fully convolutional network (FCN) scheme to generate position-sensitive score maps: a 1×1×1024 convolution is first applied to the shared feature map, and a 1×1 convolution with 2k²×(C+1) output channels then generates the position-sensitive score maps, where C+1 denotes the target categories plus one background category. For the MS COCO dataset the invention sets (k, C) to (7, 80). Each score map has 1/16 of the input image resolution, and each ROI is composed of the corresponding k² position-sensitive score maps.

Step 4: classification and regression. For the ROIs generated in step 3, regression is performed and the corresponding score maps are classified. If the IoU between an ROI and its corresponding annotation box exceeds 0.5, the invention sets the ROI as a positive sample; otherwise it is marked as a negative sample. During training the invention uses a multi-task loss function $L(k, k^*, m, m^*, t, t^*) = L_{cls}(k, k^*) + L_{mask}(m, m^*) + L_{reg}(t, t^*)$, where $L_{cls}$ is the softmax loss over positive and negative sample boxes, $L_{reg}$ is the regression loss for positive sample boxes, $L_{mask}$ is the softmax loss for positive sample masks, $k$ and $k^*$ denote the ROI prediction and its ground-truth label, and $m$ and $m^*$ denote the prediction and annotation of each pixel. For regression, the label vector of a positive sample box is $t = (t_x, t_y, t_h, t_w)$ and the predicted vector is $t^* = (t_x^*, t_y^*, t_h^*, t_w^*)$; the invention computes $t$ as $t_x = (G_x - P_x)/P_w$, $t_y = (G_y - P_y)/P_h$, $t_w = \log(G_w/P_w)$, $t_h = \log(G_h/P_h)$, where $P = (P_x, P_y, P_h, P_w)$ is the prediction box, $P_h$ and $P_w$ are its height and width, and $(P_x, P_y)$ is its center point; the annotation box $G$ is described in the same way. For classification, given an ROI, pixel-level score maps are obtained by assembling the ROI's position-sensitive score maps. Each pixel in the ROI then has two tasks: first, determining whether the pixel lies in this ROI; second, determining whether it lies within the boundary of the target object. The invention implements these two tasks with two 1×1 convolution operations.

The application of the invention is further described below with reference to the drawings.

The dual RPN algorithm of the invention was compared with the currently popular proposal extraction algorithms Edge Boxes, Selective Search and RPN. For the top 50 proposals by predicted score with IoU set to 0.5, the dual RPN algorithm of the invention achieves a recall of 94.4%, which is 10.4%, 41.4% and 38.4% higher than RPN, Selective Search and Edge Boxes, respectively. With IoU set to 0.5, 0.6 and 0.7 and the top 1000 proposals by predicted score, the recall of the dual RPN algorithm is 3.2%, 8% and 10.5% higher than RPN, 9.3%, 18% and 21.5% higher than Selective Search, and 7.2%, 10.5% and 12% higher than Edge Boxes, demonstrating the excellent localization performance of the dual RPN algorithm, which greatly helps improve small-target localization. In particular, for the top 300 proposals with IoU 0.5, the dual RPN algorithm of the invention achieves a recall of 99.7%, 7% higher than RPN. The results are shown in Fig. 3 and Fig. 4.

Under the MS COCO dataset standard, the image cropping scheme of the invention improves the mAP of the algorithm by 2.5%, while the currently popular horizontal flipping and multi-scale training schemes improve it by 0.6% and 0.8%, respectively. For small targets in particular, the data cropping scheme of the invention improves mAP by 2%, while horizontal flipping and multi-scale training improve it by 0.4% and 0.5%, respectively. The comparison is given in Table 1 below.

By fusing conv1, conv3 and conv5, the shared feature map extracted by the invention carries both high-level semantic classification information and high-detail localization information, and its resolution is doubled, giving a stronger ability to capture small targets.

Experiments on the MS COCO 2014 validation set show that the mAP of the EFCIS algorithm of the invention is 3.5% higher than that of FCIS; in particular, for small targets the mAP of EFCIS is 2.9% higher than that of FCIS, demonstrating that the algorithm of the invention has a stronger ability to capture small targets. The comparison is given in Table 2 below.

Table 1: different data augmentation schemes on the MS COCO dataset, where the EFCIS baseline includes feature fusion and the dual RPN but excludes the data cropping of the invention.

Table 2: comparison of EFCIS and FCIS results on the MS COCO dataset.

The application principle of the invention is further described below with reference to specific embodiments.

Embodiment

As shown in Fig. 1, this embodiment of the invention proposes an enhanced fully convolutional instance semantic segmentation algorithm suitable for small target detection (EFCIS). First, a feature fusion scheme and an image cropping scheme are proposed to solve the loss of small-target feature maps caused by down-sampling in convolutional neural networks. Second, building on the RPN network, the dual RPN is proposed to greatly improve the recall of the extracted proposals. On the MS COCO dataset, the mAP of the final EFCIS algorithm proposed by the invention is 3.5% higher than that of FCIS, and for small targets it is 2.9% higher. The specific steps are as follows:

Step 1: set up the hardware environment. The core hardware requirements of the invention are: storage: a 256 GB SSD plus a 2 TB mechanical hard disk; memory: 32 GB DDR4 RAM; CPU: Intel Core i7-7700K; GPU: NVIDIA GTX 1070 Ti (8 GB).

Step 2: set up the software environment. First install the Ubuntu 16.04 operating system; on it, install the Linux x64 Display Driver (version 390.87) graphics driver together with CUDA 9.0 + cuDNN 5.1 for GPU acceleration; compile and install the MXNet (commit 998378a) deep learning framework; and install the Python scientific computing libraries Cython, OpenCV 3.2, easydict 1.6 and hickle.

Step 3: shared feature map extraction, as shown in Fig. 2. The short side of the input image is resized to 600 while keeping its aspect ratio unchanged. On top of ResNet-101, the feature maps from the conv1, conv3 and conv5 layers are fused. For conv1, a dilated convolution with dilation 2, stride 2, kernel size 3 and padding 0 is applied; the down-sampled conv1 feature map is fused with the conv3 feature map, and a 1×1×512 convolution maps the result into a unified space. For conv5, a deconvolution with dilation 2, stride 2, kernel size 3 and padding 0 is applied, and the same fusion scheme is used. The feature maps from conv1, conv3 and conv5 are finally fused to complete the feature fusion.

Step 4: the dual RPN algorithm, as shown in Fig. 8:

1) A deep RPN network is trained on the conv4 features;

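2) Add a semantic enhancement network consisting of three 3 × 3 × 512 convolutional layers on top of conv3;

3) Train a second RPN network on the conv3 features produced by the semantic enhancement network of step 2);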

4) The RPN networks of step 1) and step 3) each generate 300 proposal boxes. The proposals from the two RPNs are merged with the soft-NMS algorithm, with the soft-NMS IoU parameter set to 0.7, and the proposals with the highest predicted scores are retained.
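The merging in step 4) can be sketched as follows. This is a minimal illustration, assuming the linear soft-NMS variant (the text does not say whether linear or Gaussian decay is used), boxes given as (x1, y1, x2, y2), and 0.7 as the overlap above which scores are decayed; `merge_dual_rpn` and the other helper names are hypothetical. Unlike hard NMS, an overlapping proposal from the other RPN is down-weighted rather than discarded, and a proposal that does not overlap a higher-scoring one keeps its original score.

```python
def iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_linear(boxes, scores, iou_thresh=0.7):
    """Linear soft-NMS: decay, rather than delete, the scores of overlapping boxes."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Take the highest-scoring remaining proposal.
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best)
        kept.append((best_box, best_score))
        # Decay proposals that overlap it by more than the threshold.
        decayed = []
        for box, score in remaining:
            overlap = iou(best_box, box)
            if overlap > iou_thresh:
                score *= 1.0 - overlap
            decayed.append((box, score))
        remaining = decayed
    return kept


def merge_dual_rpn(deep, shallow, iou_thresh=0.7, top_n=300):
    """Pool (box, score) proposals from both RPNs, apply soft-NMS, keep top_n."""
    pooled = deep + shallow
    kept = soft_nms_linear([b for b, _ in pooled], [s for _, s in pooled], iou_thresh)
    kept.sort(key=lambda bs: bs[1], reverse=True)
    return kept[:top_n]


if __name__ == "__main__":
    deep = [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.6)]
    shallow = [((1, 1, 10, 10), 0.8)]  # overlaps the first deep proposal (IoU 0.81)
    for box, score in merge_dual_rpn(deep, shallow):
        print(box, round(score, 3))
```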

Step 5: Based on the shared feature map generated in step 3, a 2k² × (C+1)-dimensional 1 × 1 convolution generates position-sensitive score maps, where C+1 denotes the C object categories plus one background category, and (k, C) is set to (7, 80). Each score map has 1/16 the resolution of the input image, and the feature map of each ROI from step 4 is assembled from the corresponding k² position-sensitive score maps.
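The channel count and the assembly of an ROI from its k² position-sensitive maps can be sketched as below. This is a simplified illustration, assuming the FCIS/R-FCN convention: the ROI, given in integer score-map coordinates, is split into a k × k grid, and each pixel copies its value from the score map belonging to the grid cell it falls in. The channel ordering `i * k + j`, the absence of interpolation, and the helper names are assumptions. In FCIS, the factor 2 in the channel count corresponds to the "inside" and "outside" score types.

```python
def num_score_channels(k=7, num_classes=80):
    """Output channels of the 1 x 1 convolution: 2 * k^2 * (C + 1)."""
    return 2 * k * k * (num_classes + 1)


def assemble_roi(score_maps, roi, k):
    """Assemble one ROI's map from k*k position-sensitive score maps.

    score_maps: list of k*k 2D lists (rows of values), one per grid cell,
                all for the same class and score type (assumed order i*k + j).
    roi: (x1, y1, x2, y2) in integer score-map coordinates, x2/y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    out = []
    for y in range(y1, y2):
        i = (y - y1) * k // h          # grid row this pixel falls in
        row = []
        for x in range(x1, x2):
            j = (x - x1) * k // w      # grid column this pixel falls in
            row.append(score_maps[i * k + j][y][x])
        out.append(row)
    return out


if __name__ == "__main__":
    print(num_score_channels())  # 7938
    # k = 2: four 4x4 maps, map c filled with the value c.
    maps = [[[c] * 4 for _ in range(4)] for c in range(4)]
    for row in assemble_roi(maps, (0, 0, 4, 4), k=2):
        print(row)  # [0, 0, 1, 1] twice, then [2, 2, 3, 3] twice
```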

Step 6: Each generated ROI is regressed and classified using its corresponding score maps. If the IoU between an ROI and its corresponding annotated box exceeds 0.5, the ROI is set as a positive sample; otherwise it is marked as a negative sample. Training uses a multi-task loss function: $L(k, k^*, m, m^*, t, t^*) = L_{cls}(k, k^*) + L_{mask}(m, m^*) + L_{reg}(t, t^*)$;

where $L_{cls}$ is the softmax loss over positive and negative sample boxes, $L_{reg}$ is the regression loss for positive sample boxes, and $L_{mask}$ is the softmax loss for positive-sample masks; $k$ and $k^*$ denote the ROI prediction and its ground-truth label, and $m$ and $m^*$ denote the prediction and annotation of each pixel;
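The label assignment and the multi-task loss of step 6 can be sketched in plain Python. This is a minimal illustration under stated assumptions: $L_{cls}$ is a softmax cross-entropy over the C+1 classes for every ROI; $L_{mask}$ is a per-pixel two-way softmax (not object versus object) averaged over the pixels of each positive ROI; $L_{reg}$ is taken to be smooth L1, which the text does not specify; and the three terms are summed with equal weight, as in the formula. The dictionary keys, the normalization, and the helper names are hypothetical.

```python
import math


def iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_positive(roi, gt_boxes, thresh=0.5):
    """An ROI is positive if its IoU with some annotated box exceeds 0.5."""
    return any(iou(roi, g) > thresh for g in gt_boxes)


def softmax_ce(logits, target):
    """Softmax cross-entropy for one example (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def smooth_l1(x):
    """Smooth L1 penalty (assumed form of L_reg)."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def multitask_loss(samples):
    """L = L_cls + L_mask + L_reg.

    Each sample is a dict with (hypothetical) keys:
      cls_logits, cls_label       class scores and true class (0 = background)
      mask_logits, mask_labels    per-pixel [not-object, object] logits, 0/1 labels
      box_pred, box_target        predicted and target (t_x, t_y, t_h, t_w)
    """
    l_cls = l_mask = l_reg = 0.0
    n_pos = 0
    for s in samples:
        l_cls += softmax_ce(s["cls_logits"], s["cls_label"])
        if s["cls_label"] == 0:
            continue  # negative ROIs only contribute to L_cls
        n_pos += 1
        pixels = list(zip(s["mask_logits"], s["mask_labels"]))
        l_mask += sum(softmax_ce(lg, y) for lg, y in pixels) / len(pixels)
        l_reg += sum(smooth_l1(p - q) for p, q in zip(s["box_pred"], s["box_target"]))
    return l_cls / len(samples) + (l_mask + l_reg) / max(n_pos, 1)
```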

For regression, the label vector of a positive sample box is $t = (t_x, t_y, t_h, t_w)$ and the predicted vector is $t^* = (t_x^*, t_y^*, t_h^*, t_w^*)$, where $t$ is computed as $t_x = (G_x - P_x)/P_w$, $t_y = (G_y - P_y)/P_h$, $t_w = \log(G_w/P_w)$, $t_h = \log(G_h/P_h)$;

where $P = (P_x, P_y, P_h, P_w)$ is the predicted box, $P_h$ and $P_w$ are its height and width, and $P_x$ and $P_y$ are the coordinates of its center; the annotated box $G$ is parameterized in the same way. This yields the complete instance segmentation result for an image, as shown in Fig. 7. Fig. 6 shows the image cropping scheme provided by an embodiment of the present invention.
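The regression targets of step 6 can be computed, and inverted, as below. This is a minimal sketch that follows the formulas above, with boxes written as (x, y, h, w) where (x, y) is the box center; the inverse `decode` is added only for illustration (it is the standard inverse of these formulas, not something the text states), and both function names are hypothetical.

```python
import math


def encode(G, P):
    """Regression target t = (t_x, t_y, t_h, t_w) of annotated box G relative to box P.

    Boxes are (x, y, h, w): center coordinates, height, width.
    """
    gx, gy, gh, gw = G
    px, py, ph, pw = P
    return (
        (gx - px) / pw,      # t_x
        (gy - py) / ph,      # t_y
        math.log(gh / ph),   # t_h
        math.log(gw / pw),   # t_w
    )


def decode(t, P):
    """Invert encode(): recover a box from offsets t applied to box P."""
    tx, ty, th, tw = t
    px, py, ph, pw = P
    return (px + tx * pw, py + ty * ph, ph * math.exp(th), pw * math.exp(tw))


if __name__ == "__main__":
    P = (50.0, 50.0, 20.0, 40.0)
    G = (60.0, 45.0, 40.0, 40.0)
    t = encode(G, P)
    print(t)             # (0.25, -0.25, 0.693..., 0.0)
    print(decode(t, P))  # approximately (60.0, 45.0, 40.0, 40.0)
```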

The comparison results for the feature fusion schemes proposed by the invention are shown in Fig. 5. Fusing the conv1, conv3, and conv5 feature maps gives the best result; the comparison also shows that fusing more layers works better than fusing fewer, and that deep feature maps work better than shallow ones.

As shown in Fig. 3, when the top 50 proposals by predicted score are taken and IoU = 0.5, the dual RPN algorithm achieves a recall of 94.4%, which is 10.4%, 41.4%, and 38.4% higher than RPN, Selective Search, and Edge Boxes, respectively.

As shown in Fig. 4, when the top 1000 proposals by predicted score are taken and IoU is set to 0.5, 0.6, and 0.7, the recall of the dual RPN algorithm is higher than that of RPN by 3.2%, 8%, and 10.5%; higher than Selective Search by 9.3%, 18%, and 21.5%; and higher than Edge Boxes by 7.2%, 10.5%, and 12%, respectively.
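Recall figures of this kind can be computed for a given set of proposals as sketched below. This is a minimal illustration, assuming the common definition: an annotated box counts as recalled if at least one of the top-N proposals (by predicted score) overlaps it with IoU at or above the threshold, and "average recall" is taken here as the mean recall over a list of IoU thresholds. The text does not define either quantity precisely, and the function names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(proposals, gt_boxes, top_n, iou_thresh):
    """Fraction of annotated boxes matched by at least one of the top_n proposals.

    proposals: list of ((x1, y1, x2, y2), score); gt_boxes: list of boxes.
    """
    ranked = sorted(proposals, key=lambda p: p[1], reverse=True)[:top_n]
    top = [box for box, _ in ranked]
    hits = sum(1 for g in gt_boxes if any(iou(p, g) >= iou_thresh for p in top))
    return hits / len(gt_boxes)


def average_recall(proposals, gt_boxes, top_n, thresholds):
    """Mean of recall_at over several IoU thresholds."""
    return sum(recall_at(proposals, gt_boxes, top_n, t) for t in thresholds) / len(thresholds)


if __name__ == "__main__":
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    props = [((100, 100, 110, 110), 0.1), ((0, 0, 10, 10), 0.9), ((21, 21, 30, 30), 0.8)]
    for thr in (0.5, 0.6, 0.7):
        print(thr, recall_at(props, gts, top_n=1000, iou_thresh=thr))
```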

Fig. 7 visualizes some images from the MS COCO test set: the first column is the input image, the second column is the FCIS segmentation result, and the third column is the EFCIS segmentation result. The visualization shows that EFCIS captures small objects in the images, which FCIS almost never does, and that for medium and large objects EFCIS produces finer segmentations.

As shown in Table 1, on the MS COCO dataset the image cropping scheme of the invention improves the algorithm's mAP by 2.5%, while horizontal flipping and multi-scale training improve it by 0.6% and 0.8%, respectively.

As shown in Table 2, on the MS COCO dataset EFCIS improves mAP by 3.5% over FCIS; for small objects in particular, the mAP of EFCIS is 2.9% higher than that of FCIS.

The above simulation results show that the EFCIS algorithm, proposed on the basis of FCIS, effectively improves instance-aware semantic segmentation: it achieves 66.8% mAP at IoU = 0.5, a significant improvement over the 61.7% mAP of FCIS, and for small objects in particular it improves mAP by 2.9% over FCIS.

In embodiments of the present invention.Fig. 8 is dual RPN algorithm schematic diagram provided by the invention.

The application principle of the invention is further described below in combination with its effects.

The algorithm provided by the invention comprises shared feature map extraction, proposal box extraction, position-sensitive score map generation, and classification and regression. In shared feature map extraction, to address the sparsity of the shared features, the conv1, conv3, and conv5 feature maps are fused so that the shared feature map retains both high-level semantic information and fine detail. To address the loss of small-object features caused by downsampling in convolutional neural networks, an image cropping scheme is proposed to suppress that loss. In proposal box extraction, to address the poor performance of the proposal extraction network, a dual RPN algorithm is proposed whose average recall is 7% higher than that of RPN. Overall, EFCIS improves mAP by 3.5% over FCIS, and by 2.9% for small objects in particular. Experiments show that the invention substantially improves the ability to capture small objects.

The above embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented wholly or partly in the form of a computer program product, the computer program product comprises one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the invention are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

The foregoing describes merely preferred embodiments of the invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. An enhanced fully convolutional instance-aware semantic segmentation algorithm suitable for small-object detection, characterized in that the algorithm comprises:

Step 1: Shared feature map extraction: the feature maps of different layers are upsampled or downsampled to a common resolution and then fused;

Step 2: Proposal box extraction: two RPN networks are trained on conv3 and conv4 respectively, and the proposal boxes generated by the two RPNs are jointly selected using non-maximum suppression (NMS);

Step 3: Position-sensitive score map generation: a shared feature map is generated by a fully convolutional network (FCN), and a 2k² × (C+1)-dimensional 1 × 1 convolution then generates the position-sensitive score maps, where C+1 denotes C object categories and 1 background category; the position-sensitive feature map of each ROI is assembled from the corresponding k² position-sensitive score maps;

Step 4: Classification and regression: regression is performed on the ROI position-sensitive maps generated in step 3 and classification on the corresponding score maps. In classification, pixel-level score maps are obtained by assembling the position-sensitive maps of the corresponding ROI, and for each pixel in the ROI, two 1 × 1 convolutions respectively determine whether the pixel lies within this ROI and whether it lies within the boundary of the target object.
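The per-pixel decision in step 4 can be illustrated with the formulation used in the original FCIS work, which this sketch assumes: for one class, each pixel of the ROI has an assembled "inside" score and "outside" score; a two-way softmax over the pair gives the probability that the pixel belongs to the object (the mask), and the pixel-wise maximum, averaged over the ROI, gives the ROI's score for that class. The function name and the list-of-rows input format are hypothetical.

```python
import math


def fcis_pixel_outputs(inside, outside):
    """Per-pixel mask probabilities and an ROI class score for one class.

    inside, outside: equal-size 2D lists (rows) of assembled scores.
    """
    mask = []
    total = 0.0
    n = 0
    for row_in, row_out in zip(inside, outside):
        mask_row = []
        for a, b in zip(row_in, row_out):
            # Two-way softmax over (inside, outside): P(pixel is on the object).
            mask_row.append(1.0 / (1.0 + math.exp(b - a)))
            # Pixel-wise max, later averaged over the ROI for classification.
            total += max(a, b)
            n += 1
        mask.append(mask_row)
    return mask, total / n


if __name__ == "__main__":
    mask, score = fcis_pixel_outputs([[2.0, 0.0]], [[0.0, 2.0]])
    print(mask, score)  # first pixel likely on the object, second not; score 2.0
```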

2. The enhanced fully convolutional instance-aware semantic segmentation algorithm suitable for small-object detection according to claim 1, characterized in that step 1 specifically comprises:

Shared feature map extraction: the aspect ratio of the input image is kept unchanged and its short side is resized to 600; using ResNet-101 as the base network model, the conv1, conv3, and conv5 feature maps are fused to extract the shared feature map; different upsampling or downsampling is applied to the feature maps of different layers so that their resolutions are consistent;

A dilated convolution with dilation 2, stride 2, kernel size 3, and padding 0 is applied to conv1, reducing the resolution of the conv1 feature map by a factor of 2; the resolution of conv3 is kept unchanged;

A deconvolution (Deconv) with dilation 2, stride 2, kernel size 3, and padding 0 is applied to conv5, upsampling the resolution of the conv5 feature map by a factor of 2;

A 1 × 1 × 512 convolution fuses the feature maps from conv1, conv3, and conv5, finally achieving feature fusion.

3. The enhanced fully convolutional instance-aware semantic segmentation algorithm suitable for small-object detection according to claim 1, characterized in that step 2 specifically comprises:

1) Train a deep RPN network on the conv4 features;

4) The RPN networks of step 1) and step 3) each generate 300 proposal boxes; the proposals from the two RPNs are merged with the soft-NMS algorithm, with the soft-NMS IoU parameter set to 0.7, and the 300 proposals with the highest predicted scores are retained.

4. The enhanced fully convolutional instance-aware semantic segmentation algorithm suitable for small-object detection according to claim 1, characterized in that step 3 specifically comprises:

A 2k² × (C+1)-dimensional 1 × 1 convolution generates the position-sensitive score maps, where C+1 denotes the object categories plus one background category, with (k, C) set to (7, 80); each score map has 1/16 the resolution of the input image, and each ROI is assembled from the corresponding k² position-sensitive score maps.

5. The enhanced fully convolutional instance-aware semantic segmentation algorithm suitable for small-object detection according to claim 1, characterized in that step 4 specifically comprises:

Regression is performed on the generated ROIs and classification on their score maps. If the IoU between an ROI and its corresponding annotated box exceeds 0.5, the ROI is set as a positive sample; otherwise it is marked as a negative sample. A multi-task loss function is used in the classification and regression process:

$L(k, k^*, m, m^*, t, t^*) = L_{cls}(k, k^*) + L_{mask}(m, m^*) + L_{reg}(t, t^*)$;

where $L_{cls}$ is the softmax loss over positive and negative sample boxes, $L_{reg}$ is the regression loss for positive sample boxes, and $L_{mask}$ is the softmax loss for positive-sample masks; $k$ and $k^*$ denote the ROI prediction and its ground-truth label, and $m$ and $m^*$ denote the prediction and annotation of each pixel;

In regression, a positive sample box has label vector $t = (t_x, t_y, t_h, t_w)$ and predicted vector $t^* = (t_x^*, t_y^*, t_h^*, t_w^*)$, where $t$ is computed as $t_x = (G_x - P_x)/P_w$, $t_y = (G_y - P_y)/P_h$, $t_w = \log(G_w/P_w)$, $t_h = \log(G_h/P_h)$;

where $P = (P_x, P_y, P_h, P_w)$ is the predicted box, $P_h$ and $P_w$ are its height and width, and $P_x$ and $P_y$ are the coordinates of its center; the annotated box $G$ is parameterized in the same way;

In classification, given an ROI, pixel-level score maps are obtained by assembling the position-sensitive maps of the corresponding ROI, and classification is performed by a 1 × 1 × 512 convolution.

6. A computer program implementing the enhanced fully convolutional instance-aware semantic segmentation algorithm suitable for small-object detection according to any one of claims 1 to 5.

7. A terminal, characterized in that the terminal is equipped with at least a controller implementing the enhanced fully convolutional instance-aware semantic segmentation algorithm suitable for small-object detection according to any one of claims 1 to 4.

8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the enhanced fully convolutional instance-aware semantic segmentation algorithm suitable for small-object detection according to any one of claims 1 to 5.

9. An autonomous vehicle implementing the enhanced fully convolutional instance-aware semantic segmentation algorithm suitable for small-object detection according to any one of claims 1 to 5.

10. A medical image analysis device implementing the enhanced fully convolutional instance-aware semantic segmentation algorithm suitable for small-object detection according to any one of claims 1 to 5.

11. An intelligent robot implementing the enhanced fully convolutional instance-aware semantic segmentation algorithm suitable for small-object detection according to any one of claims 1 to 5.