CN109543754B - Parallel method of target detection and semantic segmentation based on end-to-end deep learning


Info

Publication number
CN109543754B
Authority
CN
China
Prior art keywords
network
target
darknet
fcn
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811407476.1A
Other languages
Chinese (zh)
Other versions
CN109543754A (en)
Inventor
胡海峰
尹靓璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201811407476.1A priority Critical patent/CN109543754B/en
Publication of CN109543754A publication Critical patent/CN109543754A/en
Application granted granted Critical
Publication of CN109543754B publication Critical patent/CN109543754B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a parallel method for target detection and semantic segmentation based on end-to-end deep learning. By training on a large number of images annotated with target detection frames and pixel-level target segmentation masks, a model consisting of the target detection neural network Darknet-19 and the fully convolutional target segmentation network FCN is obtained, realizing the tasks of target detection and target segmentation in parallel.

Description

Parallel method of target detection and semantic segmentation based on end-to-end deep learning
Technical Field
The invention relates to the field of computer vision in artificial intelligence, and in particular to a parallel method of target detection and semantic segmentation based on end-to-end deep learning.
Background
In the field of deep learning object detection, the main task is the classification and localization of multiple objects in an image. Its development falls into three stages: first, traditional object detection methods; second, detection frameworks that combine candidate region proposals with CNN classification, represented by Regions with CNN features (R-CNN) and its successors such as Fast R-CNN; third, end-to-end frameworks that cast detection as a regression problem, represented by You Only Look Once (YOLO) and SSD. Traditional methods suffer from sliding-window region selection strategies that lack pertinence and from hand-designed features that are not robust to appearance variation. Region-proposal methods use texture, edge, and color information in the image to find likely target locations in advance, and can maintain a high recall rate while selecting relatively few windows (thousands or even hundreds), but repeated computation makes training and inference time-consuming; the subsequent Fast R-CNN and Faster R-CNN improve considerably on R-CNN, yet still struggle to meet video real-time requirements. YOLO V2 builds on YOLO V1 while preserving the real-time performance of the model; the basic network model adopted by YOLO V2 is Darknet-19.
In the field of deep-learning semantic segmentation, the main task is pixel-level division of multiple classes of target objects in an image, delineating the boundary information of the target objects more precisely. In 2014, Long et al. at Berkeley proposed Fully Convolutional Networks (FCN), a convolutional network for dense prediction that contains no fully connected layer and can effectively segment images of any size. Few network models combine the target detection and semantic segmentation tasks, and most such neural networks sacrifice real-time performance in pursuit of higher accuracy.
Disclosure of Invention
Aiming at the problems that existing deep learning target detection relies on sliding-window region selection strategies that lack pertinence and on hand-designed features that are not robust to appearance variation, and that the pixel-level division of multiple target objects in the semantic segmentation field of deep learning delineates the boundary information of target objects insufficiently, the invention provides a parallel method of target detection and semantic segmentation based on end-to-end deep learning, which adopts the following technical scheme:
a parallel method of target detection and semantic segmentation based on end-to-end deep learning comprises the following steps:
S1: constructing and training a deep neural network Darknet-19;
S2: constructing and training a fully convolutional neural network FCN;
S3: using the obtained deep neural network Darknet-19 and fully convolutional neural network FCN to perform target classification, localization, and pixel-level segmentation on the input image.
Further, the specific process of step S1 is as follows:
S11: collect pictures containing the multiple classes of targets to be detected in the applicable scene as a training data set, and annotate the multiple classes of target objects in the training data set according to the detection and segmentation tasks; the annotated pictures serve as standard output reference pictures;
S12: perform model migration, using the parameters of an existing partially convolutional neural network model as the initial training parameters of the convolutional shared network and the detection task part;
S13: first consider the training process of the target detection branch: a feature picture is obtained after the input image passes through the convolution and pooling layers of the convolutional shared network part shared by Darknet-19 and the FCN; the feature picture is input into the RPN module of Darknet-19 to obtain prediction frames with different anchor points and aspect ratios, and the feature pictures of the target areas in the prediction frames that meet the requirements are then sent to the classification module and the regression module of Darknet-19 respectively, to classify and localize the target, i.e., target detection.
S14: the classification module of Darknet-19 is a fully connected network with N+1 output units, giving the probability that the target in the target area belongs to each of the N classes or the background; a softmax layer is then applied, finally yielding the target class score Dark_cls_prob;
S15: the regression module of Darknet-19 is a fully connected network with 4N output units, giving four parameters of the predicted frame of each target area: a horizontal-axis starting point, a vertical-axis starting point, and the distances of the two starting points from the anchor point; finally a correction unit refines the predicted frame parameters and outputs the predicted frame coordinates Dark_bbox_pred of the target (a minimal code sketch of these two heads follows step S16);
S16: input a picture containing a target object and the corresponding annotated picture into the deep neural network Darknet-19, and adjust the parameters of Darknet-19 by stochastic gradient descent based on the target detection result it outputs; this process is repeated until Darknet-19 meets the requirements.
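As an illustration of the classification and regression modules of S14 and S15, the following is a minimal PyTorch sketch, not the patented implementation itself; the feature dimension, the class count N, and all module names are assumptions introduced here.

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Sketch of the Darknet-19 classification and regression heads (S14-S15).

    Assumptions (not specified in the patent): pooled per-region features
    of dimension feat_dim, and N object classes plus one background class.
    """
    def __init__(self, feat_dim=1024, num_classes=20):
        super().__init__()
        # S14: fully connected classifier with N+1 outputs, followed by softmax.
        self.cls_head = nn.Linear(feat_dim, num_classes + 1)
        # S15: fully connected regressor with 4*N outputs (one box per class).
        self.reg_head = nn.Linear(feat_dim, 4 * num_classes)

    def forward(self, region_feats):
        dark_cls_prob = torch.softmax(self.cls_head(region_feats), dim=-1)
        dark_bbox_pred = self.reg_head(region_feats)  # refined by a correction unit downstream
        return dark_cls_prob, dark_bbox_pred

# Usage: class scores and box parameters for a batch of 8 pooled region features.
heads = DetectionHeads()
cls_prob, bbox_pred = heads(torch.randn(8, 1024))
```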
Further, the specific process of step S2 is as follows:
S21: take the last three network layers of the FCN as the subsequent processing module of the segmentation network; these three layers comprise two convolutional layers and one deconvolution layer;
S22: considering the training process of the target segmentation branch, a feature picture is obtained after the input image passes through the convolutional shared network part shared by Darknet-19 and the FCN; the feature picture then passes through the last three layers of the FCN, namely two convolutional layers that do not change its size and only change the number of channels, and finally one deconvolution layer that up-samples it by a factor of two, giving Conv_FCN_out;
S23: fuse the output feature picture Pool4_out of the fourth pooling layer in the convolutional shared network part shared by Darknet-19 and the FCN with Conv_FCN_out from the previous step, then up-sample by a factor of two through one deconvolution layer to obtain Deconv_Pool3_out;
S24: fuse the output feature picture Pool3_out of the third pooling layer in the convolutional shared network part shared by Darknet-19 and the FCN with the fusion output of the previous step to obtain Deconv_Pool4_out, then up-sample by a factor of eight through one deconvolution layer back to the original picture size, finally obtaining Conv_seg_out; the deconvolution kernel of each deconvolution layer is initialized by bilinear interpolation and learned during training (the fusion of S22-S24 is sketched in code after step S25);
S25: when training the parallel target detection and segmentation network, the shared neural network parameters of the target detection branch Darknet-19 use the parameters of the migrated model, the parameters of each layer of the target segmentation branch FCN are randomly initialized, and finally a back-propagation algorithm that reduces the loss function is used to train the Darknet-19 target detection network and the FCN target segmentation network of the whole parallel network synchronously.
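A minimal PyTorch sketch of the S22-S24 skip fusion is given below. The names pool3/pool4/pool5 for the shared features, the channel widths c3/c4/c5, and the class count are assumptions for illustration; this shows the fusion pattern, not the patented network itself.

```python
import torch
import torch.nn as nn

class FCNHead(nn.Module):
    """Sketch of the FCN segmentation branch (S22-S24): two convolutions that
    change channels only, then 2x, 2x, and 8x deconvolutions fused with the
    pool4/pool3 skip features from the shared backbone."""
    def __init__(self, c5=1024, c4=512, c3=256, num_classes=21):
        super().__init__()
        self.score5 = nn.Sequential(  # S22: two convs; size unchanged, channels changed
            nn.Conv2d(c5, c5, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5, num_classes, 1))
        self.score4 = nn.Conv2d(c4, num_classes, 1)
        self.score3 = nn.Conv2d(c3, num_classes, 1)
        self.up2a = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up2b = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up8 = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8, padding=4)

    def forward(self, pool3, pool4, pool5):
        conv_fcn_out = self.up2a(self.score5(pool5))   # S22: 2x up-sample
        fused4 = conv_fcn_out + self.score4(pool4)     # S23: fuse with Pool4_out
        deconv = self.up2b(fused4)                     # S23: 2x up-sample
        fused3 = deconv + self.score3(pool3)           # S24: fuse with Pool3_out
        return self.up8(fused3)                        # S24: 8x back to input size
```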
Further, the specific process of step S3 is as follows:
S31: the target detection branch network Darknet-19 and the target segmentation branch network FCN of steps S1 and S2 have a convolutional shared part and are trained jointly, which shortens the training time of the whole detection-segmentation network.
S32: input the applicable picture into the convolutional shared network part shared by Darknet-19 and the FCN; the resulting feature picture is further processed by the RPN module of Darknet-19 and, through classification and regression, yields the classification scores and detection frames of the applicable targets in the image; the same feature picture is also passed through the fully convolutional layers of the target segmentation branch FCN to obtain a pixel-level segmentation picture.
Further, the training of the parallel network comprising the target detection branch Darknet-19 and the target segmentation branch FCN adopts a total loss function composed of a coordinate regression function, a classification cross-entropy function, and a mask loss function; model parameters are updated by minimizing this loss function and back-propagating the error.
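The patent names the three loss components but not their exact forms or weights; the following sketch combines common choices (smooth L1 for coordinates, cross-entropy for classification and masks) with hypothetical weights w_*.

```python
import torch
import torch.nn.functional as F

def total_loss(bbox_pred, bbox_gt, cls_logits, cls_gt, mask_logits, mask_gt,
               w_coord=1.0, w_cls=1.0, w_mask=1.0):
    """Sketch of the total loss: coordinate regression + classification
    cross-entropy + mask loss. The specific loss forms and the weights
    are assumptions; the patent only names the three components."""
    loss_coord = F.smooth_l1_loss(bbox_pred, bbox_gt)   # coordinate regression
    loss_cls = F.cross_entropy(cls_logits, cls_gt)      # detection classification
    loss_mask = F.cross_entropy(mask_logits, mask_gt)   # per-pixel segmentation
    return w_coord * loss_coord + w_cls * loss_cls + w_mask * loss_mask
```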
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a parallel method of target detection and semantic segmentation based on end-to-end deep learning, which is used for obtaining a model consisting of a target detection neural network Darknet-19 and a full convolution neural network FCN through training massive marked target detection and target segmentation images and realizing detection and pixel level segmentation of a target object of any input test image. The method and the device can better extract the detail features and the global features in the picture, and realize a parallel and effective real-time target detection and target segmentation task on the premise of ensuring the detection precision and the segmentation precision by processing the target detection part task by the Darknet-19 and carrying out pixel-level target division on the image by the FCN module.
Drawings
Fig. 1 is an overall flowchart of the parallel network for target detection and target segmentation based on a deep network provided by the invention.
Fig. 2 is the specific network structure of the target detection and segmentation network.
Fig. 3 shows the position of the prediction frame in the 3×4 grid in embodiment 1.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only some embodiments of the present invention, which are only for illustration and not to be construed as limitations of the present patent. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in Figs. 1-2, a parallel method of target detection and semantic segmentation based on end-to-end deep learning includes the following steps:
S1 Construction and training of the Darknet-19 target detection branch
S1.1 Acquire images: download the PASCAL VOC data set from the Internet; it provides a complete set of standard, high-quality data sets for image detection and image segmentation, and is used here for fine-tuning and testing the model;
S1.2 Preprocess the target data set: preprocess the pictures with common scale transformation, random cropping, noise addition, rotation transformation, and the like, to enhance the robustness of the model; the input scale of the target data set changes randomly from the initial fixed size to n×32, where n ranges from 9 to 19; the default input scale is 608;
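A minimal sketch of this multi-scale input selection, assuming PyTorch image tensors and bilinear resizing (the interpolation mode is not specified in the patent):

```python
import random
import torch
import torch.nn.functional as F

def random_rescale(images):
    """Multi-scale training input (S1.2 sketch): pick a side length n*32
    with n in [9, 19]; 608 (n = 19) corresponds to the default scale."""
    n = random.randint(9, 19)
    size = n * 32
    return F.interpolate(images, size=(size, size), mode='bilinear', align_corners=False)
```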
S1.3 Generate the feature picture: the basic network structure adopted by Darknet-19 contains no fully connected network, which is what allows the scale of the input picture to be adjusted dynamically in the first step; as an improvement, a normalization operation is added after each convolutional layer to speed up the convergence of the network; drawing on the idea of residual networks, fine-grained features are adopted: the feature picture obtained by a shallow convolutional layer is connected with the feature picture obtained by a deep convolutional layer, i.e., adjacent features are stacked into different channels, improving the detection precision of the model on small-scale objects;
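The channel-stacking connection described in S1.3 resembles YOLO V2's passthrough layer; a minimal sketch under that assumption (all shapes are illustrative):

```python
import torch

def passthrough_concat(shallow, deep):
    """Sketch of the fine-grained feature connection in S1.3: rearrange a
    shallow 2H x 2W feature map into four channel groups of H x W and
    concatenate with the deep map, stacking adjacent features into channels."""
    b, c, h2, w2 = shallow.shape                      # e.g. (b, 64, 2H, 2W)
    s = shallow.view(b, c, h2 // 2, 2, w2 // 2, 2)
    s = s.permute(0, 1, 3, 5, 2, 4).reshape(b, c * 4, h2 // 2, w2 // 2)
    return torch.cat([s, deep], dim=1)                # stack into different channels

# Usage: a 26x26x64 shallow map joins a 13x13x1024 deep map as 13x13x1280.
out = passthrough_concat(torch.randn(1, 64, 26, 26), torch.randn(1, 1024, 13, 13))
```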
S1.4 Cut the picture into grids: after the target picture passes through the basic neural network, the size of the resulting feature picture is reduced to 1/32 of the original picture because the total pooling stride is 32, and the picture is thereby divided into n×n grid cells; if the input image of Darknet-19 is 608×608, the output feature picture has 19×19 grid cells.
S1.5 Generate prediction frames: borrowing the region proposal method of the R-CNN family of target detection networks, the center point of each grid cell is used as an anchor point to predict a certain number of prediction frames with certain aspect ratios; each prediction frame contains five parameters, namely four coordinate values and a confidence score; since Darknet generates nine prediction frames per grid cell by default, each picture yields n×n×9×(5+C) parameters, where C is the number of target classes to be predicted.
S1.6 Correct the target position: the target position predicted by Darknet-19 is the coordinate position of each prediction frame relative to its grid cell, with coordinate parameters between 0 and 1, as shown in Fig. 3. From the parameters t_x, t_y, t_w, t_h, and t_o obtained for each prediction frame, the offsets c_x, c_y of the corresponding grid cell from the image edge, and the prior frame width and height p_w, p_h, the corrected center coordinates b_x, b_y of the prediction frame on the horizontal and vertical axes, the corrected width and height b_w, b_h, and the score of the prediction frame belonging to a certain class are obtained as
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}
P_r(object) · IOU(b, object) = σ(t_o)
where σ(·) is the sigmoid function, P_r(object) is the unrefined score of belonging to a class, and IOU(b, object) is the overlap of the prediction frame with the label frame.
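A short sketch of this decoding, assuming PyTorch tensors whose last dimension holds (t_x, t_y, t_w, t_h, t_o):

```python
import torch

def decode_boxes(t, c_x, c_y, p_w, p_h):
    """Sketch of the S1.6 correction: decode raw predictions into corrected
    box parameters, following the formulas above. Tensor layout is an
    assumption introduced for illustration."""
    t_x, t_y, t_w, t_h, t_o = t.unbind(-1)
    b_x = torch.sigmoid(t_x) + c_x        # center x, offset from the grid cell
    b_y = torch.sigmoid(t_y) + c_y        # center y
    b_w = p_w * torch.exp(t_w)            # width, scaled from the prior frame
    b_h = p_h * torch.exp(t_h)            # height
    conf = torch.sigmoid(t_o)             # Pr(object) * IOU(b, object)
    return b_x, b_y, b_w, b_h, conf
```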
S1.7 Screen the prediction frames: screen all prediction frames with the non-maximum suppression method to obtain the detection frames that meet the conditions, and give the object in each detection frame its score of belonging to a certain class.
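A minimal non-maximum suppression sketch follows; the corner-coordinate box layout and the IOU threshold value are assumptions.

```python
import torch

def nms(boxes, scores, iou_thresh=0.5):
    """Minimal non-maximum suppression sketch for S1.7; boxes are
    (x1, y1, x2, y2) rows, scores are the per-box class scores."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = boxes[order[1:]]
        # intersection-over-union of the top box with the remaining boxes
        xy1 = torch.maximum(boxes[i, :2], rest[:, :2])
        xy2 = torch.minimum(boxes[i, 2:], rest[:, 2:])
        inter = (xy2 - xy1).clamp(min=0).prod(dim=1)
        area_i = (boxes[i, 2:] - boxes[i, :2]).prod()
        area_r = (rest[:, 2:] - rest[:, :2]).prod(dim=1)
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]  # drop overlapping frames
    return keep
```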
S2 Construction and training of the fully convolutional network FCN of the target segmentation branch
S2.1 As in S1.1, download the PASCAL VOC data set from the Internet; it provides a complete set of standard, high-quality data sets for image detection and image segmentation;
S2.2 Adjust the existing trained convolutional neural network model AlexNet to obtain a preliminary fully convolutional network model;
S2.3 Delete the classification layer of the AlexNet convolutional neural network and convert its fully connected layers into convolutional layers; the operations of S1.1-S1.3 are shared with S1, and the last two convolutional layers and one deconvolution layer of the adjusted AlexNet are used;
S2.4 Up-sample the output of the highest convolutional layer by 2x to obtain that layer's up-sampled prediction, which contains the coarse segmentation information of the image; the result is Conv_FCN_out;
S2.5 Apply a 1×1 convolution to the pooling layer above it, i.e. the fourth layer of the shared convolution, so that the prediction of this pooling layer contains the detail segmentation information of the image; the result is Pool4_out;
S2.6 Sum the two predictions Conv_FCN_out and Pool4_out, then up-sample by 2x to obtain the up-sampled prediction Deconv_Pool4_out;
S2.7 Apply a 1×1 convolution to the next pooling layer up, i.e. the third layer of the shared convolution, so that the prediction of pooling layer 3 contains more image detail than that of pooling layer 4; the result is Pool3_out;
S2.8 Sum Deconv_Pool4_out and Pool3_out to obtain Deconv_Pool3_out, then up-sample by 8x to obtain the dense prediction result Conv_seg_out, which carries more detail information and has the same size as the original input image;
S2.9 Initialize the deconvolution kernels of the up-sampling layers by bilinear interpolation and learn them during training (see the sketch after this step list);
S2.10 Input images with their standard annotation maps, train over the full network by stochastic gradient descent, and fine-tune the parameters of all layers of the fully convolutional neural network to obtain better target detection and segmentation.
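The bilinear initialization in S2.9 can be sketched as below; the per-channel (non-mixing) kernel layout and the 21-class usage example are assumptions.

```python
import numpy as np
import torch

def bilinear_kernel(channels, kernel_size):
    """Sketch of the S2.9 bilinear initialization for a deconvolution
    (transposed convolution) kernel; the kernel is refined during training."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = np.ogrid[:kernel_size, :kernel_size]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    weight = np.zeros((channels, channels, kernel_size, kernel_size), dtype=np.float32)
    weight[range(channels), range(channels), :, :] = filt  # each channel upsamples itself
    return torch.from_numpy(weight)

# Usage: initialize a 2x up-sampling deconvolution layer.
deconv = torch.nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1, bias=False)
deconv.weight.data.copy_(bilinear_kernel(21, 4))
```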
S3 Perform target detection and target segmentation in parallel
S3.1: the two networks of steps S1 and S2, Darknet-19 and the FCN, have a convolutional shared part that uses the parameters of the trained migrated model, saving computation and memory; the two networks run in parallel in both the training and detection stages, shortening the training and testing time of the whole detection-segmentation network.
S3.2: input the applicable picture into the convolutional shared network shared by Darknet-19 and the FCN; the resulting feature picture is further processed by the RPN module of Darknet-19 and, through the classification module and the regression module of Darknet-19, yields the classification scores and detection frames of the applicable targets in the image; the fully convolutional layers of the target segmentation branch FCN yield the pixel-level segmentation picture. The training of the whole model adopts a total loss function composed of a coordinate regression function, a classification cross-entropy function, and a mask loss function, and the parameters of the parallel target detection and segmentation network are updated by minimizing this loss function and back-propagating the error.
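The parallel structure of S3.1-S3.2 amounts to computing the shared features once and feeding both branches; a schematic PyTorch sketch, where backbone, detect_branch, and segment_branch stand in for the modules built in S1 and S2 and are assumptions here:

```python
import torch.nn as nn

class ParallelDetSeg(nn.Module):
    """Sketch of the parallel forward pass (S3): one shared backbone pass
    feeds both the detection and the segmentation branches."""
    def __init__(self, backbone, detect_branch, segment_branch):
        super().__init__()
        self.backbone = backbone
        self.detect = detect_branch
        self.segment = segment_branch

    def forward(self, image):
        feats = self.backbone(image)          # shared convolutional part, computed once
        boxes, scores = self.detect(feats)    # Darknet-19 branch: frames + class scores
        seg_map = self.segment(feats)         # FCN branch: pixel-level segmentation
        return boxes, scores, seg_map
```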
It is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications based on the above teachings will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the invention shall fall within the protection scope of the claims of the invention.

Claims (2)

1. A parallel method of target detection and semantic segmentation based on end-to-end deep learning is characterized by comprising the following steps:
S1: constructing and training a deep neural network Darknet-19;
S2: constructing and training a fully convolutional neural network FCN;
S3: performing target classification, localization, and pixel-level segmentation on an input image by using the obtained deep neural network Darknet-19 and fully convolutional neural network FCN;
the specific process of the step S1 is as follows:
S11: collecting pictures containing the multiple classes of targets to be detected in the applicable scene as a training data set, and annotating the multiple classes of target objects in the training data set according to the detection and segmentation tasks; the annotated pictures serve as standard output reference pictures;
S12: performing model migration, using the parameters of an existing partially convolutional neural network model as the initial training parameters of the convolutional shared network and the detection task part;
S13: first considering the training process of the target detection branch: a feature picture is obtained after the input image passes through the convolution and pooling layers of the convolutional shared network part shared by Darknet-19 and the FCN; the feature picture is input into the RPN module of Darknet-19 to obtain prediction frames with different anchor points and aspect ratios, and the feature pictures of the target areas in the prediction frames that meet the requirements are then sent to the classification module and the regression module of Darknet-19 respectively, to classify and localize the target, i.e., target detection;
S14: the classification module of Darknet-19 is a fully connected network with N+1 output units, giving the probability that the target in the target area belongs to each of the N classes or the background; a softmax layer is then applied, finally yielding the target class score Dark_cls_prob;
S15: the regression module of Darknet-19 is a fully connected network with 4N output units, giving four parameters of the predicted frame of each target area: a horizontal-axis starting point, a vertical-axis starting point, and the distances of the two starting points from the anchor point; finally a correction unit refines the predicted frame parameters and outputs the predicted frame coordinates Dark_bbox_pred of the target;
S16: inputting a picture containing a target object and the corresponding annotated picture into the deep neural network Darknet-19, and adjusting the parameters of the deep neural network Darknet-19 by stochastic gradient descent based on the target detection result it outputs, this process being repeated until the deep neural network Darknet-19 meets the requirements;
the specific process of the step S2 is as follows:
S21: taking the last three network layers of the FCN as the subsequent processing module of the segmentation network, the three layers comprising two convolutional layers and one deconvolution layer;
S22: considering the training process of the target segmentation branch, a feature picture is obtained after the input image passes through the convolutional shared network part shared by Darknet-19 and the FCN; the feature picture then passes through the last three layers of the FCN, namely two convolutional layers that do not change its size and only change the number of channels, and finally one deconvolution layer that up-samples it by a factor of two, giving Conv_FCN_out;
S23: fusing the output feature picture Pool4_out of the fourth pooling layer in the convolutional shared network part shared by Darknet-19 and the FCN with Conv_FCN_out from the previous step, then up-sampling by a factor of two through one deconvolution layer to obtain Deconv_Pool3_out;
S24: fusing the output feature picture Pool3_out of the third pooling layer in the convolutional shared network part shared by Darknet-19 and the FCN with the fusion output of the previous step to obtain Deconv_Pool4_out, then up-sampling by a factor of eight through one deconvolution layer back to the original picture size, finally obtaining Conv_seg_out; the deconvolution kernel of each deconvolution layer is initialized by bilinear interpolation and learned during training;
S25: when training the parallel target detection and segmentation network, the shared neural network parameters of the target detection branch Darknet-19 use the parameters of the migrated model, the parameters of each layer of the target segmentation branch FCN are randomly initialized, and finally a back-propagation algorithm that reduces the loss function is used to train the Darknet-19 target detection network and the FCN target segmentation network of the whole parallel network synchronously;
the specific process of the step S3 is as follows:
inputting the picture into the convolutional shared network; the resulting feature picture is further processed by the RPN module and, through classification and regression, yields the classification scores and detection frames of the targets in the picture; and the fully convolutional layers of the FCN yield the pixel-level segmentation picture.
2. The parallel method of target detection and semantic segmentation based on end-to-end deep learning according to claim 1, wherein the training of the whole parallel target detection and segmentation network adopts a total loss function composed of a coordinate regression function, a classification cross-entropy function and a mask loss function, and model parameters are updated by minimizing the loss function and back-propagating the error.
CN201811407476.1A 2018-11-23 2018-11-23 Parallel method of target detection and semantic segmentation based on end-to-end deep learning Active CN109543754B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811407476.1A CN109543754B (en) 2018-11-23 2018-11-23 Parallel method of target detection and semantic segmentation based on end-to-end deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811407476.1A CN109543754B (en) 2018-11-23 2018-11-23 Parallel method of target detection and semantic segmentation based on end-to-end deep learning

Publications (2)

Publication Number Publication Date
CN109543754A CN109543754A (en) 2019-03-29
CN109543754B true CN109543754B (en) 2023-04-28

Family

ID=65849748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811407476.1A Active CN109543754B (en) 2018-11-23 2018-11-23 Parallel method of target detection and semantic segmentation based on end-to-end deep learning

Country Status (1)

Country Link
CN (1) CN109543754B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175625B (en) * 2019-04-11 2023-06-20 淮阴工学院 WeChat information identification and management method based on improved SSD algorithm
CN110246149A (en) * 2019-05-28 2019-09-17 西安交通大学 Indoor scene based on depth weighted full convolutional network migrates dividing method
CN110211164B (en) * 2019-06-05 2021-05-07 中德(珠海)人工智能研究院有限公司 Picture processing method of characteristic point operator based on neural network learning basic graph
CN110348342B (en) * 2019-06-27 2023-07-28 广东技术师范大学天河学院 Pipeline disease image segmentation method based on full convolution network
CN110458864A (en) * 2019-07-02 2019-11-15 南京邮电大学 Based on the method for tracking target and target tracker for integrating semantic knowledge and example aspects
CN110363817B (en) * 2019-07-10 2022-03-01 北京悉见科技有限公司 Target pose estimation method, electronic device, and medium
CN110533067A (en) * 2019-07-22 2019-12-03 杭州电子科技大学 The end-to-end Weakly supervised object detection method that frame based on deep learning returns
CN110390314B (en) * 2019-07-29 2022-02-15 深兰科技(上海)有限公司 Visual perception method and equipment
CN110473195B (en) * 2019-08-13 2023-04-18 中山大学 Medical focus detection framework and method capable of being customized automatically
CN110765844B (en) * 2019-09-03 2023-05-26 华南理工大学 Automatic non-induction type dinner plate image data labeling method based on countermeasure learning
CN111461160B (en) * 2019-11-11 2023-07-14 天津津航技术物理研究所 Infrared imaging seeker target tracking method for preventing cloud and fog interference
CN111027399B (en) * 2019-11-14 2023-08-22 武汉兴图新科电子股份有限公司 Remote sensing image water surface submarine recognition method based on deep learning
CN111027493B (en) * 2019-12-13 2022-05-20 电子科技大学 Pedestrian detection method based on deep learning multi-network soft fusion
CN111104892A (en) * 2019-12-16 2020-05-05 武汉大千信息技术有限公司 Human face tampering identification method based on target detection, model and identification method thereof
WO2021134519A1 (en) * 2019-12-31 2021-07-08 华为技术有限公司 Device and method for realizing data synchronization in neural network inference
CN111275082A (en) * 2020-01-14 2020-06-12 中国地质大学(武汉) Indoor object target detection method based on improved end-to-end neural network
CN111368687B (en) * 2020-02-28 2022-07-19 成都市微泊科技有限公司 Sidewalk vehicle illegal parking detection method based on target detection and semantic segmentation
CN111898439B (en) * 2020-06-29 2022-06-07 西安交通大学 Deep learning-based traffic scene joint target detection and semantic segmentation method
CN111860260B (en) * 2020-07-10 2024-01-26 逢亿科技(上海)有限公司 High-precision low-calculation target detection network system based on FPGA
CN112508030A (en) * 2020-12-18 2021-03-16 山西省信息产业技术研究院有限公司 Tunnel crack detection and measurement method based on double-depth learning model
CN112700444B (en) * 2021-02-19 2023-06-23 中国铁道科学研究院集团有限公司铁道建筑研究所 Bridge bolt detection method based on self-attention and central point regression model
CN113076972A (en) * 2021-03-04 2021-07-06 山东师范大学 Two-stage Logo image detection method and system based on deep learning
CN113344939A (en) * 2021-05-07 2021-09-03 西安智诊智能科技有限公司 Image segmentation method based on detail preservation network
CN112990162B (en) * 2021-05-18 2021-08-06 所托(杭州)汽车智能设备有限公司 Target detection method and device, terminal equipment and storage medium
CN113469950A (en) * 2021-06-08 2021-10-01 海南电网有限责任公司电力科学研究院 Method for diagnosing abnormal heating defect of composite insulator based on deep learning
CN113688682A (en) * 2021-07-23 2021-11-23 北京理工雷科电子信息技术有限公司 Clutter identification and target detection method based on improved FCN (fuzzy C-means) deep network
CN113408499B (en) * 2021-08-19 2022-01-04 天津所托瑞安汽车科技有限公司 Joint evaluation method and device of dual-network model and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304798B (en) * 2018-01-30 2020-09-29 北京同方软件有限公司 Street level order event video detection method based on deep learning and motion consistency
CN108492319B (en) * 2018-03-09 2021-09-03 西安电子科技大学 Moving target detection method based on deep full convolution neural network
CN108764137A (en) * 2018-05-29 2018-11-06 福州大学 Vehicle traveling lane localization method based on semantic segmentation

Also Published As

Publication number Publication date
CN109543754A (en) 2019-03-29

Similar Documents

Publication Publication Date Title
CN109543754B (en) Parallel method of target detection and semantic segmentation based on end-to-end deep learning
US11151403B2 (en) Method and apparatus for segmenting sky area, and convolutional neural network
CN109241913B (en) Ship detection method and system combining significance detection and deep learning
CN111274865A (en) Remote sensing image cloud detection method and device based on full convolution neural network
CN110163213B (en) Remote sensing image segmentation method based on disparity map and multi-scale depth network model
CN111353413A (en) Low-missing-report-rate defect identification method for power transmission equipment
CN113449594B (en) Multilayer network combined remote sensing image ground semantic segmentation and area calculation method
CN114943963A (en) Remote sensing image cloud and cloud shadow segmentation method based on double-branch fusion network
CN108960404B (en) Image-based crowd counting method and device
CN112232351B (en) License plate recognition system based on deep neural network
CN110349087B (en) RGB-D image high-quality grid generation method based on adaptive convolution
CN110070489A (en) Binocular image super-resolution method based on parallax attention mechanism
CN112560918B (en) Dish identification method based on improved YOLO v3
CN111291826A (en) Multi-source remote sensing image pixel-by-pixel classification method based on correlation fusion network
CN114494821B (en) Remote sensing image cloud detection method based on feature multi-scale perception and self-adaptive aggregation
CN110853049A (en) Abdominal ultrasonic image segmentation method
CN115240020A (en) MaskRCNN water seepage detection method and system based on weak light compensation
CN111353396A (en) Concrete crack segmentation method based on SCSEOCUnet
CN111611889A (en) Miniature insect pest recognition device in farmland based on improved convolutional neural network
CN112906662A (en) Method, device and equipment for detecting change of remote sensing image and storage medium
CN110782458A (en) Object image 3D semantic prediction segmentation method of asymmetric coding network
CN113409355A (en) Moving target identification system and method based on FPGA
CN110633633B (en) Remote sensing image road extraction method based on self-adaptive threshold
CN116452850A (en) Road ponding area identification method based on data mining and deep learning
CN116778346B (en) Pipeline identification method and system based on improved self-attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant