CN110941995A - Real-time target detection and semantic segmentation multi-task learning method based on lightweight network - Google Patents
- Publication number
- CN110941995A (application CN201911060977.1A)
- Authority
- CN
- China
- Prior art keywords
- module
- loss
- semantic segmentation
- target detection
- segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/588—Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
Abstract
The invention relates to a real-time target detection and semantic segmentation multi-task learning method based on a lightweight network. The network comprises a feature extraction module, a semantic segmentation module, a target detection module and a multi-scale receptive field module. The feature extraction module uses the lightweight convolutional neural network MobileNet to extract features, which are sent to the semantic segmentation module to segment the drivable road area and the alternatively drivable area, and to the target detection module to detect the objects appearing in the road scene. The multi-scale receptive field module enlarges the receptive field of the feature maps and uses convolutions of different dilation rates to address the multi-scale problem; finally, the loss functions of the semantic segmentation module and the target detection module are weighted and summed to optimize the overall model. Compared with the prior art, the method completes two common perception tasks of autonomous driving, road object detection and drivable-area segmentation, more quickly and accurately.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a real-time target detection and semantic segmentation multi-task learning method based on a lightweight network.
Background
Computer vision is becoming increasingly important in autonomous driving, largely due to the rise of deep learning techniques based on neural networks. The availability of more public datasets and more powerful hardware has spurred related research and further pushed the development of computer vision technology. Many computer vision tasks are used in autonomous vehicles, such as object detection and road segmentation, which are crucial for perceiving the driving environment. The current trend is to continuously improve the accuracy of these tasks while keeping the inference time as short as possible. A model that is accurate but slow poses great danger to the decision making of an unmanned vehicle: when a sudden incident occurs, it cannot be processed in time. The model must therefore predict fast enough to guarantee the vehicle sufficient time to make decisions. In addition, the hardware resources of an autonomous vehicle are limited, so fully utilizing them is also an important task. Moreover, objects in road scenes differ greatly in scale, and a conventional model cannot accurately perceive large and small objects at the same time, which exposes many potential problems.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a lightweight-network-based multi-task learning method for real-time target detection and semantic segmentation, which completes two common perception tasks of autonomous driving, road object detection and drivable-area segmentation, more quickly and accurately.
In order to solve the above technical problems, the invention adopts the following technical scheme: a multi-task learning method for real-time target detection and semantic segmentation based on a lightweight network, comprising a feature extraction module, a semantic segmentation module, a target detection module and a multi-scale receptive field module. The feature extraction module uses the lightweight convolutional neural network MobileNet to extract features, which are sent to the semantic segmentation module in the upper branch to segment the drivable road area and the alternatively drivable area, and to the target detection module in the lower branch to detect the objects appearing in the road scene. The multi-scale receptive field module enlarges the receptive field of the feature maps and uses convolutions of different dilation rates to address the multi-scale problem; finally, the loss functions of the semantic segmentation module and the target detection module are weighted and summed to optimize the overall model.
Further, the feature extraction module extracts features from the RGB image through the lightweight convolutional neural network MobileNet. MobileNet replaces conventional convolutions with depthwise separable convolutions to reduce the number of model parameters, which shortens prediction time and lowers the demand on hardware resources; the network is small, computationally cheap and accurate, giving it great advantages among lightweight neural networks. As feature extraction proceeds, the feature maps become smaller, their receptive fields larger, and their semantic information richer.
Further, the SSD detection algorithm is taken as the detection baseline model, and a multi-scale receptive field module is added to the target detection module. The module is composed of parallel dilated (atrous) convolutions with different dilation rates, which enlarge the receptive field at multiple scales without changing the feature-map size, thereby addressing the multi-scale problem. Dilated convolutions with rates 5 and 7 enlarge the receptive field for large objects, a dilated convolution with rate 3 enlarges it for small objects, and the outputs of these differently dilated convolution layers are finally merged together, handling well the multi-scale problem common in road scenes.
Furthermore, the features extracted by the backbone network MobileNet are sent to the semantic segmentation module in the upper branch to segment the drivable road area and the alternatively drivable area. The first two levels of feature maps are merged, which enriches semantic information while preserving the feature-map scale. A multi-scale receptive field module is likewise added to the semantic segmentation module: the second-level feature maps pass through dilated convolutions with rates 1, 3 and 6 to address the multi-scale problem, and the resulting feature maps are merged and then decoded to complete the segmentation of the road driving area.
Furthermore, in the multi-scale receptive field module added to the target detection module, dilated convolutions with rates 5 and 7 enlarge the receptive field for large objects, a dilated convolution with rate 3 enlarges it for small objects, and the convolution layers of different sizes are finally merged together.
Furthermore, in the multi-scale receptive field module added to the semantic segmentation module, dilated convolutions with rates 1, 3 and 6 are selected to address the multi-scale problem, and the feature maps are finally merged and then decoded to complete the segmentation of the road driving area.
Further, the loss function of the multi-task learning is obtained by weighted summation of the loss functions of the branches. The loss of the detection branch is the classification loss plus the regression loss: Loss_detection = Loss_classification + Loss_regression. The loss of the segmentation branch is a class-weighted cross-entropy: Loss_segmentation = weight[class] · CrossEntropyLoss(x, class). The total loss is Loss_total = Loss_detection + Loss_segmentation. The total loss is optimized by iterative training with back propagation until it converges and model training is complete. To balance the losses of the two labels, the drivable area and the alternatively drivable area, experiments showed that the best segmentation result is obtained with weight[class = alternatively drivable area] = 3.
Further, the training of the model comprises the following steps:
S1. The BDD100K dataset published by Berkeley is used as training data; the road object detection task provides 2D bounding boxes for 10 classes, and the drivable-area segmentation task has two different classes, "directly drivable" area and "alternatively drivable" area; the data are divided in an 8:1:1 ratio into corresponding training, validation and test data. BDD100K is a well-annotated dataset for road object detection, instance segmentation, drivable-area segmentation and lane marking detection.
S2. Features are extracted by the lightweight convolutional neural network MobileNet, and the parameters of the MobileNet backbone, the detection branch and the segmentation branch are trained;
S3. After every ten training iterations, one validation pass is performed on the validation set, and the model that performs best on the validation set is taken as the final model;
S4. The final model is tested on the test set; the test performance is consistent with that on the validation set.
After training is complete and testing reveals no problems, the model can be compressed and deployed on the unmanned vehicle; even uncompressed, the model is only 34 MB, which saves hardware resources well.
Compared with the prior art, the beneficial effects are:
1. The multi-task learning method, based on MobileNet with joint training of target detection and semantic segmentation, feeds the extracted features into both a detection branch and a segmentation branch, solving road object detection and drivable-area segmentation simultaneously with a single model;
2. When perceiving the road environment, object detection is relatively time-consuming. The method adopts a single-stage detector and, targeting the large size differences among objects in road scenes, selects the SSD detection method as the baseline, detecting road objects quickly and accurately;
3. Before target detection and semantic segmentation, a multi-scale receptive field module is introduced. It is composed of convolution layers of different sizes with corresponding dilated convolutions of different rates, and performs multi-scale feature fusion, which handles the multi-scale problem well, for example the inability to accurately detect at the same time objects with large scale differences, such as a pedestrian and a bus on the road;
4. In conclusion, compared with the prior art, the method completes two common perception tasks of autonomous driving, road object detection and road drivable-area segmentation, more quickly and accurately.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
FIG. 2 is a structural diagram of the multi-scale receptive field module of the present invention.
Detailed Description
The drawings are for illustration purposes only and are not to be construed as limiting the invention; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted. The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the invention.
As shown in fig. 1 and 2, a multi-task learning method for real-time target detection and semantic segmentation based on a lightweight network comprises a feature extraction module, a semantic segmentation module, a target detection module and a multi-scale receptive field module. The feature extraction module uses the lightweight convolutional neural network MobileNet to extract features, which are sent to the semantic segmentation module in the upper branch to segment the drivable road area and the alternatively drivable area, and to the target detection module in the lower branch to detect the objects appearing in the road scene. The multi-scale receptive field module enlarges the receptive field of the feature maps and uses convolutions of different dilation rates to address the multi-scale problem; finally, the loss functions of the semantic segmentation module and the target detection module are weighted and summed to optimize the overall model.
Specifically, the feature extraction module extracts features from the RGB image through the lightweight convolutional neural network MobileNet. MobileNet replaces conventional convolutions with depthwise separable convolutions to reduce the number of model parameters, which shortens prediction time and lowers the demand on hardware resources; the network is small, computationally cheap and accurate, giving it great advantages among lightweight neural networks. As feature extraction proceeds, the feature maps become smaller, their receptive fields larger, and their semantic information richer.
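For illustration, a minimal PyTorch sketch of a depthwise separable convolution block follows. It is a hedged illustration of the building unit MobileNet substitutes for standard convolution, not the patent's exact layer configuration; the channel sizes in the usage example are arbitrary.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv followed by a pointwise 1x1 conv, MobileNet style."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        # Pointwise: 1x1 convolution that mixes channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

# A standard 3x3 conv from 128 to 256 channels holds 3*3*128*256 = 294,912
# weights; the separable version holds 3*3*128 + 128*256 = 33,920, about 8.7x fewer.
x = torch.randn(1, 128, 38, 38)
print(DepthwiseSeparableConv(128, 256)(x).shape)  # torch.Size([1, 256, 38, 38])
```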
The SSD detection algorithm is taken as the detection baseline model, and a multi-scale receptive field module is added to the target detection module, as shown in fig. 2. The module is composed of parallel dilated (atrous) convolutions with different dilation rates, which enlarge the receptive field at multiple scales without changing the feature-map size, thereby addressing the multi-scale problem. Dilated convolutions with rates 5 and 7 enlarge the receptive field for large objects, a dilated convolution with rate 3 enlarges it for small objects, and the outputs of these differently dilated convolution layers are finally merged together, handling well the multi-scale problem common in road scenes.
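The multi-scale receptive field idea can be sketched as below: parallel 3x3 dilated convolutions with rates 3, 5 and 7 cover increasingly large contexts at unchanged spatial resolution, and their outputs are concatenated and fused. The branch width and the 1x1 fusion convolution are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class MultiScaleReceptiveField(nn.Module):
    """Parallel dilated convolutions whose outputs are concatenated and fused."""
    def __init__(self, in_ch: int, branch_ch: int, rates=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # padding=rate keeps HxW unchanged for a 3x3 dilated conv.
                nn.Conv2d(in_ch, branch_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.fuse = nn.Conv2d(branch_ch * len(rates), in_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

feat = torch.randn(1, 256, 19, 19)  # e.g. one SSD feature map
print(MultiScaleReceptiveField(256, 128)(feat).shape)  # torch.Size([1, 256, 19, 19])
```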
In addition, the features extracted by the backbone network MobileNet are sent to the semantic segmentation module in the upper branch to segment the drivable road area and the alternatively drivable area. As shown in fig. 1, the first two levels of feature maps are merged, which enriches semantic information while preserving the feature-map scale. A multi-scale receptive field module is likewise added to the semantic segmentation module: the second-level feature maps pass through dilated convolutions with rates 1, 3 and 6 to address the multi-scale problem, and the resulting feature maps are merged and then decoded to complete the segmentation of the road driving area.
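A sketch of such a segmentation head is given below: the deeper feature map is upsampled and merged with the shallower one, passed through parallel dilated convolutions with rates 1, 3 and 6, and decoded back to input resolution. The channel counts, the bilinear upsampling, the assumed stride of 8 and the three-class output (background, directly drivable, alternatively drivable) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    """Merge two pyramid levels, apply dilated convs (rates 1, 3, 6), decode."""
    def __init__(self, shallow_ch=256, deep_ch=512, mid_ch=128, n_classes=3):
        super().__init__()
        self.reduce = nn.Conv2d(shallow_ch + deep_ch, mid_ch, 1)
        self.dilated = nn.ModuleList([
            nn.Conv2d(mid_ch, mid_ch, 3, padding=r, dilation=r) for r in (1, 3, 6)
        ])
        self.classifier = nn.Conv2d(mid_ch * 3, n_classes, 1)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Merge the two levels at the shallower (larger) resolution.
        deep_up = F.interpolate(deep, size=shallow.shape[2:], mode="bilinear",
                                align_corners=False)
        x = F.relu(self.reduce(torch.cat([shallow, deep_up], dim=1)))
        # Parallel dilated convolutions with rates 1, 3 and 6, then concatenation.
        x = torch.cat([F.relu(d(x)) for d in self.dilated], dim=1)
        logits = self.classifier(x)
        # Decode back to input resolution (stride 8 assumed here).
        return F.interpolate(logits, scale_factor=8, mode="bilinear",
                             align_corners=False)

s, d = torch.randn(1, 256, 40, 80), torch.randn(1, 512, 20, 40)
print(SegmentationHead()(s, d).shape)  # torch.Size([1, 3, 320, 640])
```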
The loss function of the multi-task learning is the weighted sum of the loss functions of the branches. The loss of the detection branch is the classification loss plus the regression loss: Loss_detection = Loss_classification + Loss_regression. The loss of the segmentation branch is a class-weighted cross-entropy: Loss_segmentation = weight[class] · CrossEntropyLoss(x, class). The total loss is Loss_total = Loss_detection + Loss_segmentation. The total loss is optimized by iterative training with back propagation until it converges and model training is complete. To balance the losses of the two labels, the drivable area and the alternatively drivable area, experiments showed that the best segmentation result is obtained with weight[class = alternatively drivable area] = 3.
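A minimal sketch of this combined loss follows, assuming an SSD-style detection loss of cross-entropy plus smooth-L1 (omitting the hard-negative mining a full SSD loss applies) and a class-weighted cross-entropy for segmentation; the class ordering is hypothetical.

```python
import torch
import torch.nn.functional as F

def total_loss(cls_logits, cls_targets, box_preds, box_targets,
               seg_logits, seg_targets):
    # Detection branch: Loss_detection = Loss_classification + Loss_regression.
    loss_classification = F.cross_entropy(cls_logits, cls_targets)
    loss_regression = F.smooth_l1_loss(box_preds, box_targets)
    loss_detection = loss_classification + loss_regression

    # Segmentation branch: class-weighted cross-entropy; assumed class order
    # 0 = background, 1 = directly drivable, 2 = alternatively drivable.
    seg_weights = torch.tensor([1.0, 1.0, 3.0], device=seg_logits.device)
    loss_segmentation = F.cross_entropy(seg_logits, seg_targets, weight=seg_weights)

    # Loss_total = Loss_detection + Loss_segmentation.
    return loss_detection + loss_segmentation
```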
In this embodiment, the training of the model comprises the following steps:
S1. The BDD100K dataset published by Berkeley is used as training data; the road object detection task provides 2D bounding boxes for 10 classes, and the drivable-area segmentation task has two different classes, "directly drivable" area and "alternatively drivable" area; the data are divided in an 8:1:1 ratio into corresponding training, validation and test data. BDD100K is a well-annotated dataset for road object detection, instance segmentation, drivable-area segmentation and lane marking detection.
S2. Features are extracted by the lightweight convolutional neural network MobileNet, and the parameters of the MobileNet backbone, the detection branch and the segmentation branch are trained;
S3. After every ten training iterations, one validation pass is performed on the validation set, and the model that performs best on the validation set is taken as the final model (see the training-loop sketch after these steps);
S4. The final model is tested on the test set; the test performance is consistent with that on the validation set.
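A hedged sketch of the schedule in steps S2 and S3 follows, with one validation pass every ten iterations and the best validation checkpoint kept. The evaluate() helper, the data loaders and the target handling are assumptions, and mean IoU is only one plausible selection metric.

```python
import copy
import torch

def train(model, train_loader, val_loader, optimizer, criterion,
          evaluate, device="cuda"):
    """Train and keep the checkpoint that scores best on the validation set."""
    best_score, best_state, it = float("-inf"), None, 0
    model.to(device).train()
    for images, targets in train_loader:
        it += 1
        optimizer.zero_grad()
        loss = criterion(model(images.to(device)), targets)  # target handling schematic
        loss.backward()
        optimizer.step()
        if it % 10 == 0:  # one validation pass every ten training iterations
            score = evaluate(model, val_loader, device)  # e.g. mean IoU (assumed)
            if score > best_score:
                best_score = score
                best_state = copy.deepcopy(model.state_dict())
            model.train()
    return best_state
```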
After training is complete and testing reveals no problems, the model can be compressed and deployed on the unmanned vehicle; even uncompressed, the model is only 34 MB, which saves hardware resources well.
Example 1
In implementing the multi-task learning method for real-time target detection and semantic segmentation, training data, validation data and test data are first prepared and processed, the model is then trained and tested, and finally the model is deployed on the unmanned vehicle.
1) Preparation and processing of the training, validation and test data:
Step 1: divide the BDD100K dataset in an 8:1:1 ratio to obtain the corresponding training, validation and test sets;
Step 2: collect statistics on the scale of the detected objects in each image of the training set, to facilitate subsequent verification;
Step 3: apply data augmentation to the training data, including image flipping, image cropping, brightness and saturation changes, and normalization, to make full use of the data (a data-preparation sketch follows).
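The 8:1:1 split and the augmentations above might look as follows in PyTorch. The BDD100KDataset wrapper, its path and the crop size are hypothetical, and for brevity the transforms act on the image only; in a real detection/segmentation pipeline the boxes and masks must be transformed jointly.

```python
import torch
from torch.utils.data import random_split
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),                       # image flipping
    transforms.RandomCrop((640, 640)),                       # image cropping (assumed size)
    transforms.ColorJitter(brightness=0.3, saturation=0.3),  # brightness/saturation change
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],         # normalization with
                         std=[0.229, 0.224, 0.225]),         # ImageNet statistics
])

dataset = BDD100KDataset("/data/bdd100k", transform=train_tf)  # hypothetical wrapper
n = len(dataset)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n - n_train - n_val],
    generator=torch.Generator().manual_seed(0))  # fixed seed for a reproducible split
```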
2) Detailed process of model training:
Step 1: use PyTorch as the deep learning framework, pre-train MobileNet on ImageNet-1K, and select the best-performing MobileNet model as the pre-trained model;
Step 2: four Titan Xp GPUs are used for training, each with 12 GB of video memory; more GPUs allow a larger batch_size, which yields a better trained model;
Step 3: the model parameters are mainly obtained through transfer learning: the MobileNet backbone parameters are fine-tuned, while the parameters of the detection branch and the segmentation branch are randomly initialized from a Gaussian distribution and trained from scratch;
Step 4: gradient descent uses SGD with a batch_size of 28 per GPU, a weight decay of 0.0005 and a learning rate of 0.004 for 30 rounds of training; the model loss function is the weighted sum of the detection and segmentation loss functions, and repeated experiments verified that setting the segmentation loss coefficient to 3 gives the best model result (see the configuration sketch after these steps);
Step 5: select the model with the best result on the validation set as the final model; if necessary, the model can further be compressed to reduce the hardware requirement.
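The hyper-parameters of step 4 as a PyTorch configuration sketch; the momentum value and the construction of the actual model and data loaders are assumptions not stated above.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3)  # stand-in for the real multi-task network (assumed)
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.004,             # learning rate from step 4
                            momentum=0.9,         # assumed; momentum is not stated
                            weight_decay=0.0005)  # weight decay from step 4

batch_size_per_gpu = 28       # on each of the four Titan Xp GPUs (12 GB each)
epochs = 30                   # 30 rounds of training
segmentation_loss_weight = 3  # coefficient on the segmentation loss term
```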
3) The trained model is deployed on the unmanned vehicle and verified in road scenes. By debugging and observing the model's detection and segmentation metrics on each object class, the classes with poor metrics are further optimized; after debugging, detection of road objects and segmentation of the drivable and alternatively drivable areas of the road ahead can be completed through the camera.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its implementation. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively list all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.
Claims (8)
1. A multi-task learning method for real-time target detection and semantic segmentation based on a lightweight network, characterized by comprising a feature extraction module, a semantic segmentation module, a target detection module and a multi-scale receptive field module; the feature extraction module uses the lightweight convolutional neural network MobileNet to extract features, which are sent to the semantic segmentation module in the upper branch to segment the drivable road area and the alternatively drivable area, and to the target detection module in the lower branch to detect the objects appearing in the road scene; the multi-scale receptive field module enlarges the receptive field of the feature maps and uses convolutions of different dilation rates to address the multi-scale problem; finally, the loss functions of the semantic segmentation module and the target detection module are weighted and summed to optimize the overall model.
2. The multi-task learning method for real-time target detection and semantic segmentation based on a lightweight network according to claim 1, characterized in that the feature extraction module extracts features from the RGB image through the lightweight convolutional neural network MobileNet; MobileNet replaces conventional convolutions with depthwise separable convolutions to reduce the number of model parameters.
3. The multi-task learning method for real-time target detection and semantic segmentation based on a lightweight network according to claim 1, characterized in that the SSD detection algorithm is taken as the detection baseline model and a multi-scale receptive field module is added to the target detection module; the multi-scale receptive field module is composed of dilated convolutions with different dilation rates, which enlarge the receptive field at multiple scales without changing the feature-map size, to address the multi-scale problem.
4. The multi-task learning method for real-time target detection and semantic segmentation based on a lightweight network according to claim 3, characterized in that the features extracted by the backbone network MobileNet are sent to the semantic segmentation module in the upper branch to segment the drivable road area and the alternatively drivable area; the first two levels of feature maps are merged, a multi-scale receptive field module is also added to the semantic segmentation module, and dilated convolutions with different rates are applied to the second-level feature maps.
5. The method according to claim 3, characterized in that in the multi-scale receptive field module added to the target detection module, dilated convolutions with rates 5 and 7 enlarge the receptive field for large objects, a dilated convolution with rate 3 enlarges it for small objects, and the convolution layers of different sizes are finally merged together.
6. The multi-task learning method for real-time target detection and semantic segmentation based on a lightweight network according to claim 4, characterized in that in the multi-scale receptive field module added to the semantic segmentation module, dilated convolutions with rates 1, 3 and 6 are selected to address the multi-scale problem, and the feature maps are finally merged and then decoded to complete the segmentation of the road driving area.
7. The method according to any one of claims 2 to 6, characterized in that the loss function of the multi-task learning is obtained by weighted summation of the loss functions of the branches; the loss of the detection branch is the classification loss plus the regression loss, Loss_detection = Loss_classification + Loss_regression; the loss of the segmentation branch is Loss_segmentation = weight[class] · CrossEntropyLoss(x, class); the total loss is Loss_total = Loss_detection + Loss_segmentation; the total loss is optimized by iterative training with back propagation until it converges and model training is complete.
8. The multi-task learning method for real-time target detection and semantic segmentation based on a lightweight network according to claim 7, characterized in that the training of the model comprises the following steps:
S1. The BDD100K dataset published by Berkeley is used as training data; the road object detection task provides 2D bounding boxes for 10 classes, and the drivable-area segmentation task has two different classes, "directly drivable" area and "alternatively drivable" area; the data are divided in an 8:1:1 ratio into corresponding training, validation and test data;
S2. Features are extracted by the lightweight convolutional neural network MobileNet, and the parameters of the MobileNet backbone, the detection branch and the segmentation branch are trained;
S3. After every ten training iterations, one validation pass is performed on the validation set, and the model that performs best on the validation set is taken as the final model;
S4. The final model is tested on the test set; the test performance is consistent with that on the validation set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911060977.1A CN110941995A (en) | 2019-11-01 | 2019-11-01 | Real-time target detection and semantic segmentation multi-task learning method based on lightweight network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911060977.1A CN110941995A (en) | 2019-11-01 | 2019-11-01 | Real-time target detection and semantic segmentation multi-task learning method based on lightweight network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110941995A true CN110941995A (en) | 2020-03-31 |
Family
ID=69907282
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911060977.1A Pending CN110941995A (en) | 2019-11-01 | 2019-11-01 | Real-time target detection and semantic segmentation multi-task learning method based on lightweight network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110941995A (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111695494A (en) * | 2020-06-10 | 2020-09-22 | 上海理工大学 | Three-dimensional point cloud data classification method based on multi-view convolution pooling |
CN111783784A (en) * | 2020-06-30 | 2020-10-16 | 创新奇智(合肥)科技有限公司 | Method and device for detecting building cavity, electronic equipment and storage medium |
CN111797717A (en) * | 2020-06-17 | 2020-10-20 | 电子科技大学 | High-speed high-precision SAR image ship detection method |
CN111882620A (en) * | 2020-06-19 | 2020-11-03 | 江苏大学 | Road drivable area segmentation method based on multi-scale information |
CN111898439A (en) * | 2020-06-29 | 2020-11-06 | 西安交通大学 | Deep learning-based traffic scene joint target detection and semantic segmentation method |
CN112084864A (en) * | 2020-08-06 | 2020-12-15 | 中国科学院空天信息创新研究院 | Model optimization method and device, electronic equipment and storage medium |
CN112101366A (en) * | 2020-09-11 | 2020-12-18 | 湖南大学 | Real-time segmentation system and method based on hybrid expansion network |
CN112183395A (en) * | 2020-09-30 | 2021-01-05 | 深兰人工智能(深圳)有限公司 | Road scene recognition method and system based on multitask learning neural network |
CN112257794A (en) * | 2020-10-27 | 2021-01-22 | 东南大学 | YOLO-based lightweight target detection method |
CN112528982A (en) * | 2020-11-18 | 2021-03-19 | 燕山大学 | Method, device and system for detecting water gauge line of ship |
CN112634276A (en) * | 2020-12-08 | 2021-04-09 | 西安理工大学 | Lightweight semantic segmentation method based on multi-scale visual feature extraction |
CN112633086A (en) * | 2020-12-09 | 2021-04-09 | 西安电子科技大学 | Near-infrared pedestrian monitoring method, system, medium and equipment based on multitask EfficientDet |
CN112733662A (en) * | 2020-12-31 | 2021-04-30 | 上海智臻智能网络科技股份有限公司 | Feature detection method and device |
CN113486718A (en) * | 2021-06-08 | 2021-10-08 | 天津大学 | Fingertip detection method based on deep multitask learning |
CN113554156A (en) * | 2021-09-22 | 2021-10-26 | 中国海洋大学 | Multi-task learning model construction method based on attention mechanism and deformable convolution |
CN113902896A (en) * | 2021-09-24 | 2022-01-07 | 西安电子科技大学 | Infrared target detection method based on enlarged receptive field |
CN116012953A (en) * | 2023-03-22 | 2023-04-25 | 南京邮电大学 | Lightweight double-task sensing method based on CSI |
CN116612122A (en) * | 2023-07-20 | 2023-08-18 | 湖南快乐阳光互动娱乐传媒有限公司 | Image significance region detection method and device, storage medium and electronic equipment |
CN117746264A (en) * | 2023-12-07 | 2024-03-22 | 河北翔拓航空科技有限公司 | Multitasking implementation method for unmanned aerial vehicle detection and road segmentation |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709568A (en) * | 2016-12-16 | 2017-05-24 | 北京工业大学 | RGB-D image object detection and semantic segmentation method based on deep convolution network |
CN107133616A (en) * | 2017-04-02 | 2017-09-05 | 南京汇川图像视觉技术有限公司 | A kind of non-division character locating and recognition methods based on deep learning |
CN107564034A (en) * | 2017-07-27 | 2018-01-09 | 华南理工大学 | The pedestrian detection and tracking of multiple target in a kind of monitor video |
CN108875595A (en) * | 2018-05-29 | 2018-11-23 | 重庆大学 | A kind of Driving Scene object detection method merged based on deep learning and multilayer feature |
CN109145769A (en) * | 2018-08-01 | 2019-01-04 | 辽宁工业大学 | The target detection network design method of blending image segmentation feature |
CN109325534A (en) * | 2018-09-22 | 2019-02-12 | 天津大学 | A kind of semantic segmentation method based on two-way multi-Scale Pyramid |
CN109635694A (en) * | 2018-12-03 | 2019-04-16 | 广东工业大学 | A kind of pedestrian detection method, device, equipment and computer readable storage medium |
CN109685017A (en) * | 2018-12-26 | 2019-04-26 | 中山大学 | A kind of ultrahigh speed real-time target detection system and detection method based on light weight neural network |
CN109741318A (en) * | 2018-12-30 | 2019-05-10 | 北京工业大学 | The real-time detection method of single phase multiple dimensioned specific objective based on effective receptive field |
CN110222593A (en) * | 2019-05-18 | 2019-09-10 | 四川弘和通讯有限公司 | A kind of vehicle real-time detection method based on small-scale neural network |
- 2019-11-01: Application CN201911060977.1A filed in China (CN); publication CN110941995A, status Pending.
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709568A (en) * | 2016-12-16 | 2017-05-24 | 北京工业大学 | RGB-D image object detection and semantic segmentation method based on deep convolution network |
CN107133616A (en) * | 2017-04-02 | 2017-09-05 | 南京汇川图像视觉技术有限公司 | A kind of non-division character locating and recognition methods based on deep learning |
CN107564034A (en) * | 2017-07-27 | 2018-01-09 | 华南理工大学 | The pedestrian detection and tracking of multiple target in a kind of monitor video |
CN108875595A (en) * | 2018-05-29 | 2018-11-23 | 重庆大学 | A kind of Driving Scene object detection method merged based on deep learning and multilayer feature |
CN109145769A (en) * | 2018-08-01 | 2019-01-04 | 辽宁工业大学 | The target detection network design method of blending image segmentation feature |
CN109325534A (en) * | 2018-09-22 | 2019-02-12 | 天津大学 | A kind of semantic segmentation method based on two-way multi-Scale Pyramid |
CN109635694A (en) * | 2018-12-03 | 2019-04-16 | 广东工业大学 | A kind of pedestrian detection method, device, equipment and computer readable storage medium |
CN109685017A (en) * | 2018-12-26 | 2019-04-26 | 中山大学 | A kind of ultrahigh speed real-time target detection system and detection method based on light weight neural network |
CN109741318A (en) * | 2018-12-30 | 2019-05-10 | 北京工业大学 | The real-time detection method of single phase multiple dimensioned specific objective based on effective receptive field |
CN110222593A (en) * | 2019-05-18 | 2019-09-10 | 四川弘和通讯有限公司 | A kind of vehicle real-time detection method based on small-scale neural network |
Non-Patent Citations (1)
Title |
---|
BAI Jie et al., "Traffic scene understanding using image semantic segmentation with lightweight convolutional neural networks", Journal of Automotive Safety and Energy * |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111695494A (en) * | 2020-06-10 | 2020-09-22 | 上海理工大学 | Three-dimensional point cloud data classification method based on multi-view convolution pooling |
CN111797717A (en) * | 2020-06-17 | 2020-10-20 | 电子科技大学 | High-speed high-precision SAR image ship detection method |
CN111797717B (en) * | 2020-06-17 | 2022-03-15 | 电子科技大学 | High-speed high-precision SAR image ship detection method |
CN111882620B (en) * | 2020-06-19 | 2024-08-02 | 江苏大学 | Road drivable area segmentation method based on multi-scale information |
CN111882620A (en) * | 2020-06-19 | 2020-11-03 | 江苏大学 | Road drivable area segmentation method based on multi-scale information |
CN111898439A (en) * | 2020-06-29 | 2020-11-06 | 西安交通大学 | Deep learning-based traffic scene joint target detection and semantic segmentation method |
CN111783784A (en) * | 2020-06-30 | 2020-10-16 | 创新奇智(合肥)科技有限公司 | Method and device for detecting building cavity, electronic equipment and storage medium |
CN112084864A (en) * | 2020-08-06 | 2020-12-15 | 中国科学院空天信息创新研究院 | Model optimization method and device, electronic equipment and storage medium |
CN112101366A (en) * | 2020-09-11 | 2020-12-18 | 湖南大学 | Real-time segmentation system and method based on hybrid expansion network |
CN112183395A (en) * | 2020-09-30 | 2021-01-05 | 深兰人工智能(深圳)有限公司 | Road scene recognition method and system based on multitask learning neural network |
CN112257794A (en) * | 2020-10-27 | 2021-01-22 | 东南大学 | YOLO-based lightweight target detection method |
CN112528982A (en) * | 2020-11-18 | 2021-03-19 | 燕山大学 | Method, device and system for detecting water gauge line of ship |
CN112634276A (en) * | 2020-12-08 | 2021-04-09 | 西安理工大学 | Lightweight semantic segmentation method based on multi-scale visual feature extraction |
CN112634276B (en) * | 2020-12-08 | 2023-04-07 | 西安理工大学 | Lightweight semantic segmentation method based on multi-scale visual feature extraction |
CN112633086A (en) * | 2020-12-09 | 2021-04-09 | 西安电子科技大学 | Near-infrared pedestrian monitoring method, system, medium and equipment based on multitask EfficientDet |
CN112633086B (en) * | 2020-12-09 | 2024-01-26 | 西安电子科技大学 | Near-infrared pedestrian monitoring method, system, medium and equipment based on multitasking EfficientDet |
CN112733662A (en) * | 2020-12-31 | 2021-04-30 | 上海智臻智能网络科技股份有限公司 | Feature detection method and device |
CN113486718A (en) * | 2021-06-08 | 2021-10-08 | 天津大学 | Fingertip detection method based on deep multitask learning |
CN113554156A (en) * | 2021-09-22 | 2021-10-26 | 中国海洋大学 | Multi-task learning model construction method based on attention mechanism and deformable convolution |
CN113902896A (en) * | 2021-09-24 | 2022-01-07 | 西安电子科技大学 | Infrared target detection method based on enlarged receptive field |
CN116012953A (en) * | 2023-03-22 | 2023-04-25 | 南京邮电大学 | Lightweight double-task sensing method based on CSI |
CN116612122A (en) * | 2023-07-20 | 2023-08-18 | 湖南快乐阳光互动娱乐传媒有限公司 | Image significance region detection method and device, storage medium and electronic equipment |
CN116612122B (en) * | 2023-07-20 | 2023-10-10 | 湖南快乐阳光互动娱乐传媒有限公司 | Image significance region detection method and device, storage medium and electronic equipment |
CN117746264A (en) * | 2023-12-07 | 2024-03-22 | 河北翔拓航空科技有限公司 | Multitasking implementation method for unmanned aerial vehicle detection and road segmentation |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200331 |