CN111104903B - Depth perception traffic scene multi-target detection method and system - Google Patents

Depth perception traffic scene multi-target detection method and system

Info

Publication number
CN111104903B
CN111104903B (application CN201911317498.3A)
Authority
CN
China
Prior art keywords
neural network
layer
target
network layer
picture
Prior art date
Legal status
Active
Application number
CN201911317498.3A
Other languages
Chinese (zh)
Other versions
CN111104903A (en)
Inventor
张登银
彭巧
孙誉焯
周超
刘子捷
Current Assignee
China Austria Internet Of Things Technology Nanjing Co ltd
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201911317498.3A
Publication of CN111104903A
Application granted
Publication of CN111104903B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/54 Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats

Abstract

The invention discloses a depth-aware traffic scene multi-target detection method and system. A picture to be detected is input into a pre-trained Mask R-CNN model to identify the category and target position of first-type targets; the recognized picture is then input into a pre-trained optimized CNN model, which detects the category, confidence and target position of second-type targets in the picture. Taking full account of the complexity of traffic scenes and the severe missed detection of small targets in existing target tracking algorithms, the invention proposes an optimized CNN model: building on the advantages of the original CNN, the feature extraction network and the detection network are optimized and trained into a new model for small-target detection. Performing small-target detection on top of the large-target detection result strengthens multi-target detection in traffic scenes and improves the accuracy of small-target recognition.

Description

Depth perception traffic scene multi-target detection method and system
Technical Field
The invention relates to a depth perception traffic scene multi-target detection method and system, and belongs to the technical field of video image processing.
Background
Vision-based traffic scene perception (TSP) is one of many emerging areas in intelligent transportation systems and has been widely studied over the past decade. TSP aims to extract accurate road information in real time, and for the various objects of interest contained in an image it generally involves three phases: detection, recognition and tracking. Since tracking generally relies on the results of detection and recognition, the ability to detect and recognize objects effectively plays a crucial role in TSP; this has also long been a classical problem in recognizing multiple target objects in images or videos.
Beyond traditional image processing techniques, the CNN is a powerful and efficient approach to common image classification, recognition and detection tasks, and has spawned a number of excellent models and ideas. The early OverFeat applied a sliding-window scheme over ConvNet features for classification, localization and detection, and Ross Girshick proposed Region-CNN (R-CNN), which classifies region proposals with a deep ConvNet. To address its cost in computation time and memory, he then introduced an ROI pooling layer in the Fast region-based convolutional network (Fast R-CNN) to improve speed and detection accuracy. The more efficient Faster R-CNN was later proposed on this basis, directly introducing a new region proposal network to obtain candidate regions. Mask R-CNN, prototyped on Faster R-CNN, adds a branch for the segmentation task. The architectures of this series of models share several traits: each has a CNN backbone derived from a basic CNN, and each adds some extra purpose-built layers, such as the ROI pooling and RPN layers, that can effectively process the feature maps of the backbone CNN.
As a typical deep learning model, the CNN achieves excellent performance in object detection thanks to its strong feature extraction capability; however, for some important small visual objects, such as license plates and passengers inside vehicles, labels and information are insufficient, which increases the difficulty of traffic scene information acquisition and of deep learning development.
Disclosure of Invention
The invention aims to solve the prior-art problem that some important small visual objects, such as license plates and passengers inside vehicles, have insufficient labels and information, and provides a depth-aware traffic scene multi-target detection method and system.
The invention adopts the following technical scheme:
a multi-target detection method for traffic scene perception comprises the following steps:
inputting a picture to be detected into a Mask R-CNN model which is trained in advance, and extracting the category and the target position of a first type of target;
and (4) pre-training the recognized picture input value to the optimized CNN model, and detecting the class, the confidence coefficient and the target position of the second class of targets in the picture.
Further, the optimized CNN model comprises a feature extraction network and an object detection network, wherein the feature extraction network extracts features from the input picture to obtain a feature map, and the object detection network detects the picture to be detected and outputs the category, confidence and target position of the second-type targets in the picture.
Further preferably, the feature extraction network structure comprises 8 layers; from layer 1 to layer 8 these are a first convolutional neural network layer, a first max pooling layer, a second convolutional neural network layer, a third convolutional neural network layer, a second max pooling layer, a fourth convolutional neural network layer, a fifth convolutional neural network layer and a third max pooling layer;
the object detection network comprises three layers: the first layer is a sixth convolutional neural network layer; the second layer consists of two parallel convolutional neural network layers, a seventh and an eighth neural network layer, both connected to the sixth neural network layer; the third layer consists of a ninth and a tenth neural network layer, connected to the seventh and the eighth neural network layer respectively, where the ninth neural network layer outputs the confidence and target position of a target and the tenth neural network layer outputs the category of the target. Preferably, the first convolutional neural network layer is a normalization layer.
On the basis of the above technical solution, it is further preferable that the first convolutional neural network layer uses an 11 × 11 kernel; acting first on the input image, it retains low-level but rich details. The second and third, and the fourth and fifth, convolutional neural network layers are 3 × 3 convolutional layers; decomposing a larger kernel into two 3 × 3 convolutional layers introduces fewer parameters, which helps reduce overfitting and expresses stronger functions with fewer parameters, and is followed by batch normalization. The role of the max pooling layer is to compute the maximum value in each identified n × n region to downsample the image; this helps simplify the network's computational complexity, compress the input feature map and extract the main features.
Further, the seventh neural network layer and the ninth neural network layer are convolutional layers with a 1 × 1 kernel.
In this technical scheme, the feature extraction network is designed as a network integrating different convolutional layers, a local normalization layer and max pooling layers, acquiring as many detailed features of the target as possible to obtain a feature map of the image to be detected. The feature map is input into the detection network, which takes the pixel-level target features from the feature map, classifies and localizes the targets in the image element by element, generates predicted object boundaries, and outputs the difference between each predicted bounding box and the ground truth.
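To make the architecture concrete, below is a minimal PyTorch sketch of the optimized CNN under stated assumptions: the text fixes the layer order, the 11 × 11 kernel of Conv1, the 3 × 3 pairs, the 1 × 1 kernels of the seventh and ninth layers, and the two-branch detection head, but the channel widths, strides and activation functions here are illustrative choices, since Table 1 with the exact structural parameters is only reproduced as an image.

```python
# A minimal sketch, not the authors' exact network: channel widths,
# strides and normalization placement are assumptions.
import torch
import torch.nn as nn

class OptimizedCNN(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        # Feature extraction network, 8 layers: Conv1, Pool1, Conv2, Conv3,
        # Pool2, Conv4, Conv5, Pool3. A large 11x11 kernel acts first to
        # retain low-level but rich detail; each 3x3 pair ends in batch norm.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=2, padding=5),   # Conv1
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),                             # Pool1
            nn.Conv2d(64, 128, kernel_size=3, padding=1),            # Conv2
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),           # Conv3
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),                             # Pool2
            nn.Conv2d(128, 256, kernel_size=3, padding=1),           # Conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),           # Conv5
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),                             # Pool3
        )
        # Object detection network: Conv6, then two parallel branches.
        self.conv6 = nn.Sequential(
            nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        # Branch A (Conv7 -> Conv9), 1x1 kernels: confidence + box position.
        self.conv7 = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=1), nn.ReLU(inplace=True))
        self.conv9 = nn.Conv2d(512, 1 + 4, kernel_size=1)  # conf, (x, y, w, h)
        # Branch B (Conv8 -> Conv10): per-pixel class scores.
        self.conv8 = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.conv10 = nn.Conv2d(512, num_classes, kernel_size=1)

    def forward(self, x):
        f = self.conv6(self.features(x))
        conf_bbox = self.conv9(self.conv7(f))  # "Output_bbox"-style head
        cls_map = self.conv10(self.conv8(f))   # "Output_type"-style head
        return conf_bbox, cls_map
```

A forward pass on a dummy input, e.g. OptimizedCNN()(torch.randn(1, 3, 224, 224)), returns the confidence/position map and the per-pixel class map.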
In another aspect, the invention provides a depth-aware traffic scene multi-target detection system, comprising:
a Mask R-CNN model, which takes the picture to be detected as input and identifies the category and target position of first-type targets;
and an optimized CNN model, which takes the picture recognized by the Mask R-CNN model as input and detects the category, confidence and target position of second-type targets in the picture.
Further, the optimized CNN model comprises a feature extraction network and an object detection network, wherein the feature extraction network extracts features from the input picture to obtain a feature map, and the object detection network detects the picture to be detected and outputs the category, confidence and target position of the second-type targets in the picture.
The invention achieves the following beneficial technical effects:
Firstly, the invention adopts Mask R-CNN to detect large target objects, obtaining the large targets that can be clearly detected in each picture. A Mask R-CNN network is selected because it can not only detect objects but also segment them from the input image; however, the invention keeps only the larger, clearly segmented objects, because objects that are small or unclear would be identified incorrectly;
Secondly, the invention employs an optimized feature extractor and detector for small object detection. The core of the feature extractor is a network integrating different convolutional layers, a local normalization layer and max pooling layers, aiming to acquire as many detailed features of small targets as possible; the core of the detector is the use of 1 × 1 convolution kernels in place of ordinary fully connected layers. Because such 1 × 1 convolution kernels have local receptive fields, they can slide over a larger input image to obtain multiple outputs, regardless of the input image size. This conversion improves the efficiency of the neural network's forward propagation, enhances the learning capability of the CNN, and saves a large amount of time.
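As a quick illustration of this sliding property, the following sketch (with hypothetical channel and class counts) shows that a 1 × 1 convolutional head accepts feature maps of different spatial sizes with the same weights, which a fully connected layer cannot:

```python
# Sketch: a 1x1 convolutional head slides over feature maps of any
# spatial size, producing one prediction per location.
import torch
import torch.nn as nn

head = nn.Conv2d(256, 4, kernel_size=1)  # e.g. 4 class scores per location

small = torch.randn(1, 256, 7, 7)
large = torch.randn(1, 256, 14, 14)
print(head(small).shape)  # torch.Size([1, 4, 7, 7])
print(head(large).shape)  # torch.Size([1, 4, 14, 14]) -- same weights, more outputs
```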
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a diagram of the Mask R-CNN model architecture employed in an embodiment of the present invention;
FIG. 3 is a training flow diagram of the optimized CNN algorithm for small target detection in an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The method combines Mask R-CNN with an improved CNN-based optimized model and is suitable for multi-target detection. The feature extraction network of the optimized small-target detection part learns a large amount of information from the fine-grained details of lower layers, enriching the representation of small targets by reasonably increasing the feature map size and by downsampling. The detection part is trained to make full use of convolutional layers, so the input features of various targets can be well classified and detected. The invention further optimizes this dedicated detector in a fully convolutional manner and applies deep learning to the traffic scene. The overall architecture of the optimized CNN detector consists of a feature extractor and a detector: the feature extractor is composed of different convolutional layers, max pooling layers and a local normalization layer, while the detector mainly handles the classification task, via "Softmax with loss", and the bounding box regression task for localization. A standard fully connected layer typically introduces a large number of parameters, so replacing the fully connected layer with a 1 × 1 convolution kernel is effective and advantageous in reducing the amount of computation.
Fig. 1 is the flowchart of the method according to an embodiment of the present invention. The format of the PASCAL VOC data set and its evaluation tools are adopted, four types of target objects are selected (vehicles, people, traffic signs and license plates), and format conversion is performed to generate a training set. Pictures to be tested are collected from real life, and a test set is produced by the same data set generation method.
The practice of the present invention is further illustrated below with reference to fig. 1 and the examples. Fig. 1 shows the depth-aware traffic scene multi-target detection method provided by this embodiment, which includes:
s1, adopting the format of the PASCAL VOC data set and an evaluation algorithm tool. First, the category of KITTI is switched: the PASCAL VOC has 20 categories in total, in an urban traffic scene, the key detection objects are four types, namely vehicles, people, traffic signs and license plates, so that the data set is divided into the 4 categories; secondly, converting the labeling information: converting the tagged file from txt to xml, removing other information in the tag, and only leaving four types of vehicles, people, traffic signs and license plates; finally, a required training set is generated. Similarly, the pictures to be tested in the real life are collected, and the test set of the embodiment is generated according to the method.
S2, for the large targets contained in the image, this embodiment inputs the training set into the original Mask R-CNN network for training and generates a network model, as shown in FIG. 2. A network such as Mask R-CNN is selected because it can not only detect objects but also segment them from the input image; however, only the large, clearly segmented objects produced by Mask R-CNN, i.e. the first-type objects, are retained, since small or unclear objects would be identified incorrectly. It should be noted that the Mask R-CNN network is prior art, and its construction and training are common knowledge in the field, so they are not described here.
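The following sketch illustrates this first stage, with the off-the-shelf torchvision Mask R-CNN standing in for the trained model; the score and area thresholds that decide what counts as "large and clear" are assumptions, as the text gives no numeric values:

```python
# Sketch of stage one: detect and keep only large, clearly detected
# objects. Thresholds are assumptions, not values from the patent.
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_large_targets(image: torch.Tensor, min_area: float = 32 * 32,
                         min_score: float = 0.7):
    """image: float tensor (3, H, W) in [0, 1]."""
    with torch.no_grad():
        out = model([image])[0]
    keep = []
    for box, label, score in zip(out["boxes"], out["labels"], out["scores"]):
        w, h = box[2] - box[0], box[3] - box[1]
        if score >= min_score and w * h >= min_area:  # large and clear only
            keep.append((label.item(), score.item(), box.tolist()))
    return keep
```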
S3, for small-sized objects with insufficient labels and information, this embodiment inputs the training set in PASCAL VOC format into the network architecture of the optimized CNN detector designed in this embodiment for training, generating a network model. The network structure is divided into two parts: a feature extraction network and a detection network.
(1) For the feature extraction network part, this embodiment uses a network integrating different convolutional layers, local normalization layers and max pooling layers, as shown in fig. 3.
Fig. 3 shows that the feature extraction network structure includes 8 layers, from layer 1 to layer 8, which are a first convolutional neural network layer, a first maximum pooling layer, a second convolutional neural network layer, a third convolutional neural network layer, a second maximum pooling layer, a fourth convolutional neural network layer, a fifth convolutional neural network layer, and a third maximum pooling layer, respectively;
preferably, the first convolutional neural network layer is a normalization layer.
The multiple convolutional layers with nonlinear activation functions help enhance nonlinear expressive power; compared with a single convolutional layer, they can correctly handle multiple targets in the image and acquire as many detailed target features as possible. The network gradually deepens from Conv1, the representation of small targets in the image, i.e. the second-type targets (such as license plates and passengers inside vehicles), is expressed at smaller scales, and the output is the feature map that the following detection part of this embodiment takes as input. A large kernel (11 × 11) in Conv1 acts first on the input image to preserve low-level but rich details. The resulting features are then passed through 3 × 3 convolutional layers: as shown in fig. 3, two 5 × 5 convolutional layers are decomposed into the two 3 × 3 pairs Conv2/Conv3 and Conv4/Conv5. The advantage of replacing the 5 × 5 kernel of VGG Net with two smaller consecutive 3 × 3 convolutional layers is twofold: first, the multi-layer structure with nonlinear functions helps enhance nonlinear expressive power and can extract deeper features than a single 5 × 5 convolutional layer; second, the decomposition into two 3 × 3 convolutional layers introduces fewer parameters. Assuming the convolutional layers have C input channels and D output channels, a single 5 × 5 kernel has 5 × 5 × C × D = 25 × C × D parameters, while the two combined 3 × 3 convolutional layers have only 2 × (3 × 3 × C × D) = 18 × C × D, reducing the parameters by a factor of 25/18 ≈ 1.4. Fewer parameters help reduce overfitting and allow stronger functions to be expressed.
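The parameter arithmetic can be checked directly, for any illustrative channel counts:

```python
# Quick check of the parameter counts above, with C input channels
# and D output channels (values here are illustrative).
C, D = 64, 64
single_5x5 = 5 * 5 * C * D        # 25*C*D = 102400
two_3x3 = 2 * (3 * 3 * C * D)     # 18*C*D = 73728
print(single_5x5 / two_3x3)       # 25/18 ≈ 1.39
```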
The role of the max pooling layer is to compute the maximum value in each identified n × n region to enable image downsampling. It helps to simplify the network computational complexity, compress the input feature map and extract the main features.
(2) The detection network part completes the classification and localization tasks. It is divided into two branches, denoted "Output_type" and "Output_bbox" respectively.
The "Output_type" branch classifies objects at the pixel level. Here the fully connected layers of a traditional network (such as VGG Net) are replaced with two convolutional layers, Conv7 and Conv9. Thus the output of the transformed network (excluding the softmax layer) is no longer a category but a heatmap. The next step is element-by-element classification prediction: the maximum numerical probability for each pixel position across the 1000 heatmaps is computed pixel by pixel and regarded as that pixel's class, as sketched below. Finally, a "Softmax with loss" layer is used to compute the loss function for this task.
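A sketch of this element-by-element prediction, assuming the heatmap has one channel per class (1000 here, matching the text) and an illustrative spatial size:

```python
# Sketch: per-pixel argmax over the heatmap channels produced by a
# fully convolutional classification branch. Sizes are illustrative.
import torch

heatmap = torch.randn(1, 1000, 32, 32)   # one score map per class
probs = torch.softmax(heatmap, dim=1)
pixel_class = probs.argmax(dim=1)        # (1, 32, 32): class index per pixel
```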
The "Output_bbox" branch implements target localization and is composed of similar fully convolutional layers. It predicts object boundaries and outputs the difference between the predicted bounding box (x_min, y_min, w, h) and the ground truth.
S4, input the test set (the pictures to be detected) into the trained Mask R-CNN model to detect the category, confidence and target position of the large targets in the images, and save the pictures in which large targets are identified as a new test set; then input the new test set into the trained optimized CNN model to detect the category, confidence and target position of the small targets in the images.
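Putting the two stages together, a minimal sketch of this test procedure might look as follows, with detect_large_targets and OptimizedCNN from the earlier sketches (both hypothetical) filling the two roles:

```python
# Sketch of the two-stage test procedure: only pictures in which a
# large target was identified form the new test set for the
# small-target detector.
import torch

def two_stage_detection(test_set, large_detector, small_model):
    results = []
    for image in test_set:                 # image: float tensor (3, H, W)
        large = large_detector(image)      # [(label, score, box), ...]
        if not large:                      # no clear large target found,
            continue                       # picture is not re-examined
        with torch.no_grad():
            conf_bbox, cls_map = small_model(image.unsqueeze(0))
        results.append({"image": image, "large_targets": large,
                        "small_targets": (conf_bbox, cls_map)})
    return results
```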
The optimized CNN model structure parameters in this example are shown in table 1.
TABLE 1 Optimized CNN model structural parameters (reproduced as an image in the original publication)
Another embodiment provides a depth-aware traffic scene multi-target detection system, including:
the Mask R-CNN model is used to take the picture to be detected as input and identify the category and target position of first-type targets;
and the optimized CNN model is used to detect, in the picture recognized by the Mask R-CNN model, the category, confidence and target position of second-type targets.
On the basis of the above embodiment, further, the optimized CNN model comprises a feature extraction network and an object detection network, wherein the feature extraction network extracts features from the input picture to obtain a feature map, and the object detection network detects the picture to be detected and outputs the category, confidence and target position of the second-type targets in the picture.
The specific implementation manners of the Mask R-CNN model and the optimized CNN model in this embodiment are the same as those in the above embodiment, and will not be described again.
The invention divides multi-target detection in a traffic scene into large-target detection and small-target detection. The first part, directed at large targets (vehicles, traffic signs and pedestrians), adopts a Mask R-CNN model to identify and segment the target objects in the input image. The second part, directed at small targets (license plates and passengers inside vehicles), proposes an optimized CNN model: based on the advantages of the original CNN network, the feature extraction network and the detection network are optimized and trained to generate a new model for small-target detection. Performing small-target detection on top of the large-target detection result strengthens multi-target detection in traffic scenes, improves the accuracy of small-target recognition, and provides a well-performing model for multi-target detection in actual traffic scenes.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (5)

1. A depth perception traffic scene multi-target detection method is characterized by comprising the following steps:
inputting a picture to be detected into a Mask R-CNN model which is trained in advance to identify the category and the target position of a first type of target;
inputting the recognized picture into an optimized CNN model which is trained in advance, and detecting the category, confidence and target position of a second type of target in the picture;
the optimized CNN model comprises a feature extraction network and an object detection network, wherein the feature extraction network is used to extract features from the input picture to obtain a feature map; the object detection network detects the picture to be detected and outputs the category, confidence and target position of the second-type targets in the picture;
the object detection network comprises three layers: the first layer is a sixth convolutional neural network layer; the second layer consists of two parallel convolutional neural network layers, a seventh and an eighth neural network layer, both connected to the sixth neural network layer; the third layer consists of a ninth and a tenth neural network layer, connected to the seventh and the eighth neural network layer respectively, where the ninth neural network layer outputs the confidence and target position of a target and the tenth neural network layer outputs the category of the target.
2. The method as claimed in claim 1, wherein the feature extraction network structure includes 8 layers, and from layer 1 to layer 8, there are a first convolutional neural network layer, a first maximum pooling layer, a second convolutional neural network layer, a third convolutional neural network layer, a second maximum pooling layer, a fourth convolutional neural network layer, a fifth convolutional neural network layer and a third maximum pooling layer, respectively.
3. The method for multi-target detection in the deep perception traffic scene as claimed in claim 2, wherein the first convolutional neural network layer is a normalization layer.
4. A depth perception traffic scene multi-target detection system is characterized in that,
the Mask R-CNN model is used to take the picture to be detected as input and identify the category and target position of first-type targets;
the optimized CNN model is used to detect, in the picture recognized by the Mask R-CNN model, the category, confidence and target position of second-type targets;
the optimized CNN model comprises a feature extraction network and an object detection network, wherein the feature extraction network is used to extract features from the input picture to obtain a feature map; the object detection network detects the picture to be detected and outputs the category, confidence and target position of the second-type targets in the picture;
the object detection network comprises three layers: the first layer is a sixth convolutional neural network layer; the second layer consists of two parallel convolutional neural network layers, a seventh and an eighth neural network layer, both connected to the sixth neural network layer; the third layer consists of a ninth and a tenth neural network layer, connected to the seventh and the eighth neural network layer respectively, where the ninth neural network layer outputs the confidence and target position of a target and the tenth neural network layer outputs the category of the target.
5. The system for multi-target detection in a depth-aware traffic scene as claimed in claim 4, wherein the feature extraction network is configured to extract features from the input picture to obtain a feature map, and the object detection network detects the picture to be detected and outputs the category, confidence and target position of the second-type targets in the picture.
CN201911317498.3A 2019-12-19 2019-12-19 Depth perception traffic scene multi-target detection method and system Active CN111104903B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911317498.3A CN111104903B (en) 2019-12-19 2019-12-19 Depth perception traffic scene multi-target detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911317498.3A CN111104903B (en) 2019-12-19 2019-12-19 Depth perception traffic scene multi-target detection method and system

Publications (2)

Publication Number Publication Date
CN111104903A CN111104903A (en) 2020-05-05
CN111104903B (en) 2022-07-26

Family

ID=70422517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911317498.3A Active CN111104903B (en) 2019-12-19 2019-12-19 Depth perception traffic scene multi-target detection method and system

Country Status (1)

Country Link
CN (1) CN111104903B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611947B (en) * 2020-05-25 2024-04-09 济南博观智能科技有限公司 License plate detection method, device, equipment and medium
CN113743398B (en) * 2020-05-29 2023-11-17 富泰华工业(深圳)有限公司 Image identification method, device, computer device and storage medium
CN111694973B (en) * 2020-06-09 2023-10-13 阿波罗智能技术(北京)有限公司 Model training method and device for automatic driving scene and electronic equipment
CN111723723A (en) * 2020-06-16 2020-09-29 东软睿驰汽车技术(沈阳)有限公司 Image detection method and device
CN112766364A (en) * 2021-01-18 2021-05-07 南京信息工程大学 Tomato leaf disease classification method for improving VGG19
CN113191273A (en) * 2021-04-30 2021-07-30 西安聚全网络科技有限公司 Oil field well site video target detection and identification method and system based on neural network
CN113191274A (en) * 2021-04-30 2021-07-30 西安聚全网络科技有限公司 Oil field video intelligent safety event detection method and system based on neural network
CN113469272B (en) * 2021-07-20 2023-05-19 东北财经大学 Target detection method for hotel scene picture based on fast R-CNN-FFS model
CN113393410A (en) * 2021-07-26 2021-09-14 浙江大华技术股份有限公司 Image fusion method and device, electronic equipment and storage medium
CN114022705B (en) * 2021-10-29 2023-08-04 电子科技大学 Self-adaptive target detection method based on scene complexity pre-classification
CN114742204A (en) * 2022-04-08 2022-07-12 黑龙江惠达科技发展有限公司 Method and device for detecting straw coverage rate
CN115359301A (en) * 2022-09-06 2022-11-18 上海寻序人工智能科技有限公司 Data mining method based on cloud platform

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977812B (en) * 2019-03-12 2023-02-24 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN110516670B (en) * 2019-08-26 2022-04-22 广西师范大学 Target detection method based on scene level and area suggestion self-attention module

Also Published As

Publication number Publication date
CN111104903A (en) 2020-05-05

Similar Documents

Publication Publication Date Title
CN111104903B (en) Depth perception traffic scene multi-target detection method and system
Dai et al. TIRNet: Object detection in thermal infrared images for autonomous driving
Azimi et al. Aerial LaneNet: Lane-marking semantic segmentation in aerial imagery using wavelet-enhanced cost-sensitive symmetric fully convolutional neural networks
CN109840521B (en) Integrated license plate recognition method based on deep learning
Kaur et al. A comprehensive review of object detection with deep learning
CN107316016A (en) A kind of track of vehicle statistical method based on Hadoop and monitoring video flow
CN111767878B (en) Deep learning-based traffic sign detection method and system in embedded device
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
Alvarez et al. Road geometry classification by adaptive shape models
Xiang et al. Lightweight fully convolutional network for license plate detection
CN111462140B (en) Real-time image instance segmentation method based on block stitching
CN110705412A (en) Video target detection method based on motion history image
CN111126401B (en) License plate character recognition method based on context information
CN112861931B (en) Multi-level change detection method, system, medium and electronic device based on difference attention neural network
Gad et al. Real-time lane instance segmentation using segnet and image processing
Yun et al. Part-level convolutional neural networks for pedestrian detection using saliency and boundary box alignment
Zhang et al. A front vehicle detection algorithm for intelligent vehicle based on improved gabor filter and SVM
CN115147450B (en) Moving target detection method and detection device based on motion frame difference image
CN112446292B (en) 2D image salient object detection method and system
Xu et al. SPNet: Superpixel pyramid network for scene parsing
Nataprawira et al. Pedestrian Detection on Multispectral Images in Different Lighting Conditions
CN113221604A (en) Target identification method and device, storage medium and electronic equipment
Chanawangsa et al. A new color-based lane detection via Gaussian radial basis function networks
Nataprawira et al. Pedestrian Detection in Different Lighting Conditions Using Deep Neural Networks.
Saranya et al. The Proficient ML method for Vehicle Detection and Recognition in Video Sequence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221221

Address after: Room 802-5, 8th floor, building A1, Huizhi Science Park, Hengtai Road, Nanjing Economic and Technological Development Zone, Jiangsu 210000

Patentee after: China Austria Internet of things technology (Nanjing) Co.,Ltd.

Address before: 210023 9 Wen Yuan Road, Qixia District, Nanjing, Jiangsu.

Patentee before: NANJING University OF POSTS AND TELECOMMUNICATIONS