CN107301383B - Road traffic sign identification method based on Fast R-CNN - Google Patents

Road traffic sign identification method based on Fast R-CNN

Info

Publication number
CN107301383B
CN107301383B CN201710421849.XA CN201710421849A CN107301383B
Authority
CN
China
Prior art keywords
layer
layers
fast
training
pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710421849.XA
Other languages
Chinese (zh)
Other versions
CN107301383A (en)
Inventor
刘兰馨
李巍华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201710421849.XA priority Critical patent/CN107301383B/en
Publication of CN107301383A publication Critical patent/CN107301383A/en
Application granted granted Critical
Publication of CN107301383B publication Critical patent/CN107301383B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of traffic signs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a road traffic sign identification method based on Fast R-CNN, which comprises the following steps: acquiring and preprocessing images to create a sample set; inputting the training set and performing multi-task training of a Fast R-CNN network; passing the picture to be recognized through several convolutional and pooling layers to obtain a feature map; obtaining the feature box corresponding to each candidate box and, through the ROI pooling layer and the fully connected layers, producing two output vectors, namely the classification scores and the window regression parameters; and applying non-maximum suppression to all results to generate the final target detection and recognition result, thereby recognizing the traffic sign. The invention adopts the Fast R-CNN deep learning method, avoids the redundant feature extraction of the region-based convolutional neural network (R-CNN), realizes multi-task training, requires no additional feature storage space, and improves detection speed and accuracy. Compared with shallow-learning classifiers, the method achieves higher learning efficiency and recognition accuracy.

Description

Road traffic sign identification method based on Fast R-CNN
Technical Field
The invention belongs to the fields of image processing and automotive driver-assistance safety, and particularly relates to a road traffic sign detection and identification method based on Fast R-CNN, aimed at the low recognition accuracy of existing road traffic sign recognition methods.
Background
Traffic Sign Recognition (TSR) remains an open problem and is an important branch of vehicle-mounted driver-assistance systems. Because traffic signs carry much important traffic information, such as the speed limit for the current road, changes in the road conditions ahead, and restrictions on driver behaviour, recognizing traffic signs in the road quickly, accurately and effectively and feeding the result back to the driver or a control system is of great significance for ensuring driving safety and avoiding traffic accidents.
Common methods for road traffic sign recognition include shape-based methods, methods combining feature extraction with a classifier, and deep-learning methods. Shape-based methods have poor robustness and perform badly in complex environments. Methods combining feature extraction with a classifier recognize well but have high computational cost and poor adaptability to the environment. Deep learning can operate directly on the raw image, extract latent features that reflect the essence of the data, and has sufficient learning depth. Convolutional neural networks share weights locally and offer a degree of real-time performance and robustness under complex environments and multi-angle variations. It is therefore necessary to design a recognition method that can accurately identify road traffic signs in a road scene. The Fast R-CNN algorithm, proposed by Ross B. Girshick in 2015, avoids the redundant feature extraction of R-CNN (Region-based Convolutional Neural Network), realizes multi-task training, requires no additional feature storage space, and improves detection speed and accuracy.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a road traffic sign identification method based on Fast R-CNN, which can accurately identify road traffic signs in a road scene, helping the driver to better perceive the environment outside the vehicle under complex conditions and prevent traffic accidents.
To achieve this purpose, the technical solution of the invention is as follows:
A road traffic sign identification method based on Fast R-CNN comprises the following steps:
acquiring and preprocessing images to create a sample set;
inputting the training set and performing multi-task training of a Fast R-CNN network;
inputting the picture to be recognized into the Fast R-CNN network and passing it through several convolutional and pooling layers to obtain a feature map;
extracting a number of candidate boxes with the Selective Search algorithm, finding the feature box corresponding to each candidate box on the feature map according to the mapping between candidate boxes in the original image and the feature map, and pooling each feature box to a fixed size in the ROI pooling layer;
passing each feature box through the fully connected layers to obtain a fixed-size feature vector, and passing this vector through two separate fully connected layers to obtain two output vectors, namely the classification scores and the window regression parameters;
and applying non-maximum suppression to all results to generate the final target detection and recognition result, so that the road traffic sign is recognized.
Further, the image acquisition step specifically comprises:
starting the vehicle-mounted driving recorder and capturing road traffic video in real time;
splitting the captured video into frames to obtain an image sequence;
and screening the image set and selecting the images that contain road traffic signs.
Further, the image preprocessing and sample set creation steps specifically comprise (illustrated by the sketch after this list):
in each selected image, extracting the target area, scaling it to a fixed size of 224 × 224, and then applying contrast enhancement to the target area to obtain the original training set; the test set is processed in the same way;
rotating the original training-set samples by ±12° and scaling them by factors of 0.4 and 1.6, and adding the results to the original data set to form a new training set;
and randomly drawing from the new data set a number of samples equal to the size of the test set to form a validation set, the remaining samples forming the final training set.
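As an illustration only, the augmentation just described can be sketched with OpenCV as below. The ±12° rotation angles, the 0.4/1.6 scale factors and the 224 × 224 target size come from the text above; the choice of library, the function name and the final resize of the scaled copies back to the network input size are assumptions of this sketch, not part of the patented method.

```python
import cv2

def augment(image):
    """Return rotated (+/-12 deg) and rescaled (x0.4, x1.6) copies of a sample."""
    h, w = image.shape[:2]
    outputs = []
    for angle in (-12, 12):                       # rotation angles stated in the text
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        outputs.append(cv2.warpAffine(image, m, (w, h)))
    for scale in (0.4, 1.6):                      # scale factors stated in the text
        scaled = cv2.resize(image, None, fx=scale, fy=scale)
        # scaled copies resized back to the fixed 224 x 224 input (an assumption of this sketch)
        outputs.append(cv2.resize(scaled, (224, 224)))
    return outputs
```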
Further, the Fast R-CNN network structure comprises: 13 convolutional layers, 4 pooling layers, 1 ROI pooling layer, 2 fully connected layers and two parallel output layers.
Further, each feature box is pooled to a fixed size of 7 × 7 in the ROI pooling layer.
Further, the fully connected output of the multi-task-trained Fast R-CNN network comprises two branches: a cls_score layer for classification and a bbox_pred layer for adjusting candidate-box positions.
Further, when the feature vectors pass through their respective fully connected layers, the computation is accelerated by singular value decomposition (SVD) to obtain the two output vectors, namely the Softmax classification scores and the bounding-box window regression parameters.
Further, for the two branches of the fully connected output, the classification layer and the regression layer of the output layer are trained with stochastic gradient descent until the classification and regression loss functions converge.
Further, the step of applying non-maximum suppression to all results specifically comprises: according to the two output branches, performing non-maximum suppression for each object class separately using the window scores to remove overlapping candidate boxes, finally obtaining, for each class, the highest-scoring window after regression correction (a sketch of this step follows).
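The following is a minimal, generic sketch of the per-class non-maximum suppression referred to above; the IoU threshold of 0.3 and the function name are illustrative assumptions, since the text does not specify them.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Keep the highest-scoring boxes of one class, dropping boxes whose
    IoU with an already-kept box exceeds iou_thresh."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]               # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the best remaining box with every other remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]     # discard heavily overlapping boxes
    return keep
```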
Further, the road traffic signs comprise a straight-ahead arrow, a U-turn arrow, a left-turn arrow, a right-turn arrow, a straight-or-left-turn arrow, a straight-or-right-turn arrow and a diamond marking.
Compared with the prior art, the invention provides a traffic sign detection and identification method based on Fast R-CNN to solve at least some of the problems described above. The method automatically creates a road traffic sign data set and learns features from the samples through deep learning; it can extract latent features that reflect the essence of the data, achieves higher learning efficiency and recognition accuracy, improves the robustness of the detection algorithm, and effectively improves the accuracy of road traffic sign detection.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, are incorporated in and constitute a part of this application, illustrate only some non-limiting examples of the invention embodying the inventive concept and are not intended to be limiting in any way.
FIG. 1 is a flow chart of the Fast R-CNN-based road traffic sign identification method according to some exemplary embodiments of the invention.
FIG. 2 is a diagram of the Fast R-CNN network architecture according to some exemplary embodiments of the invention.
FIG. 3 is a diagram of the multi-task training cost function according to some exemplary embodiments of the invention.
FIG. 4 shows part of the traffic sign sample set according to some exemplary embodiments of the invention.
FIG. 5 shows detection results of the Fast R-CNN-based road traffic sign identification method according to some exemplary embodiments of the invention.
Detailed Description
The invention is described in detail below with reference to the drawings and the technical solution.
FIG. 1 shows the flow chart of the road traffic sign identification method based on Fast R-CNN; the specific embodiment of the invention is as follows:
a road traffic sign identification method based on Fast R-CNN comprises the following steps:
carrying out image acquisition and pretreatment to manufacture a sample set;
inputting a training set, and performing multitask training on a Fast R-CNN network;
inputting the picture to be identified into a Fast R-CNN network, and obtaining a characteristic diagram through a plurality of convolution layers and pooling layers;
extracting about 2000 candidate frames by adopting a Selective Search algorithm, finding a feature frame corresponding to each candidate frame in the feature map according to the mapping relation between the candidate frame and the feature map in the original image, and pooling each feature frame to a fixed size in an ROI (Region of Interest) pooling layer;
the feature frames are processed by full connection layers to obtain feature vectors with fixed sizes, and two output vectors of classification scores and window regression are obtained through respective full connection layers;
and (4) performing non-maximum suppression processing on all the results to generate final target detection and identification results, and identifying the traffic sign.
In some embodiments, the image acquisition step specifically comprises:
starting the vehicle-mounted driving recorder, capturing road traffic video in real time, and selecting video images captured by the recorder at a resolution of 1280 × 720;
splitting the captured video into frames to obtain an image sequence;
and screening the image set and selecting the 7 road traffic signs that occur most frequently in it.
Specifically, the image preprocessing and sample set creation steps comprise:
in each selected image, extracting the target area, scaling it to a fixed size of 224 × 224, and then applying contrast enhancement to the target area to form the original training set; the test set is processed in the same way;
rotating the original training-set samples by ±12° and scaling them by factors of 0.4 and 1.6, and adding the results to the original data set to form a new training set;
and randomly drawing from the new data set a number of samples equal to the size of the test set to form a validation set, the remaining samples forming the final training set.
In the example shown in FIG. 4, the selected traffic signs fall into 7 classes: straight-ahead arrow, U-turn arrow, left-turn arrow, right-turn arrow, straight-or-left-turn arrow, straight-or-right-turn arrow and diamond marking, numbered 01, 02, 03, 04, 05, 06 and 07 respectively; the recognition results are output with these numbers.
The VGG16 network proposed in "K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, 2015" comprises 13 convolutional layers, 5 pooling layers and 3 fully connected layers. Starting from VGG16, the last pooling layer of the VGG-16 network is replaced by an ROI pooling layer, and the last fully connected layer and the softmax layer are replaced by two parallel output layers.
As shown in FIG. 2, the resulting Fast R-CNN network structure is as follows: it comprises 13 convolutional layers, 4 pooling layers, 1 ROI pooling layer, 2 fully connected layers and two parallel output layers. Image samples of size 224 × 224 enter the network through the input layer; all convolutional layers use 3 × 3 kernels with stride 1; all pooling layers use a 2 × 2 window with max pooling. The activation function is the Rectified Linear Unit (ReLU), which induces moderate sparsity, speeds up network training, improves accuracy and avoids the vanishing-gradient problem.
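For illustration, a minimal PyTorch sketch of the backbone just described (13 convolutional layers with 3 × 3 kernels and stride 1 in the VGG-16 arrangement, 4 max-pooling layers with 2 × 2 windows, ReLU activations) is given below; the helper name and configuration list are assumptions of this sketch, and the ROI pooling layer and the two parallel output layers appear in later sketches.

```python
import torch.nn as nn

# VGG-16 style backbone: 13 conv layers (3x3, stride 1) and 4 max-pooling layers
# (2x2, max pooling); the fifth VGG-16 pooling layer is replaced by ROI pooling.
CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512]

def make_backbone(cfg=CFG, in_channels=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers.append(nn.Conv2d(in_channels, v, kernel_size=3, stride=1, padding=1))
            layers.append(nn.ReLU(inplace=True))
            in_channels = v
    return nn.Sequential(*layers)

backbone = make_backbone()   # input: 224 x 224 images -> 512-channel feature map
```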
The original layer parameters are initialized by pre-training. The fully connected layer for classification is initialized with a Gaussian distribution of mean 0 and standard deviation 0.01; the fully connected layer for regression is initialized with a Gaussian distribution of mean 0 and standard deviation 0.001; all biases are initialized to 0.
During fine-tuning, each mini-batch first takes N complete pictures and then R candidate boxes sampled from these N pictures. The R/N candidate boxes from the same image share convolution computation and memory, which reduces the computational cost. The R candidate boxes are composed as follows: candidate boxes whose overlap with some ground-truth box lies in [0.5, 1] are defined as foreground and account for 25% of the total; candidate boxes whose maximum overlap with the ground truth lies in [0.1, 0.5) are defined as background and account for 75% of the total.
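The 25% foreground / 75% background sampling rule can be sketched as follows; the helper assumes a precomputed array `max_iou` of overlaps between the proposals of one image and the ground-truth boxes, and the 64 proposals per image are an illustrative assumption rather than a value given in the text.

```python
import numpy as np

def sample_rois(max_iou, rois_per_image=64, fg_fraction=0.25, rng=np.random):
    """Pick the R/N proposals of one image: 25% foreground (IoU in [0.5, 1]),
    75% background (max IoU in [0.1, 0.5))."""
    fg_idx = np.where(max_iou >= 0.5)[0]
    bg_idx = np.where((max_iou >= 0.1) & (max_iou < 0.5))[0]
    n_fg = min(int(rois_per_image * fg_fraction), fg_idx.size)
    n_bg = min(rois_per_image - n_fg, bg_idx.size)
    fg = rng.choice(fg_idx, n_fg, replace=False)
    bg = rng.choice(bg_idx, n_bg, replace=False)
    return np.concatenate([fg, bg])   # background entries are later labelled u = 0
```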
The ROI pooling layer divides each feature box evenly into a fixed-size grid and max-pools each cell, so that feature boxes of different sizes on the feature map are converted into data of a uniform size and fed to the next layer.
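A NumPy sketch of this ROI pooling operation, assuming a feature box already cropped from the feature map; the 7 × 7 output follows the fixed size mentioned earlier, while the boundary-rounding details are simplified for illustration.

```python
import numpy as np

def roi_max_pool(feature_box, out_h=7, out_w=7):
    """Divide a (C, h, w) feature box into an out_h x out_w grid and max-pool
    each cell, producing a fixed-size (C, out_h, out_w) output."""
    c, h, w = feature_box.shape
    out = np.zeros((c, out_h, out_w), dtype=feature_box.dtype)
    ys = np.linspace(0, h, out_h + 1).astype(int)   # grid boundaries along height
    xs = np.linspace(0, w, out_w + 1).astype(int)   # grid boundaries along width
    for i in range(out_h):
        for j in range(out_w):
            cell = feature_box[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                  xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.reshape(c, -1).max(axis=1)
    return out
```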
The Fast R-CNN network is trained with the multi-task classification and regression losses shown in FIG. 3. The cls_score layer performs classification and outputs a (K + 1)-dimensional array p giving the probabilities of the K object classes and the background; K is set to 7 according to the number of classes to be detected. The bbox_pred layer adjusts the position of the candidate region and outputs a 4 × K-dimensional array giving the translation and scaling parameters for each of the K classes; a separate regressor is trained for each class.
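A minimal sketch of these two parallel output layers, including the Gaussian initialisation described above; the 4096-dimensional feature size is the usual VGG-16 fully connected width and is assumed here rather than stated in the text.

```python
import torch.nn as nn

K = 7                                    # number of traffic-sign classes
feat_dim = 4096                          # width of the last shared FC layer (assumed VGG-16 value)

cls_score = nn.Linear(feat_dim, K + 1)   # (K + 1)-way scores: K classes + background
bbox_pred = nn.Linear(feat_dim, 4 * K)   # 4 box-regression parameters per class

nn.init.normal_(cls_score.weight, mean=0.0, std=0.01)    # classification head: std 0.01
nn.init.normal_(bbox_pred.weight, mean=0.0, std=0.001)   # regression head: std 0.001
nn.init.constant_(cls_score.bias, 0.0)
nn.init.constant_(bbox_pred.bias, 0.0)
```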
The loss_cls layer evaluates the classification loss L_cls, determined by the probability p_u of the true class u:
L_cls = -log p_u    (1)
The loss_bbox layer evaluates the regression loss L_loc, which compares the predicted translation and scaling parameters for the true class u, t^u = (t^u_x, t^u_y, t^u_w, t^u_h), with the true translation and scaling parameters v = (v_x, v_y, v_w, v_h):
L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i − v_i)    (2)
smooth_L1(x) = 0.5 x² if |x| < 1, and |x| − 0.5 otherwise    (3)
Combining the classification and regression losses, the total loss function in the network fine-tuning stage is:
L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v)    (4)
where [u ≥ 1] equals 1 when u ≥ 1 and 0 otherwise. The background class is assigned u = 0; background candidate regions, i.e. negative samples, do not take part in the regression loss and no regression is performed for them. λ balances the classification and regression losses; λ = 1 is used.
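A sketch of the loss in Eqs. (1)-(4) in PyTorch; the tensor shapes and variable names are illustrative assumptions. `F.cross_entropy` computes Eq. (1) from raw class scores, and `F.smooth_l1_loss` with its default threshold of 1 matches Eq. (3).

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_scores, bbox_preds, labels, bbox_targets, lam=1.0):
    """Total loss L = L_cls + lam * [u >= 1] * L_loc, cf. Eqs. (1)-(4).

    cls_scores:   (R, K + 1) raw class scores for the R sampled candidate boxes
    bbox_preds:   (R, 4 * K) predicted translation/scaling parameters t
    labels:       (R,)       true class u for each box, long tensor (0 = background)
    bbox_targets: (R, 4)     true translation/scaling parameters v
    """
    # Eq. (1): L_cls = -log p_u, computed as cross entropy over the K + 1 classes
    l_cls = F.cross_entropy(cls_scores, labels)

    # Eqs. (2)-(3): smooth-L1 regression loss over foreground boxes (u >= 1) only,
    # averaged here over boxes and coordinates
    fg = labels > 0
    if fg.any():
        n_fg = int(fg.sum())
        cls_idx = labels[fg] - 1          # index into the K per-class regressors
        rows = torch.arange(n_fg, device=bbox_preds.device)
        t_u = bbox_preds[fg].view(n_fg, -1, 4)[rows, cls_idx]
        l_loc = F.smooth_l1_loss(t_u, bbox_targets[fg])
    else:
        l_loc = cls_scores.new_zeros(())

    # Eq. (4): total multi-task loss, lambda = 1 by default
    return l_cls + lam * l_loc
```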
The network is trained with stochastic gradient descent according to this loss function until L converges.
SVD decomposition is used in the Fast R-CNN network to speed up the computation of the fully connected layers. Object classification and window regression are both implemented through fully connected layers. Let the input of a fully connected layer be x, its output y, and its parameter matrix W; one forward pass is then:
y = W x    (5)
Applying SVD to W, W = U Σ V^T, decomposes the original forward pass into two steps:
y = W x = U · (Σ · V^T · x) = U · z    (6)
where z = Σ · V^T · x is an intermediate result; keeping only the largest singular values greatly reduces the amount of computation, thereby accelerating the fully connected layers.
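A NumPy sketch of this truncated-SVD acceleration of Eqs. (5) and (6); the number of retained singular values t, the 4096 × 4096 layer size and the function name are illustrative assumptions of this sketch.

```python
import numpy as np

def svd_compress_fc(W, t):
    """Approximate a fully connected layer y = W x by two smaller layers
    y ~= U_t (S_t V_t^T x), keeping only the t largest singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_t = U[:, :t]                        # first factor:  (out_dim, t)
    SVt = S[:t, None] * Vt[:t, :]         # second factor: (t, in_dim) = Sigma_t V_t^T
    return U_t, SVt

# usage: one forward pass performed in two cheaper steps
rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096))
x = rng.standard_normal(4096)
U_t, SVt = svd_compress_fc(W, t=256)
z = SVt @ x                               # z = Sigma V^T x   (Eq. 6, first step)
y = U_t @ z                               # y = U z           (Eq. 6, second step)
```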
FIG. 5 shows detection results for road traffic signs; it can be seen that detection and recognition perform well under ordinary road conditions.
In summary, the invention provides a traffic sign detection and identification method based on Fast R-CNN. The method learns features from the samples through deep learning, can extract latent features that reflect the essence of the data, achieves higher learning efficiency and recognition accuracy, improves the robustness of the detection algorithm, and effectively improves the accuracy of road traffic sign detection. It can, to a large extent, cope with the detection difficulties caused by severe occlusion, wear, deformation or illumination change of road traffic signs. Some of the method steps and flows herein may need to be performed by a computer and may be implemented in hardware, software, firmware, or any combination thereof.
The above description is only a preferred embodiment of the present invention and should not be taken as limiting its scope, which is intended to cover all equivalent changes, modifications and substitutions within the appended claims. Those skilled in the art will recognize that changes and modifications may be made without departing from the scope and spirit of the invention.

Claims (1)

1. A road traffic sign identification method based on Fast R-CNN, comprising the following steps:
acquiring and preprocessing images to create a sample set;
inputting the training set and performing multi-task training of a Fast R-CNN network;
inputting the picture to be recognized into the Fast R-CNN network and passing it through several convolutional and pooling layers to obtain a feature map;
extracting a number of candidate boxes with the Selective Search algorithm, finding the feature box corresponding to each candidate box on the feature map according to the mapping between candidate boxes in the original image and the feature map, and pooling each feature box to a fixed size in the ROI (Region of Interest) pooling layer;
passing each feature box through the fully connected layers to obtain a fixed-size feature vector, and obtaining through two separate fully connected layers the two output vectors, namely the classification scores and the window regression parameters;
applying non-maximum suppression to all results to generate the final target detection and recognition result and recognize the traffic sign;
the image acquisition step specifically comprises:
starting the vehicle-mounted driving recorder, capturing road traffic video in real time, and selecting video images captured by the recorder at a set resolution;
splitting the captured video into frames to obtain an image sequence;
screening the image set and selecting the 7 road traffic signs that occur most frequently in it;
the image preprocessing and sample set creation steps specifically comprise:
in each selected image, extracting the target area, scaling it to a fixed size, and then applying contrast enhancement to the target area to form the original training set, the test set being processed in the same way;
rotating the original training-set samples by ±12° and scaling them by factors of 0.4 and 1.6, and adding the results to the original data set to form a new training set;
randomly drawing from the new data set a number of samples equal to the size of the test set to form a validation set, the remaining samples forming the final training set;
the selected traffic signs fall into 7 classes, namely straight-ahead arrow, U-turn arrow, left-turn arrow, right-turn arrow, straight-or-left-turn arrow, straight-or-right-turn arrow and diamond marking, numbered 01, 02, 03, 04, 05, 06 and 07 respectively, and the recognition results are output with these numbers;
on the basis of VGG16, the last pooling layer of the VGG-16 network is replaced by an ROI pooling layer, and the last fully connected layer and the softmax layer of the VGG-16 network are replaced by two parallel layers;
the resulting Fast R-CNN network structure is as follows: it comprises 13 convolutional layers, 4 pooling layers, 1 ROI pooling layer, 2 fully connected layers and two parallel output layers; image samples of fixed size enter the network through the input layer; all convolutional layers use 3 × 3 kernels with stride 1; all pooling layers use a 2 × 2 window with max pooling; the activation function is the Rectified Linear Unit (ReLU); the original layer parameters are initialized by pre-training; the fully connected layer for classification is initialized with a Gaussian distribution of mean 0 and standard deviation 0.01; the fully connected layer for regression is initialized with a Gaussian distribution of mean 0 and standard deviation 0.001, with all biases initialized to 0;
during fine-tuning, each mini-batch first takes N complete pictures and then R candidate boxes sampled from these N pictures; the R/N candidate boxes from the same image share convolution computation and memory, reducing the computational cost; the R candidate boxes are composed as follows: candidate boxes whose overlap with some ground-truth box lies in [0.5, 1] are defined as foreground and account for 25% of the total; candidate boxes whose maximum overlap with the ground truth lies in [0.1, 0.5) are defined as background and account for 75% of the total;
the ROI pooling layer divides each feature box evenly into a fixed-size grid and max-pools each cell, so that feature boxes of different sizes on the feature map are converted into data of a uniform size and fed to the next layer;
the Fast R-CNN network is trained with multi-task losses, wherein the cls_score layer performs classification and outputs a (K + 1)-dimensional array p giving the probabilities of the K object classes and the background, K being set to 7 according to the number of classes to be detected; the bbox_pred layer adjusts the position of the candidate region and outputs a 4 × K-dimensional array giving the translation and scaling parameters for each of the K classes, a separate regressor being trained for each class;
the loss_cls layer evaluates the classification loss L_cls, determined by the probability p_u of the true class u:
L_cls = -log p_u    (1)
the loss_bbox layer evaluates the regression loss L_loc, which compares the predicted translation and scaling parameters for the true class u, t^u = (t^u_x, t^u_y, t^u_w, t^u_h), with the true translation and scaling parameters v = (v_x, v_y, v_w, v_h):
L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i − v_i)    (2)
smooth_L1(x) = 0.5 x² if |x| < 1, and |x| − 0.5 otherwise    (3)
combining the classification and regression losses, the total loss function in the network fine-tuning stage is:
L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v)    (4)
the background class is assigned u = 0; background candidate regions, namely negative samples, do not take part in the regression loss and no regression is performed for them; λ balances the classification and regression losses, λ = 1;
the network is trained with stochastic gradient descent according to this loss function until L converges;
SVD decomposition is used in the Fast R-CNN network to speed up the computation of the fully connected layers; object classification and window regression are both implemented through fully connected layers; letting the input of a fully connected layer be x, its output y, and its parameter matrix W, one forward pass is:
y = W x    (5)
applying SVD to W, W = U Σ V^T, decomposes the original forward pass into two steps:
y = W x = U · (Σ · V^T · x) = U · z    (6)
where z = Σ · V^T · x is an intermediate result; keeping only the largest singular values greatly reduces the amount of computation, thereby accelerating the fully connected layers.
CN201710421849.XA 2017-06-07 2017-06-07 Road traffic sign identification method based on Fast R-CNN Active CN107301383B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710421849.XA CN107301383B (en) 2017-06-07 2017-06-07 Road traffic sign identification method based on Fast R-CNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710421849.XA CN107301383B (en) 2017-06-07 2017-06-07 Road traffic sign identification method based on Fast R-CNN

Publications (2)

Publication Number Publication Date
CN107301383A CN107301383A (en) 2017-10-27
CN107301383B true CN107301383B (en) 2020-11-24

Family

ID=60134793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710421849.XA Active CN107301383B (en) 2017-06-07 2017-06-07 Road traffic sign identification method based on Fast R-CNN

Country Status (1)

Country Link
CN (1) CN107301383B (en)

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107895191B (en) 2017-10-30 2022-02-22 上海寒武纪信息科技有限公司 Information processing method and related product
CN107909044B (en) * 2017-11-22 2020-04-28 天津大学 People counting method combining convolutional neural network and track prediction
CN108009581A (en) * 2017-11-30 2018-05-08 中国地质大学(武汉) A kind of method for crack based on CNN, equipment and storage device
CN108095746A (en) * 2017-12-21 2018-06-01 安徽省星灵信息科技有限公司 A kind of automatic light concentrator and automatic beam photosystem
CN108275524B (en) * 2018-01-12 2019-08-09 东北大学 A kind of elevator maintenance operation monitoring and guiding device based on the assessment of the first multi-view video series of operations
CN108346144B (en) * 2018-01-30 2021-03-16 哈尔滨工业大学 Automatic bridge crack monitoring and identifying method based on computer vision
CN108416270B (en) * 2018-02-06 2021-07-06 南京信息工程大学 Traffic sign identification method based on multi-attribute combined characteristics
CN108388641B (en) * 2018-02-27 2022-02-01 广东方纬科技有限公司 Traffic facility map generation method and system based on deep learning
CN108416283A (en) * 2018-02-28 2018-08-17 华南理工大学 A kind of pavement marking recognition methods based on SSD
CN108694829B (en) * 2018-03-27 2021-08-13 西安科技大学 Traffic flow identification monitoring network and method based on unmanned aerial vehicle group mobile platform
CN108491889A (en) * 2018-04-02 2018-09-04 深圳市易成自动驾驶技术有限公司 Image, semantic dividing method, device and computer readable storage medium
CN110390344B (en) * 2018-04-19 2021-10-26 华为技术有限公司 Alternative frame updating method and device
CN108520286A (en) * 2018-04-24 2018-09-11 青岛科技大学 Infrared dark dim light small target deteection system based on convolutional Neural and candidate region
CN108776834B (en) * 2018-05-07 2021-08-06 上海商汤智能科技有限公司 System reinforcement learning method and device, electronic equipment and computer storage medium
CN110555877B (en) * 2018-05-31 2022-05-31 杭州海康威视数字技术股份有限公司 Image processing method, device and equipment and readable medium
CN108960308A (en) * 2018-06-25 2018-12-07 中国科学院自动化研究所 Traffic sign recognition method, device, car-mounted terminal and vehicle
CN109214441A (en) * 2018-08-23 2019-01-15 桂林电子科技大学 A kind of fine granularity model recognition system and method
US20210133854A1 (en) 2018-09-13 2021-05-06 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN109492661B (en) * 2018-09-27 2021-08-13 桂林电子科技大学 Traffic sign identification method based on WWCNN model and TVPDE algorithm
CN111222387B (en) * 2018-11-27 2023-03-03 北京嘀嘀无限科技发展有限公司 System and method for object detection
CN109559536B (en) * 2018-12-10 2021-06-08 百度在线网络技术(北京)有限公司 Traffic light, traffic light identification method, traffic light identification device, traffic light identification equipment and storage medium
CN109636881A (en) * 2018-12-19 2019-04-16 沈阳天择智能交通工程有限公司 Based on AI identification technology traffic accident situ sketch drafting method
CN109858372B (en) * 2018-12-29 2021-04-27 浙江零跑科技有限公司 Lane-level precision automatic driving structured data analysis method
CN109886279B (en) * 2019-01-24 2023-09-29 平安科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN110188705B (en) * 2019-06-02 2022-05-06 东北石油大学 Remote traffic sign detection and identification method suitable for vehicle-mounted system
CN110163187B (en) * 2019-06-02 2022-09-02 东北石油大学 F-RCNN-based remote traffic sign detection and identification method
CN110312103A (en) * 2019-07-06 2019-10-08 辽宁大学 A kind of high-speed equipment anti-thefting monitoring method for processing video frequency based on cloud computing platform
CN112329497A (en) * 2019-07-18 2021-02-05 杭州海康威视数字技术股份有限公司 Target identification method, device and equipment
CN112784084B (en) * 2019-11-08 2024-01-26 阿里巴巴集团控股有限公司 Image processing method and device and electronic equipment
CN111160336A (en) * 2019-12-09 2020-05-15 平安科技(深圳)有限公司 Target detection method, device and computer readable storage medium
CN111126571B (en) * 2019-12-25 2023-07-14 福建天晴数码有限公司 R-CNN network optimization method based on DHT network and storage medium
CN111401466A (en) * 2020-03-26 2020-07-10 广州紫为云科技有限公司 Traffic sign detection and identification marking method and device and computer equipment
CN111444976A (en) * 2020-04-02 2020-07-24 Oppo广东移动通信有限公司 Target detection method and device, electronic equipment and readable storage medium
CN111814873A (en) * 2020-07-07 2020-10-23 广州市运通水务有限公司 Method for distinguishing drainage pipeline defect types and automatically identifying defect grades
CN112929665A (en) * 2021-01-28 2021-06-08 北京博雅慧视智能技术研究院有限公司 Target tracking method, device, equipment and medium combining super-resolution and video coding
CN113344880A (en) * 2021-06-09 2021-09-03 浙江国研智能电气有限公司 Fast-RCNN-based low-voltage electrical appliance transfer printing pattern defect detection method
CN116901975B (en) * 2023-09-12 2023-11-21 深圳市九洲卓能电气有限公司 Vehicle-mounted AI security monitoring system and method thereof

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120016461A (en) * 2010-08-16 2012-02-24 주식회사 이미지넥스트 Pavement marking recogniton system and method
CN104766042A (en) * 2014-01-06 2015-07-08 现代摩比斯株式会社 Method and apparatus for and recognizing traffic sign board
CN105930830A (en) * 2016-05-18 2016-09-07 大连理工大学 Road surface traffic sign recognition method based on convolution neural network
CN106295707A (en) * 2016-08-17 2017-01-04 北京小米移动软件有限公司 Image-recognizing method and device
CN106372571A (en) * 2016-08-18 2017-02-01 宁波傲视智绘光电科技有限公司 Road traffic sign detection and identification method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372577A (en) * 2016-08-23 2017-02-01 北京航空航天大学 Deep learning-based traffic sign automatic identifying and marking method
CN106682569A (en) * 2016-09-28 2017-05-17 天津工业大学 Fast traffic signboard recognition method based on convolution neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120016461A (en) * 2010-08-16 2012-02-24 주식회사 이미지넥스트 Pavement marking recogniton system and method
CN104766042A (en) * 2014-01-06 2015-07-08 现代摩比斯株式会社 Method and apparatus for and recognizing traffic sign board
CN105930830A (en) * 2016-05-18 2016-09-07 大连理工大学 Road surface traffic sign recognition method based on convolution neural network
CN106295707A (en) * 2016-08-17 2017-01-04 北京小米移动软件有限公司 Image-recognizing method and device
CN106372571A (en) * 2016-08-18 2017-02-01 宁波傲视智绘光电科技有限公司 Road traffic sign detection and identification method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fast R-CNN paper explained ("Fast R-CNN论文详解"); xuanyuyt; 博客园 (cnblogs); 20161222; pp. 1-6 *
Road Surface Traffic Sign Detection with Hybrid Region Proposal and Fast R-CNN;Rongqiang Q. 等;《2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD)》;20161024;第555-559页 *

Also Published As

Publication number Publication date
CN107301383A (en) 2017-10-27

Similar Documents

Publication Publication Date Title
CN107301383B (en) Road traffic sign identification method based on Fast R-CNN
US11734786B2 (en) Low- and high-fidelity classifiers applied to road-scene images
CN108875608B (en) Motor vehicle traffic signal identification method based on deep learning
CN106683119B (en) Moving vehicle detection method based on aerial video image
Zhang et al. DAGN: A real-time UAV remote sensing image vehicle detection framework
US20170206434A1 (en) Low- and high-fidelity classifiers applied to road-scene images
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN108280460B (en) SAR vehicle target identification method based on improved convolutional neural network
Hoang et al. Enhanced detection and recognition of road markings based on adaptive region of interest and deep learning
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN106815323B (en) Cross-domain visual retrieval method based on significance detection
CN109658442B (en) Multi-target tracking method, device, equipment and computer readable storage medium
Wang et al. Spatial attention for multi-scale feature refinement for object detection
CN111340855A (en) Road moving target detection method based on track prediction
CN113205026B (en) Improved vehicle type recognition method based on fast RCNN deep learning network
Xing et al. Traffic sign recognition using guided image filtering
Rafique et al. Smart traffic monitoring through pyramid pooling vehicle detection and filter-based tracking on aerial images
Liao et al. Unsupervised cluster guided object detection in aerial images
US20240161304A1 (en) Systems and methods for processing images
Zang et al. Traffic lane detection using fully convolutional neural network
CN114037640A (en) Image generation method and device
Dorbe et al. FCN and LSTM based computer vision system for recognition of vehicle type, license plate number, and registration country
Asgarian Dehkordi et al. Vehicle type recognition based on dimension estimation and bag of word classification
CN111274964A (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
CN112949453A (en) Training method of smoke and fire detection model, smoke and fire detection method and smoke and fire detection equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant