CN111582029B - Traffic sign identification method based on dense connection and attention mechanism - Google Patents


Info

Publication number
CN111582029B
Authority
CN
China
Prior art keywords
traffic sign
output
path
branch
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010255951.9A
Other languages
Chinese (zh)
Other versions
CN111582029A (en)
Inventor
褚晶辉
黄浩
吕卫
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202010255951.9A
Publication of CN111582029A
Application granted
Publication of CN111582029B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58: Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V 20/582: of traffic signs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a traffic sign recognition method based on dense connections and a channel attention mechanism, comprising the following steps: constructing a data set and performing data preprocessing; building a traffic sign recognition neural network based on dense connections and an attention mechanism with a deep learning framework; feeding the training-set images into the network, obtaining traffic sign category and position information through forward propagation, computing the error against the ground-truth information, back-propagating, and updating the network parameters until the error no longer decreases; and inputting an image containing a traffic sign, loading the trained model, and outputting the recognition result image. The invention makes full use of the deep features of the network, giving the network stronger representation capability and a better mix of global and local information.

Description

Traffic sign identification method based on dense connection and attention mechanism
Technical Field
The invention relates to the fields of driver-assistance systems and object detection, in particular to a traffic sign recognition method based on dense connections and an attention mechanism.
Background
In existing vision-based autonomous driving systems, object detection is the core task, covering lane-line detection, vehicle detection, non-motor-vehicle detection, pedestrian detection, traffic sign detection, and so on. When driving on a real road, an autonomous vehicle must obey traffic regulations and make decisions according to traffic signs and the actual conditions of the road; facing complex and changeable road scenes, the vehicle needs "prompts" for standard driving from its surroundings, so a traffic sign detection algorithm is an indispensable part of an autonomous driving system. In early research, scholars at home and abroad mainly combined various image processing methods to tackle traffic sign recognition, because traffic signs have regular shapes and bright colors. In recent years, with the continued deepening of neural network research, deep learning methods have been applied ever more widely in the autonomous driving field and occupy an important position.
Deep-learning-based traffic sign recognition algorithms achieve high accuracy and cope better with special conditions such as occlusion and rain or snow. At present, common traffic sign recognition methods based on convolutional neural networks include Faster R-CNN [1], SSD [2], and YOLO [3]. The patent "A traffic sign recognition method based on SRCNN" (CN110321803A) improves recognition accuracy on low-resolution images while keeping the computation small. The patent "A traffic sign recognition method based on an improved SSD network" (CN110287806A) aims to raise the SSD network's detection rate on small targets, realizing detection of small traffic signs with the SSD algorithm.
In the field of traffic sign recognition, anchor boxes are usually treated as a set of prior boxes of different sizes; this heuristic prior information plays a large role in frameworks such as Faster R-CNN and SSD. However, a neural network typically has to generate a very large set of anchor boxes of which only a small fraction overlaps the ground truth, creating a severe imbalance between positive and negative samples and slowing down training.
Reference to the literature
[1] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in Neural Information Processing Systems. 2015: 91-99.
[2] Huang J, Rathod V, Sun C, et al. Speed/accuracy trade-offs for modern convolutional object detectors[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 7310-7311.
[3] Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 779-788.
Disclosure of Invention
The invention provides a traffic sign recognition method based on dense connections and an attention mechanism that abandons anchor boxes, uses the densely connected network DenseNet121 as the backbone for feature extraction, and introduces a channel attention mechanism, described in detail as follows:
A traffic sign identification method based on dense connections and a channel attention mechanism comprises the following steps:
constructing a data set and carrying out data preprocessing;
building a traffic sign recognition neural network based on dense connections and an attention mechanism with a deep learning framework;
feeding the training-set images into the network, obtaining traffic sign category and position information through forward propagation, computing the error against the ground-truth information, back-propagating, and updating the network parameters until the error no longer decreases;
inputting an image containing a traffic sign, loading the trained model, and outputting the recognition result image.
The traffic sign recognition neural network based on dense connections and the attention mechanism, built with a deep learning framework, is structured as follows:
the encoding path, the decoding path, the output prediction part, and four parallel branches together form a U-shaped encoder-decoder network.
Further, the decoding path is divided into four parts,
each consisting of a residual block and a channel attention module; the feature map in the decoding path passes through the residual block and then enters the channel attention module, which increases the weights of effective channels and reduces the weights of ineffective channels.
Wherein the number of parallel branches is four:
the first branch feeds the output of the fourth part of the encoding path into the decoding path;
the second branch fuses the output of the third part of the encoding path with the output of the first part of the decoding path, the result serving as input to the second part of the decoding path;
the third branch fuses the output of the second part of the encoding path with the output of the second part of the decoding path and feeds the result into the third part of the decoding path;
and the fourth branch fuses the output of the first part of the encoding path with the output of the third part of the decoding path and feeds the result into the fourth part of the decoding path.
Further, the output prediction part is formed by three more parallel branches:
after two convolution blocks, the fifth branch yields an N-channel feature map, where N is the number of traffic sign categories, and the probability that the sign belongs to each of the N categories is computed from this feature map;
after two convolution blocks, the sixth branch yields a two-channel feature map; computing over the two channels gives two values X and Y, predicting one coordinate, namely the center point of the traffic sign;
and the last branch likewise yields a two-channel feature map after two convolution blocks; computing over the two channels gives two values w and h, namely the width and height of the traffic sign detection box.
The technical scheme provided by the invention has the following beneficial effects:
1. DenseNet121 is adopted as the backbone for feature extraction; with dense connectivity, each layer receives all preceding layers as additional input. DenseNet achieves feature reuse with fewer parameters and more efficient computation, and promotes backpropagation of gradients, making the network easier to train;
2. a U-shaped encoder-decoder network fuses the deep and shallow features of the network to detect traffic signs at different scales; the repeated encoding and decoding operations make full use of deep features, giving the network stronger representation capability and a better mix of global and local information;
3. in the decoding network, a channel attention module is introduced before each feature fusion branch, fully filtering out useless channel information while retaining beneficial information and fusing it into the feature map, which improves the accuracy of traffic sign recognition; meanwhile, the many residual structures used in the decoding network increase its nonlinear capacity and alleviate network degradation;
4. the channel attention module of the invention uses average pooling and maximum pooling simultaneously, combining the two to increase the weights of effective channels and reduce the weights of ineffective channels.
Drawings
FIG. 1 is a diagram of an overall neural network architecture for a traffic sign recognition method based on a dense connection and attention mechanism;
FIG. 2 is a schematic diagram of a neural network structure of a Dense connection module Dense Block;
FIG. 3 is a schematic diagram of a neural network structure of an attention module;
fig. 4 is a diagram of the recognition effect of the traffic sign recognition method based on the dense connection and attention mechanism.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
The innovation points of the invention are as follows:
1) the method uses a densely connected network as the backbone for feature extraction; through feature reuse and bypass connections it greatly reduces the network's parameter count and, to some extent, alleviates the vanishing-gradient problem;
2) the feature maps output by different DenseNet121 modules are fused, making full use of multi-scale information; the network's final feature map carries information about targets of various sizes, so large targets are easily distinguished and the recognition rate of small targets is improved;
3) the network uses a channel attention mechanism, computing the relative importance of channels through a fully connected layer and thereby filtering out unimportant channel values. The network concentrates attention on important parameters, selects key information, ignores the rest, increases the weights of beneficial parameters, and reduces the weights of useless ones;
4) traffic signs are recognized with an anchor-free method, avoiding the imbalance between positive and negative samples caused by anchor boxes and improving the accuracy of traffic sign recognition.
Example 1
The embodiment of the invention provides a traffic sign identification method based on dense connection and channel attention mechanism, and referring to fig. 1, the method comprises the following steps:
101: constructing a data set and carrying out data preprocessing;
the step 101 is specifically: a data set was downloaded, the data source being the chinese traffic sign data set TT100K (Tsinghua-Tencent 100K) published by the university of qinghua, which was taken from a street view panorama in Tencent. The training set of data sets contained 6107 pictures, and the test set contained 3073 pictures with image sizes of 2048 × 2048 pixels. The invention selects the category with the frequency of appearance more than 100 in the data set for training, and the category has 45 categories in total.
102: building a traffic sign recognition neural network based on dense connections and an attention mechanism with the deep learning framework PyTorch;
the traffic sign recognition neural network based on the dense connection and attention mechanism comprises four parts, namely an encoding path, a decoding path, an output prediction part and four parallel branches, and a U-shaped encoding and decoding network is formed together.
The encoding path consists of the feature extraction network DenseNet121; the input image first passes through a 7 × 7 convolution module for preliminary extraction of shallow features. The encoding path is divided into four parts, each composed of a convolution block and a dense block. The four dense blocks contain 6, 12, 24, and 16 densely connected layers respectively, for thorough extraction of image features. A pooling layer between every two adjacent dense blocks performs down-sampling, expanding the network's receptive field and yielding feature maps at multiple scales. The feature map obtained after each down-sampling is fused, via a parallel branch, with the feature map of the same scale in the decoding path, giving the network more global feature information.
The decoding path is divided into four parts, each consisting of a residual block and a channel attention module. The feature map in the decoding path passes through the residual block and then enters the channel attention module, which increases the weights of effective channels and reduces the weights of ineffective channels.
There are four parallel branches: the first branch feeds the output of the fourth part of the encoding path into the decoding path; the second branch fuses the output of the third part of the encoding path with the output of the first part of the decoding path, the result serving as input to the second part of the decoding path; the third branch fuses the output of the second part of the encoding path with the output of the second part of the decoding path and feeds the result into the third part of the decoding path; the fourth branch fuses the output of the first part of the encoding path with the output of the third part of the decoding path and feeds the result into the fourth part of the decoding path.
After the encoder-decoder network, the feature map is restored to the original size and contains multi-scale and deep semantic features; this high-dimensional feature map, called a heat map, can predict traffic signs of various sizes. The heat map is sent to the output prediction part, which consists of three parallel branches. After two convolution blocks, the fifth branch yields an N-channel feature map, where N is the number of traffic sign categories; the probability that the sign belongs to each of the N categories is computed from this feature map, predicting the sign's category. After two convolution blocks, the sixth branch yields a two-channel feature map; computing over the two channels gives two values X and Y, predicting one coordinate, namely the coordinates (X, Y) of the traffic sign's center point. The last (seventh) branch likewise yields a two-channel feature map after two convolution blocks; computing over the two channels gives two values w and h, the width and height of the traffic sign detection box.
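The three output heads described above can be sketched in PyTorch (the patent's stated framework). The intermediate channel counts and the head layout here are assumptions for illustration; only the output channel counts (45 classes, 2 for the center, 2 for width/height) come from the text.

```python
import torch
import torch.nn as nn

# Hedged sketch of the output prediction part: three parallel heads on the
# heat map, producing the class map (N = 45 channels), the center point
# (2 channels), and the box width/height (2 channels).
class PredictionHeads(nn.Module):
    def __init__(self, in_ch=256, num_classes=45):
        super().__init__()
        self.cls_head = self._head(in_ch, num_classes)  # fifth branch: categories
        self.xy_head = self._head(in_ch, 2)             # sixth branch: center (X, Y)
        self.wh_head = self._head(in_ch, 2)             # seventh branch: (w, h)

    @staticmethod
    def _head(in_ch, out_ch):
        # "two convolution blocks" per branch: a 3x3 block then a 1x1 projection
        return nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.Conv2d(128, out_ch, 1),
        )

    def forward(self, heatmap):
        return self.cls_head(heatmap), self.xy_head(heatmap), self.wh_head(heatmap)

cls_map, xy_map, wh_map = PredictionHeads()(torch.randn(1, 256, 128, 128))
print(cls_map.shape, xy_map.shape, wh_map.shape)
```

Each head shares the same 3 × 3 block structure and differs only in its final channel count, matching the branch descriptions above.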
103: training a model;
and inputting the pictures in the training set into a traffic sign recognition neural network based on dense connection and attention mechanism, obtaining traffic sign category and position information through forward propagation, performing backward propagation with information calculation errors in a ground route, and continuously updating network parameters until the errors are not reduced. And storing the trained network parameters as a model.
104: and inputting a picture with a traffic sign, loading the trained model in the third step, and outputting a traffic sign recognition result picture.
Example 2
The scheme of example 1 is described in further detail below with reference to specific examples:
201: constructing a data set and carrying out data preprocessing:
(1) The invention uses the public TT100K (Tsinghua-Tencent 100K) data set, divided into a training set and a test set. The training set contains 6107 pictures and the test set contains 3073 pictures, each 2048 × 2048 pixels. TT100K was captured from Tencent street-view panoramas and covers 180 Chinese traffic sign categories in total, but many of these categories are rare and occur infrequently in the data set. The invention trains on the 45 traffic sign categories that appear more than 100 times.
(2) Because GPU memory limits prevent training directly on the whole image, the pictures in (1) are cropped: the 2048 × 2048 training images are cut into 512 × 512 crops. The data set is then processed and converted to JSON files in the standard COCO data format for the network to read.
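The cropping step above can be sketched as a small helper that enumerates the crop windows. The patent does not state whether crops overlap; this sketch assumes non-overlapping 512 × 512 tiles, and the function name is a hypothetical choice.

```python
# Hypothetical sketch of the tiling step: a 2048x2048 training image is too
# large for GPU memory, so it is cut into 512x512 crops. The stride (no
# overlap) is an assumption; the patent only gives the two sizes.
def crop_windows(image_size=2048, crop_size=512, stride=512):
    """Return (left, top, right, bottom) boxes covering the full image."""
    boxes = []
    for top in range(0, image_size, stride):
        for left in range(0, image_size, stride):
            boxes.append((left, top, left + crop_size, top + crop_size))
    return boxes

windows = crop_windows()
print(len(windows))  # 16 non-overlapping 512x512 tiles from one 2048x2048 image
```

Each window, together with the ground-truth boxes that fall inside it, would then be written into the COCO-format JSON annotation file.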
202: a traffic sign recognition neural network based on dense connections and an attention mechanism is built with the deep learning framework PyTorch; the network has a U-shaped structure, described in detail below:
(1) The main structure of the neural network is shown in fig. 1. A U-shaped encoder-decoder network is formed by four parts: an encoding path formed by the feature extraction network DenseNet121, a decoding path formed by channel attention modules and residual blocks, an output prediction part, and four parallel branches.
a) The encoding path consists of the feature extraction network DenseNet121 and is divided into four parts. The first part consists of a 7 × 7 convolution block and a dense block (Dense Block): the 7 × 7 convolution block, for preliminary extraction of shallow features, contains 1 convolution layer with kernel size 7 × 7, dilation rate 1 × 1, 64 channels, and stride 2, 1 BatchNorm layer, 1 ReLU layer, and one pooling layer (max pooling); the dense block contains 6 densely connected layers (L = 6). The second part consists of a 3 × 3 convolution block and a dense block: the 3 × 3 convolution block contains 1 convolution layer with kernel size 3 × 3, dilation rate 1 × 1, 128 channels, and stride 1, 1 BatchNorm layer, 1 ReLU layer, and one max pooling layer; the dense block contains 12 densely connected layers (L = 12). The third part has the same structure with 256 channels and a dense block of 24 densely connected layers (L = 24). The fourth part has the same structure with 512 channels and a dense block of 16 densely connected layers (L = 16). Each convolution block carries a pooling layer for down-sampling, enlarging the network's receptive field and yielding feature maps at multiple scales. The feature map obtained after each down-sampling is fused, via a parallel branch, with the feature map of the same scale in the decoding path, giving the network more global feature information.
b) The decoding path is divided into four parts, each consisting of a residual block and a channel attention module. The feature map passes through the residual block and then enters the channel attention module, which increases the weights of effective channels and reduces the weights of ineffective channels. The four residual blocks are identical in structure; each contains two convolution blocks in series, and each of these convolution blocks contains 1 convolution layer with kernel size 3 × 3, dilation rate 1 × 1, stride 1, and 256 channels, 1 BatchNorm layer, and 1 ReLU layer. A channel attention module follows each residual block, fully filtering out useless channel information while retaining beneficial information and fusing it into the feature map. After the encoder-decoder network, the feature map is restored to the original size and contains multi-scale and deep semantic features; this feature map, called a heat map, can predict traffic signs of various sizes.
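A decoder residual block as described above can be sketched as follows. The placement of the identity shortcut (added after the two convolution blocks) is the standard residual pattern and is an assumption; the text only specifies the two 3 × 3, 256-channel convolution blocks in series.

```python
import torch
import torch.nn as nn

# Minimal sketch of one decoder residual block: two 3x3 conv blocks
# (Conv + BatchNorm + ReLU, 256 channels, stride 1) in series, plus an
# identity skip connection to counter network degradation.
class ResidualBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)  # identity shortcut preserves gradient flow

out = ResidualBlock()(torch.randn(2, 256, 16, 16))
print(out.shape)  # torch.Size([2, 256, 16, 16])
```

Because the shortcut is an identity, input and output shapes match, so the block can be dropped into any stage of the decoding path.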
c) The number of parallel branches is 4: the first branch passes the output of the fourth part of the encoding path through a 1 × 1 convolution block and feeds it into the decoding path; the second branch passes the output of the third part of the encoding path through a 1 × 1 convolution block, fuses it with the output of the first part of the decoding path, and uses the result as input to the second part of the decoding path; the third branch does the same with the output of the second part of the encoding path and the output of the second part of the decoding path, feeding the third part of the decoding path; the fourth branch does the same with the output of the first part of the encoding path and the output of the third part of the decoding path, feeding the fourth part of the decoding path. Each of the four 1 × 1 convolution blocks contains 1 convolution layer with kernel size 1 × 1, dilation rate 1 × 1, and stride 1, 1 BatchNorm layer, and 1 ReLU layer.
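One such lateral branch can be sketched as below. The patent only says "feature fusion"; element-wise addition is an assumption here (concatenation would be an equally plausible reading), and the channel counts are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of one parallel branch: the encoder feature passes through a
# 1x1 conv block (Conv + BatchNorm + ReLU) and is fused with the decoder
# feature of the same scale. Fusion by addition is an assumption.
class LateralFusion(nn.Module):
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, encoder_feat, decoder_feat):
        # project the encoder feature to the decoder's channel count, then fuse
        return self.proj(encoder_feat) + decoder_feat

fused = LateralFusion(512)(torch.randn(1, 512, 64, 64), torch.randn(1, 256, 64, 64))
print(fused.shape)  # torch.Size([1, 256, 64, 64])
```

The 1 × 1 projection lets encoder stages with different channel counts feed the same decoder width, which is what makes the four branches interchangeable in structure.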
d) The output prediction part first passes through a 3 × 3 convolution block containing 1 convolution layer with kernel size 3 × 3, dilation rate 1 × 1, stride 1, and 256 channels, 1 BatchNorm layer, and 1 ReLU layer, and then splits into three parallel branches. Each branch consists of a 3 × 3 convolution block and a 1 × 1 convolution block; the 3 × 3 convolution blocks of the three branches are identical, each containing 1 convolution layer with kernel size 3 × 3, dilation rate 1 × 1, and 128 channels, 1 BatchNorm layer, and 1 ReLU layer. The 1 × 1 convolution block of the fifth branch contains 1 convolution layer with kernel size 1 × 1, dilation rate 1 × 1, stride 1, and 45 channels, 1 BatchNorm layer, and 1 ReLU layer; after its two convolution blocks this branch yields an N-channel feature map, where N is the number of traffic sign categories, and the probability that the sign belongs to each of the N categories is computed from it, predicting the sign's category. The other two branches (the sixth and seventh) have identical 1 × 1 convolution blocks, each containing 1 convolution layer with kernel size 1 × 1, dilation rate 1 × 1, stride 1, and 2 channels, 1 BatchNorm layer, and 1 ReLU layer. The sixth and seventh branches predict, respectively, the center-point coordinates of the traffic sign and the width and height of its detection box: the sixth branch, after its two convolution blocks, yields a two-channel feature map from which two values X and Y are computed, predicting one coordinate, namely the coordinates (X, Y) of the traffic sign's center point; the seventh branch likewise yields a two-channel feature map after its two convolution blocks, from which two values w and h, the width and height of the detection box, are computed. From the prediction information of the fifth, sixth, and seventh branches, the type of the traffic sign can be identified and its position in the picture calculated.
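How the three head outputs combine into one detection can be illustrated with a toy decode step. The patent does not spell out its post-processing, so this single-peak decode (take the heatmap maximum, refine the center with the offset channels, read w/h at the same cell) is an illustrative assumption in the style of anchor-free center-point detectors.

```python
# Illustrative decode step (an assumption; the patent does not detail
# post-processing): take the peak of the class heat map as the sign's class
# and center cell, then read the offset and w/h channels at that cell.
def decode_peak(cls_map, xy_map, wh_map):
    """cls_map: [N][H][W] scores; xy_map/wh_map: [2][H][W]. Returns one box."""
    best = None
    for c, plane in enumerate(cls_map):
        for y, row in enumerate(plane):
            for x, score in enumerate(row):
                if best is None or score > best[0]:
                    best = (score, c, x, y)
    score, c, x, y = best
    cx = x + xy_map[0][y][x]          # refine the center with predicted offsets
    cy = y + xy_map[1][y][x]
    w, h = wh_map[0][y][x], wh_map[1][y][x]
    return c, (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2), score

# Tiny 2-class, 2x2 example: the peak is class 1 at cell (0, 0), box 2x2.
cls_map = [[[0.1, 0.2], [0.3, 0.4]],
           [[0.9, 0.0], [0.0, 0.0]]]
xy_map = [[[0.0, 0.0], [0.0, 0.0]],
          [[0.0, 0.0], [0.0, 0.0]]]
wh_map = [[[2.0, 0.0], [0.0, 0.0]],
          [[2.0, 0.0], [0.0, 0.0]]]
c, box, score = decode_peak(cls_map, xy_map, wh_map)
print(c, box, score)  # 1 (-1.0, -1.0, 1.0, 1.0) 0.9
```

A real decoder would keep all local maxima above a score threshold rather than a single peak, but the per-cell arithmetic is the same.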
(2) The structure of the dense block (Dense Block, L = 6) is shown in fig. 2; it contains 6 densely connected layers (Dense Layer). In general a dense block is composed of L densely connected layers. The neural network adopts DenseNet121 as the encoding path; its four dense blocks contain 6, 12, 24, and 16 densely connected layers respectively, for thorough extraction of image features. Each densely connected layer consists of one 3 × 3 convolution block and one 1 × 1 convolution block; the 3 × 3 convolution block contains 1 convolution layer with kernel size 3 × 3, dilation rate 1 × 1, and stride 1, 1 BatchNorm layer, and 1 ReLU layer, and the 1 × 1 convolution block contains 1 convolution layer with kernel size 1 × 1, dilation rate 1 × 1, and stride 1, 1 BatchNorm layer, and 1 ReLU layer.
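The dense connectivity described above can be sketched as follows. Note one assumption: the text lists each dense layer as a 3 × 3 block and a 1 × 1 block, while this sketch follows the standard DenseNet bottleneck ordering (1 × 1 then 3 × 3) with growth rate 32, as in DenseNet121.

```python
import torch
import torch.nn as nn

# Sketch of a dense block: each dense layer produces `growth` new channels,
# and every layer receives the concatenation of all previous outputs.
class DenseLayer(nn.Module):
    def __init__(self, in_ch, growth=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(True),
            nn.Conv2d(in_ch, 4 * growth, 1, bias=False),   # 1x1 bottleneck
            nn.BatchNorm2d(4 * growth), nn.ReLU(True),
            nn.Conv2d(4 * growth, growth, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.block(x)], dim=1)  # dense connectivity

class DenseBlock(nn.Sequential):
    def __init__(self, in_ch, num_layers=6, growth=32):
        # layer i sees in_ch + i*growth input channels (all earlier outputs)
        super().__init__(*[DenseLayer(in_ch + i * growth, growth)
                           for i in range(num_layers)])

out = DenseBlock(64, num_layers=6)(torch.randn(1, 64, 32, 32))
print(out.shape)  # 64 + 6*32 = 256 output channels
```

The channel count growing by a fixed amount per layer is what lets DenseNet reuse features with comparatively few parameters.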
(3) The channel attention module, shown in fig. 3, has three branches (the eighth, ninth, and tenth). The eighth branch applies global max pooling (MaxPool) to the input feature map, then a linear transformation (Linear), then normalization with a Sigmoid function. The ninth branch applies global average pooling (AvgPool) to the input feature map, then a linear transformation (Linear), then normalization with a Sigmoid function, and its result is added to the output of the eighth branch. The sum of the eighth and ninth branch outputs is normalized once more by a Sigmoid function and multiplied by the feature map of the tenth branch, which is the module's original input feature map.
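The three-branch module of fig. 3 maps directly onto a few tensor operations. Whether the two Linear layers share weights is not stated; this sketch assumes separate weights per branch.

```python
import torch
import torch.nn as nn

# Sketch of the channel attention module: parallel global max and average
# pooling, a linear transform and sigmoid per branch, a second sigmoid on
# their sum, and channel-wise rescaling of the original input.
class ChannelAttention(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.fc_max = nn.Linear(channels, channels)  # eighth-branch Linear
        self.fc_avg = nn.Linear(channels, channels)  # ninth-branch Linear

    def forward(self, x):
        b, c, _, _ = x.shape
        flat = x.view(b, c, -1)
        mx = torch.sigmoid(self.fc_max(flat.max(dim=2).values))  # MaxPool branch
        av = torch.sigmoid(self.fc_avg(flat.mean(dim=2)))        # AvgPool branch
        weights = torch.sigmoid(mx + av)       # add, then normalize once more
        return x * weights.view(b, c, 1, 1)    # tenth branch: rescale the input

y = ChannelAttention()(torch.randn(2, 256, 16, 16))
print(y.shape)  # torch.Size([2, 256, 16, 16])
```

Because the weights lie in (0, 1) per channel, effective channels are passed through nearly unchanged while ineffective ones are suppressed, matching the stated goal of the module.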
203: training a model;
and inputting the image which is cut in the first step into the traffic sign recognition neural network which is built in the second step and is based on the dense connection and attention mechanism, and obtaining the category information of the traffic sign and the position information of the detection frame through forward propagation. And calculating the error of the traffic sign category and position information predicted by the neural network and the label information in the ground route, reversely propagating the error term from the output layer to the hidden layer by layer, updating network parameters until the network parameters reach the input layer, and continuously feeding back and optimizing by using an ADAM (adaptive dynamic adaptive analysis) optimizer until the error is not reduced any more.
The batch_size of the network is set to 4, i.e., 4 traffic sign pictures of size 512 × 512 are trained in each iteration; the epoch is set to 110, i.e., the whole network requires 110 rounds of training. The trained network parameters are stored as a model.
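The Adam parameter update used by the optimizer above can be sketched as follows. The hyperparameters (learning rate, β1 = 0.9, β2 = 0.999, ε = 1e-8) are the common defaults and are an assumption; the patent does not specify them.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update over a flat list of parameters.

    Assumption: default Adam hyperparameters; t is the 1-based step count
    used for bias correction.
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(theta, grad, m, v):
        mi = b1 * mi + (1 - b1) * g          # first-moment (mean) estimate
        vi = b2 * vi + (1 - b2) * g * g      # second-moment (variance) estimate
        m_hat = mi / (1 - b1 ** t)           # bias-corrected moments
        v_hat = vi / (1 - b2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

# Toy usage: a single parameter with gradient 0.5 at the first step.
theta, m, v = adam_step([1.0], [0.5], [0.0], [0.0], t=1)
```

At the first step the bias correction makes the update approximately lr · sign(grad), so the parameter moves from 1.0 to roughly 0.999 regardless of the gradient's magnitude, which is what makes Adam robust to gradient scale early in training.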
204: a traffic sign picture to be detected and recognized is input, and the model trained in the third step is loaded, whereupon the recognition result picture is output, as shown in fig. 4.
205: the invention uses precision ratio (AP) and recall ratio (AR) to measure the effect of the algorithm. 3073 test set pictures are input for detection and calculation, and then AP is 95.5 and AR is 99.6 are calculated.
In the embodiments of the present invention, unless the model of a device is specifically described, the model of the device is not limited, as long as the device can perform the functions described above.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A traffic sign identification method based on a dense connection and channel attention mechanism is characterized by comprising the following steps:
constructing a data set and carrying out data preprocessing;
building a traffic sign recognition neural network based on the dense connection and attention mechanism through a deep learning framework;
inputting the pictures in the training set into the neural network, obtaining traffic sign category and position information through forward propagation, calculating the error against the ground-truth information, performing backward propagation, and continuously updating the network parameters until the error no longer decreases;
inputting a picture with a traffic sign, loading the trained model, and outputting a traffic sign recognition result picture;
wherein the building of the traffic sign recognition neural network based on the dense connection and attention mechanism through the deep learning framework comprises the following steps:
the coding path, the decoding path, the output prediction part and the four parallel branches jointly form a U-shaped coding and decoding network;
wherein the decoding path is divided into four parts,
each part consists of a residual block and a channel attention module; the feature map of the decoding path first passes through the residual block and then enters the channel attention module, so as to increase the weight of effective channels and reduce the weight of ineffective channels.
2. The traffic sign recognition method based on the dense connection and channel attention mechanism as claimed in claim 1, wherein the number of the parallel branches is four,
the first branch inputs the output of the fourth part of the coding path into the decoding path;
the second branch performs feature fusion on the output of the third part of the coding path and the output of the first part of the decoding path, and the result serves as the input of the second part of the decoding path;
the third branch performs feature fusion on the output of the second part of the coding path and the output of the second part of the decoding path, and the result serves as the input of the third part of the decoding path;
and the fourth branch performs feature fusion on the output of the first part of the coding path and the output of the third part of the decoding path, and the result serves as the input of the fourth part of the decoding path.
3. The traffic sign recognition method based on the dense connection and channel attention mechanism as claimed in claim 1, wherein the output prediction part is formed by three further branches connected in parallel,
the fifth branch passes through two convolution blocks to obtain a feature map of N channels, where N is the number of traffic sign categories, and the probabilities that the traffic sign belongs to each of the N categories are calculated from the N-channel feature map;
the sixth branch passes through two convolution blocks to obtain a feature map of two channels; from the two channels of the feature map two values X and Y are obtained, predicting one coordinate, namely the coordinate of the center point of the traffic sign;
and the last branch likewise passes through two convolution blocks to obtain a feature map of two channels, and from the two channels of the feature map two values w and h are obtained, namely the width and the height of the traffic sign detection frame.
CN202010255951.9A 2020-04-02 2020-04-02 Traffic sign identification method based on dense connection and attention mechanism Active CN111582029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010255951.9A CN111582029B (en) 2020-04-02 2020-04-02 Traffic sign identification method based on dense connection and attention mechanism


Publications (2)

Publication Number Publication Date
CN111582029A CN111582029A (en) 2020-08-25
CN111582029B true CN111582029B (en) 2022-08-12

Family

ID=72122485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010255951.9A Active CN111582029B (en) 2020-04-02 2020-04-02 Traffic sign identification method based on dense connection and attention mechanism

Country Status (1)

Country Link
CN (1) CN111582029B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036467B (en) * 2020-08-27 2024-01-12 北京鹰瞳科技发展股份有限公司 Abnormal heart sound identification method and device based on multi-scale attention neural network
CN112163506A (en) * 2020-09-25 2021-01-01 伏羲九针智能科技(北京)有限公司 Vein blood vessel identification method, device and equipment based on ultrasound
CN112364193A (en) * 2020-11-17 2021-02-12 同济大学 Image retrieval-oriented method for fusing multilayer characteristic deep neural network model
CN112598126A (en) * 2020-12-04 2021-04-02 北京迈格威科技有限公司 Neural network construction method, device, equipment and medium
CN113887373B (en) * 2021-09-27 2022-12-16 中关村科学城城市大脑股份有限公司 Attitude identification method and system based on urban intelligent sports parallel fusion network
CN114463772B (en) * 2022-01-13 2022-11-25 苏州大学 Deep learning-based traffic sign detection and identification method and system
CN116721403A (en) * 2023-06-19 2023-09-08 山东高速集团有限公司 Road traffic sign detection method
CN116978051A (en) * 2023-08-03 2023-10-31 杭州海量信息技术有限公司 Method and device for extracting key information of form image

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619279A (en) * 2019-08-22 2019-12-27 天津大学 Road traffic sign instance segmentation method based on tracking
CN110909674A (en) * 2019-11-21 2020-03-24 清华大学苏州汽车研究院(吴江) Traffic sign identification method, device, equipment and storage medium
CN110930397A (en) * 2019-12-06 2020-03-27 陕西师范大学 Magnetic resonance image segmentation method and device, terminal equipment and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Attention-based Neural Network for Traffic Sign Detection; J. Zhang et al.; 《2018 24th International Conference on Pattern Recognition (ICPR)》; 20181129; entire document *
Small object detection algorithm with a multi-scale non-local attention network; Li Jinbao et al.; 《计算机科学与探索》; 20191231; vol. 14, no. 10, pp. 1744-1749 *
Traffic sign recognition combined with an attention mechanism; Ma Ping et al.; 《科技风》; 20190726 (no. 21); entire document *


Similar Documents

Publication Publication Date Title
CN111582029B (en) Traffic sign identification method based on dense connection and attention mechanism
CN107665603B (en) Real-time detection method for judging parking space occupation
CN111695448B (en) Roadside vehicle identification method based on visual sensor
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN111079604A (en) Method for quickly detecting tiny target facing large-scale remote sensing image
CN111753682B (en) Hoisting area dynamic monitoring method based on target detection algorithm
CN114898352A (en) Method for simultaneously realizing image defogging and license plate detection
CN108304786A (en) A kind of pedestrian detection method based on binaryzation convolutional neural networks
CN111462140B (en) Real-time image instance segmentation method based on block stitching
CN115205264A (en) High-resolution remote sensing ship detection method based on improved YOLOv4
CN114267025A (en) Traffic sign detection method based on high-resolution network and light-weight attention mechanism
CN113313031B (en) Deep learning-based lane line detection and vehicle transverse positioning method
CN115861951B (en) Complex environment lane line accurate detection method based on dual-feature extraction network
CN112819000A (en) Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
CN115100549A (en) Transmission line hardware detection method based on improved YOLOv5
CN115661032A (en) Intelligent pavement disease detection method suitable for complex background
CN110852157A (en) Deep learning track line detection method based on binarization network
CN103605960B (en) A kind of method for identifying traffic status merged based on different focal video image
CN112785610B (en) Lane line semantic segmentation method integrating low-level features
CN116129327A (en) Infrared vehicle detection method based on improved YOLOv7 algorithm
CN114782949B (en) Traffic scene semantic segmentation method for boundary guide context aggregation
CN114708560B (en) YOLOX algorithm-based illegal parking detection method and system
CN114820931B (en) Virtual reality-based CIM (common information model) visual real-time imaging method for smart city
CN116503709A (en) Vehicle detection method based on improved YOLOv5 in haze weather

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant