CN110188705B - Remote traffic sign detection and identification method suitable for vehicle-mounted system - Google Patents


Info

Publication number
CN110188705B
CN110188705B (application CN201910474059.7A)
Authority
CN
China
Prior art keywords
attention
convolution
channel
loss
sample
Prior art date
Legal status
Active
Application number
CN201910474059.7A
Other languages
Chinese (zh)
Other versions
CN110188705A (en)
Inventor
刘志刚
杜娟
田枫
韩玉祥
高雅田
张可佳
Current Assignee
Northeast Petroleum University
Original Assignee
Northeast Petroleum University
Priority date
Filing date
Publication date
Application filed by Northeast Petroleum University
Priority to CN201910474059.7A
Publication of CN110188705A
Application granted
Publication of CN110188705B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of traffic signs

Abstract

The invention relates to a remote traffic sign detection and identification method suitable for a vehicle-mounted system, which comprises the following steps: 1. preprocess the traffic sign image sample set; 2. construct a lightweight convolutional neural network to complete the convolutional feature extraction of the traffic signs; 3. construct an attention feature map through a channel-space attention module embedded in the lightweight convolutional neural network; 4. generate candidate regions of the target with the region generation network RPN; 5. introduce context region information into the target candidate regions generated by the RPN to enhance the sign classification features; 6. send the feature vectors into the fully-connected layer and output the classes and positions of the traffic signs; 7. establish an attention loss function and train the FL-CNN model; 8. repeat steps 2 to 7 to complete the sample training of the FL-CNN model; 9. repeat steps 2 to 6 to complete the detection and identification of traffic signs in the actual scene. The invention realizes detection and identification of long-distance traffic signs with an accuracy reaching 92%.

Description

Remote traffic sign detection and identification method suitable for vehicle-mounted system
First, technical field:
the invention relates to the field of intelligent transportation oriented to unmanned and assisted driving, provides a long-distance detection and identification method for road traffic signs, and particularly relates to a long-distance traffic sign detection and identification method suitable for a vehicle-mounted system.
Second, background art:
in the field of intelligent transportation, traffic sign detection and identification is an important research problem for unmanned driving, assisted driving, and other systems. Much research has been carried out at home and abroad, but the methods still have great shortcomings and cannot be applied in practice, for the following reasons: (1) traditional detection and identification methods designed around features such as color and shape have poor robustness in the face of sign deformation, motion blur, weather, and other conditions of actual traffic scenes, and are difficult to apply in practice; (2) in existing detection and identification methods based on deep learning, the parameter files exported from the model are huge and require large memory and hard disk storage at run time, so these methods cannot run directly on a vehicle-mounted system with low power consumption and hardware performance, and their practicability is poor; (3) in terms of data sets, some methods directly use data sets shot by the authors themselves, in which the amount of data and the variation of the signs are small, so the models are difficult to apply in practice; other methods are based on the published German traffic sign data sets GTSRB and GTSDB, but the signs in these data sets are large and the number of detected sign types is small. Since the data set has a great influence on the performance of the model, all of these methods currently belong to short-distance detection and identification.
The vehicle-mounted system in unmanned and assisted driving is an embedded system, so its power consumption and hardware performance are low and a huge deep learning model cannot be run on it directly. Meanwhile, detecting and identifying signs at long distance provides more response time for driving and plays an important role in improving the safety of intelligent driving. Technically, long-distance traffic sign detection and identification is a problem of small-target recognition against a complex background, which is currently a difficult problem in the field of computer vision, and existing methods find it difficult to obtain high detection and identification accuracy.
Third, summary of the invention:
the invention aims to provide a long-distance traffic sign detection and identification method suitable for a vehicle-mounted system, to solve the problem of low accuracy when existing short-distance detection and identification methods are applied to long-distance traffic signs.
The technical scheme adopted by the invention for solving the technical problems is as follows: the remote traffic sign detection and identification method suitable for the vehicle-mounted system comprises the following steps:
step 1, preprocessing a traffic sign image sample set;
step 2, constructing a lightweight convolutional neural network to complete the convolutional feature extraction of the traffic sign;
(1) the joint channel-space mapping of the original VGG-16 standard convolution is separated into independent channel and spatial mappings by depthwise separable convolution, reducing the number of model parameters and the hard disk storage; the lightweight convolutional neural network comprises 5 convolutional layers in total, each convolutional layer comprising a depthwise convolution part and a pointwise convolution part, with ReLU as the activation function;
(2) in the lightweight convolutional neural network, the computation of the depthwise separable convolution compares with that of the original standard convolution as follows:
let the convolution kernel be (D_K, D_K, C), where D_K is the width and height of the convolution kernel and C is the number of channels of the kernel; in the convolution calculation, depthwise separable convolution converts the N standard convolutions (D_K, D_K, M) of the original cross-channel calculation into M depthwise convolutions (D_K, D_K, 1) and N cross-channel pointwise convolutions (1, 1, M), where the depthwise convolution is a single-channel computation and the pointwise convolution is a cross-channel computation; the input feature map is recorded as {D_F, D_F, M} and the output feature map as {D_F, D_F, N}, where D_F represents the width and height of the feature map; the amount of computation for each convolution is as follows:
① the computation of the standard convolution is: Count_s = D_K × D_K × M × N × D_F × D_F;
② the computation of the depthwise convolution is: Count_d = D_K × D_K × M × D_F × D_F;
③ the computation of the pointwise convolution is: Count_p = M × N × D_F × D_F.
The comparison between the computation of depthwise separable convolution and standard convolution is therefore
(Count_d + Count_p) / Count_s = 1/N + 1/D_K²
that is, each use of depthwise separable convolution reduces the computation to (1/N + 1/D_K²) times that of the original standard convolution;
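To make the computation comparison concrete, the following sketch (in PyTorch; layer sizes and names are illustrative assumptions, not taken from the patent) builds one depthwise separable convolution layer, a depthwise convolution followed by a pointwise convolution with ReLU activations, and counts its parameters against a standard convolution:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """One convolutional layer of the lightweight network: a depthwise convolution
    followed by a pointwise convolution, each activated by ReLU."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # M depthwise kernels (k, k, 1): groups=in_ch makes each a single-channel computation
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
        # N pointwise kernels (1, 1, M): the cross-channel computation
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.pointwise(self.relu(self.depthwise(x))))

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv2d(128, 256, 3, padding=1)        # D_K = 3, M = 128, N = 256
separable = DepthwiseSeparableConv(128, 256, 3)
print(count_params(standard), count_params(separable))  # 295168 vs 34304 parameters
# Predicted computation ratio: 1/N + 1/D_K^2 = 1/256 + 1/9 ≈ 0.115
```

With M = 128, N = 256 and D_K = 3, the predicted ratio 1/N + 1/D_K² ≈ 0.115 matches the roughly 8.6× parameter reduction seen above.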
step 3, constructing an attention feature map through the channel-space attention module embedded in the lightweight convolutional neural network;
the channel-space attention module is embedded into the depthwise separable convolutional layers, and feature attention along two dimensions, channel and spatial, is applied to the output feature map of each depthwise separable convolutional layer; channel attention uses the interrelation and relative importance of the channels to attend to "what" is most meaningful in the image, while spatial attention attends to the location features of the target in the image, i.e. "where", which is more effective for image detection and identification;
denote the output feature map of a depthwise separable convolutional layer as U = [u_1, u_2, …, u_C], u_i ∈ R^(H×W), where C is the number of channels and H and W are the height and width of the feature map. After passing through the channel-space attention module, the constructed attention feature map is Y = [y_1, y_2, …, y_C], y_i ∈ R^(H×W). The calculation proceeds as follows:
step 3.1 channel attention
For calculating the channel attention, firstly, compressing the spatial dimension of each channel in a feature map along the channel direction, and aggregating spatial information by using three global pooling modes of maximum, average and random respectively, wherein the maximum and average pooling respectively retains the texture and background features of an image, and the random pooling is between the maximum and average pooling;
first, according to the global maximum pooling, each u is respectivelyiCompression to channel attention mask component
Figure GDA0002114862600000034
It is defined as:
Figure GDA0002114862600000035
compressing each u according to global average pooling and global random poolingiTo channel attention force mask component SmeanAnd SstoWherein it is defined as:
Figure GDA0002114862600000036
Figure GDA0002114862600000037
wherein
Figure GDA0002114862600000038
Secondly, three channel attention force mask components constructed by pooling compression are respectively used as the input of a multilayer perceptron model, and through point-by-point phase of weight parameters and the mask componentsThe multiply, accumulate, and activate functions complete the aggregation, further increasing the non-linearity. Channel attention mask S ═ S for feature graph U1,s2,…,sC]Is defined as follows:
S=σ(W1δ(W0Smax)+W1δ(W0Smean)+W1δ(W0Ssto))
where σ is the sigmoid function and δ is the ReLu function. W0And W1These parameters are shared for the three channel attention mask components, which are weights of the multi-layered perceptron model;
finally, the channel attention mask is expanded onto the original input feature map U, and the weight of each channel of U is recalibrated according to the mask. The new feature map after channel attention mapping is recorded as X = [x_1, x_2, …, x_C], x_i ∈ R^(H×W), and is specifically defined as:
x_i = s_i · u_i
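A minimal sketch of the channel attention described above, assuming a PyTorch implementation; the reduction ratio of the shared MLP and the exact stochastic-pooling sampling are assumptions, since the patent does not fix them in this text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """S = sigmoid(W1·relu(W0·S_max) + W1·relu(W0·S_mean) + W1·relu(W0·S_sto)),
    with the MLP weights W0, W1 shared across the three pooled components."""
    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio is an assumption
        super().__init__()
        self.w0 = nn.Linear(channels, channels // reduction)  # W0
        self.w1 = nn.Linear(channels // reduction, channels)  # W1

    def _stochastic_pool(self, u: torch.Tensor) -> torch.Tensor:
        # Sample one activation per channel, with probability proportional to its value
        b, c, _, _ = u.shape
        flat = u.flatten(2).clamp_min(1e-6)                   # (B, C, H*W); ReLU outputs are >= 0
        probs = flat / flat.sum(dim=2, keepdim=True)
        idx = torch.multinomial(probs.view(b * c, -1), 1).view(b, c, 1)
        return flat.gather(2, idx).squeeze(2)                 # (B, C)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        s_max = F.adaptive_max_pool2d(u, 1).flatten(1)        # S_max: global max pooling
        s_mean = F.adaptive_avg_pool2d(u, 1).flatten(1)       # S_mean: global average pooling
        s_sto = self._stochastic_pool(u)                      # S_sto: global stochastic pooling
        mlp = lambda s: self.w1(F.relu(self.w0(s)))           # W1 · delta(W0 · s)
        s = torch.sigmoid(mlp(s_max) + mlp(s_mean) + mlp(s_sto))
        return u * s.unsqueeze(-1).unsqueeze(-1)              # X: each channel u_i scaled by s_i
```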
step 3.2 spatial attention
to compute the spatial attention mapping and construct the feature relations between pixels or regions of the feature map, the number of channels of the feature map is first compressed: a set of pointwise convolutions aggregates the original input feature map of the depthwise separable convolutional layer across channels, and the aggregated feature map is recorded as M ∈ R^(H×W);
secondly, because feature maps of different layers differ greatly in character (shallow feature maps have higher resolution, while deep feature maps, in contrast, contain more abstract semantic features), the spatial attention mask is computed on the shallow and deep feature maps by region and by pixel respectively, in order to reduce parameters and computation; the spatial attention mask N ∈ R^(H×W) is defined as follows:
N = Softmax(Conv(M, o, k, s, p))
where Conv(·) denotes a convolution operation with output channel o = 1, convolution kernel size k = 1 for the shallow convolutional layers and k = 3 for the deep ones, and s = 1 and p = 0 are the stride and padding of the convolution respectively; furthermore, to eliminate the influence of different feature map scales, the spatial attention mask is normalized by the Softmax function;
step 3.3 attention feature map
on the basis of the feature map X after channel attention mapping, the expansion of the spatial attention mask N is performed again: the spatial attention of each channel of X is calibrated according to the mask, finally generating the output feature map Y of the depthwise separable convolutional layer, which serves as the input of the next depthwise separable convolutional layer. It is defined as:
y_i = N ⊙ x_i
where ⊙ denotes point-by-point multiplication;
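Continuing the sketch under the same assumptions, the spatial attention mask and the final attention feature map Y = N ⊙ X might look as follows; the interpolation used to restore the mask size when k = 3 with p = 0 shrinks it is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Spatial mask N = Softmax(Conv(M, o=1, k, s=1, p=0)), broadcast over the channels of X."""
    def __init__(self, channels: int, k: int = 1):  # k = 1 for shallow layers, 3 for deep ones
        super().__init__()
        self.aggregate = nn.Conv2d(channels, 1, kernel_size=1)           # pointwise aggregation -> M
        self.conv = nn.Conv2d(1, 1, kernel_size=k, stride=1, padding=0)  # o = 1, s = 1, p = 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m = self.aggregate(x)                                # (B, 1, H, W)
        n = self.conv(m)                                     # (B, 1, H', W')
        b, _, h, w = n.shape
        n = F.softmax(n.flatten(1), dim=1).view(b, 1, h, w)  # normalize over all positions
        # With k = 3 and p = 0 the mask is 2 pixels smaller than X, so it is resized
        # back here; how the patent handles this is not stated (an assumption).
        n = F.interpolate(n, size=x.shape[2:], mode="nearest")
        return x * n                                         # Y = N ⊙ X

# Illustrative use of one attention-augmented layer:
# y = SpatialAttention(256)(ChannelAttention(256)(u))
```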
step 4, on the basis of the attention feature map, generating a candidate region of a target by using a region generation network RPN;
the RPN locates the regions of the traffic scene where traffic signs may appear, and the FL-CNN then classifies the signs according to these regions;
step 5, introducing context area information to a target candidate area generated by the RPN, and enhancing the classification characteristics of the marks;
the target candidate regions given in step 4 may contain only partial features of a traffic sign, so the spatially neighboring features of each candidate region are introduced to enhance the classification features of the sign; the specific steps are as follows:
(1) for convenience of description, a target candidate region is denoted as p = (p_x, p_y, p_w, p_h), where (p_x, p_y) is the center of the region and (p_w, p_h) represents the width and height of the region; on the attention feature map output by the last depthwise separable convolution, scale factors λ_1 and λ_2 are used to create context regions whose center coordinates are the same as those of the corresponding target candidate region; the relationship between a context region and its candidate region can be described as follows, where i is the serial number of the context region:
g_i = (p_x, p_y, λ_i · p_w, λ_i · p_h), i = 1, 2;
(2) each target candidate region and its context regions are divided into 7 parts in the horizontal and vertical directions by RoI-Pooling, and max-pooling downsampling is performed on each part, so that even regions of different sizes yield outputs of consistent dimension, generating 3 feature vectors of fixed size 7 × 7 × 512;
(3) along the spatial dimension, the feature vectors are concatenated into a 3 × 7 × 7 × 512 feature vector;
(4) a 1 × 1 convolution compresses the feature vector formed in (3) to 7 × 7 × 512, so that the dimension of the context-enhanced feature vector meets the node requirements of the fully-connected layer; the 1 × 1 convolution also learns the non-linear relation between background and target: when the introduced context region contains complex background, the convolution parameters suppress the background, and conversely, if local features of the target are introduced, the convolution parameters strengthen those features, as sketched below;
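A sketch of the context region pooling in steps (1) to (4), with torchvision's roi_align standing in for RoI-Pooling; the scale factor values λ are illustrative assumptions, since the patent's figures for them are not reproduced in this text:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ContextRegionPooling(nn.Module):
    """Pool each candidate box plus two concentric context boxes to 7x7x512,
    concatenate along channels, and compress back with a 1x1 convolution."""
    def __init__(self, channels: int = 512, scales=(1.0, 1.5, 2.0)):  # context scales are assumptions
        super().__init__()
        self.scales = scales
        self.compress = nn.Conv2d(len(scales) * channels, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor, boxes: torch.Tensor, stride: float) -> torch.Tensor:
        # feat: (1, 512, H, W) attention feature map; boxes: (R, 4) as (x1, y1, x2, y2)
        cx = (boxes[:, 0] + boxes[:, 2]) / 2
        cy = (boxes[:, 1] + boxes[:, 3]) / 2
        w = boxes[:, 2] - boxes[:, 0]
        h = boxes[:, 3] - boxes[:, 1]
        pooled = []
        for lam in self.scales:  # lam = 1.0 is the candidate region itself
            scaled = torch.stack([cx - lam * w / 2, cy - lam * h / 2,
                                  cx + lam * w / 2, cy + lam * h / 2], dim=1)
            rois = torch.cat([scaled.new_zeros(scaled.size(0), 1), scaled], dim=1)  # batch index 0
            pooled.append(roi_align(feat, rois, output_size=(7, 7), spatial_scale=1.0 / stride))
        stacked = torch.cat(pooled, dim=1)  # (R, 3*512, 7, 7)
        return self.compress(stacked)       # (R, 512, 7, 7), ready for the FC layers
```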
step 6, sending the feature vectors into a full-connection layer for classification and regression, and outputting the classes and positions of the traffic signs;
two fully-connected (FC) networks are adopted: the first FC network classifies the traffic sign, with a hidden layer of 4096 nodes and 44 output nodes, each output node representing one traffic sign with a value in the range (0, 1); at classification time the node with the largest output is taken as the traffic sign category; the second FC network performs position regression of the traffic sign, with a hidden layer of 4096 nodes and 4 output nodes representing the center-point coordinates, width, and height of the traffic sign;
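The two FC heads can be sketched as follows (a direct reading of the node counts above; treating the 4 regression outputs as center coordinates plus width and height is an assumption):

```python
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Two FC networks over the 7x7x512 context-enhanced feature vector."""
    def __init__(self, num_classes: int = 44):
        super().__init__()
        in_dim = 7 * 7 * 512
        self.cls_head = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 4096),
                                      nn.ReLU(inplace=True), nn.Linear(4096, num_classes))
        self.reg_head = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 4096),
                                      nn.ReLU(inplace=True), nn.Linear(4096, 4))

    def forward(self, x):
        scores = self.cls_head(x).softmax(dim=1)  # each node in (0, 1); argmax gives the sign class
        box = self.reg_head(x)                    # center x, center y, width, height (assumed layout)
        return scores, box
```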
step 7, establishing an attention loss function and training the FL-CNN model;
to ensure that the model is fully trained and to improve its generalization ability, an attention loss function is established that effectively distinguishes hard samples from easy ones, suppressing the loss of easily-classified samples and enhancing the loss of hard samples; FL-CNN training comprises two parts, the RPN and the fully-connected layer network, where the loss of the RPN consists of a binary classification loss and a regression loss, and the loss of the fully-connected network consists of a multi-classification loss and a regression loss;
(1) the attention loss function of the RPN network is as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_att(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*)
where p_i indicates the predicted probability that the i-th anchor is the target object and p_i* represents the real label of the object; t_i is a vector containing the center coordinates and the width and height information of the prediction box, and t_i* is the information vector of the real box; N_cls denotes the total number of anchors, N_reg denotes the size of the feature map, and λ is an adjustment coefficient; L_reg represents the regression loss of all the bounding boxes, and L_att is the attention binary classification loss;
the attention binary classification loss provided by the invention is specifically defined as follows:
L_att(x) = −p* · σ(−Kx) · log σ(x) − (1 − p*) · σ(Kx) · log σ(−x)
where σ is the sigmoid function, x is the predicted logit, −log σ(x) is the loss of a foreground sample, −log σ(−x) is the loss of a background sample, and K is a constant. The loss function has the following characteristic: if a sample is easily classified, −log σ(x) → 0 or −log σ(−x) → 0, that is, σ(x) → 1 or σ(−x) → 1; then, when K takes a large value, the loss adjustment coefficient σ(−Kx) → 0 in the foreground sample loss and the loss adjustment coefficient σ(Kx) → 0 in the background sample loss; if the sample is hard to classify, the loss adjustment coefficients of the foreground and background samples are σ(−Kx) → 1 and σ(Kx) → 1 respectively; the attention loss function therefore effectively distinguishes hard samples from easy ones and, by suppressing the loss of easy samples, makes RPN learning and training pay more attention to hard samples, ensuring that the RPN network is fully trained;
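A sketch of the attention binary classification loss as reconstructed above, with the modulation coefficients σ(−Kx) and σ(Kx) multiplying the foreground and background cross-entropy terms; the default value of K is an assumption:

```python
import torch
import torch.nn.functional as F

def attention_bce_loss(logits: torch.Tensor, labels: torch.Tensor, k: float = 2.0) -> torch.Tensor:
    """Foreground: -sigmoid(-Kx)·log sigmoid(x); background: -sigmoid(Kx)·log sigmoid(-x).
    Easy samples get a modulation coefficient near 0, hard samples near 1."""
    log_pos = F.logsigmoid(logits)       # log sigmoid(x)
    log_neg = F.logsigmoid(-logits)      # log sigmoid(-x)
    w_pos = torch.sigmoid(-k * logits)   # -> 0 for easy foreground, -> 1 for hard
    w_neg = torch.sigmoid(k * logits)    # -> 0 for easy background, -> 1 for hard
    loss = -(labels * w_pos * log_pos + (1 - labels) * w_neg * log_neg)
    return loss.mean()

# An easy foreground (x = 4) contributes almost nothing; a hard one (x = -4) keeps its loss:
# attention_bce_loss(torch.tensor([4.0, -4.0]), torch.tensor([1.0, 1.0]))
```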
(2) the attention loss function of the fully-connected layer network is as follows:
L = (1/N_cls) Σ_k L_mcls(x_k) + λ · (1/N_reg) Σ L_reg
where δ is the softmax function; similar to the RPN loss, it comprises a multi-classification loss and a regression loss, where L_mcls(x_k) = −δ(−Kx_k) · log δ(x_k) is the attention multi-classification loss, whose functional property is consistent with the attention binary classification loss: when the loss −log δ(x_k) of a sample tends to 0, its weight δ(−Kx_k) → 0, and conversely, when −log δ(x_k) is large, the weight δ(−Kx_k) → 1;
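A corresponding sketch of the attention multi-classification loss; reading δ(−Kx_k) as the k-th component of a softmax over the negated, scaled logits is an assumption drawn from the notation:

```python
import torch
import torch.nn.functional as F

def attention_ce_loss(logits: torch.Tensor, target: torch.Tensor, k: float = 2.0) -> torch.Tensor:
    """-delta(-K·x_k) · log delta(x_k) for the true class k: easy samples are down-weighted."""
    log_p = F.log_softmax(logits, dim=1).gather(1, target.unsqueeze(1)).squeeze(1)  # log delta(x_k)
    w = F.softmax(-k * logits, dim=1).gather(1, target.unsqueeze(1)).squeeze(1)     # delta(-K·x_k)
    return -(w.detach() * log_p).mean()  # weight detached so it only modulates (an assumption)
```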
Step 8, repeating steps 2 to 7 to finish the sample training of the FL-CNN model;
and step 9, starting the color camera to photograph the actual traffic scene, preprocessing the scene before inputting it to the model by setting the resolution to 2048 × 2048, inputting the scene into the FL-CNN model, and repeating steps 2 to 6 to finish the detection and identification of the traffic signs of the actual scene.
The step 1 in the scheme is specifically as follows:
(1) adopting the Tsinghua-Tencent 100K data set jointly published by Tsinghua University and Tencent, and selecting 44 types of common traffic signs as the remote detection and identification objects;
(2) dividing the Tsinghua-Tencent 100K data set into a training set and a test set at a ratio of 1:2;
(3) to ensure sample balance during FL-CNN model training, ensuring that each type of traffic sign has more than 100 scene instances in the training set; if a certain type of sign has fewer than 100 scene instances, filling it by a repeated sampling method, as sketched below.
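The class-balancing rule in (3) can be sketched in plain Python as follows; the sample representation and field name are illustrative:

```python
import random
from collections import defaultdict

def balance_by_oversampling(samples, min_per_class: int = 100):
    """Repeat-sample scene instances of any sign class with fewer than 100 examples.
    `samples` is a list of dicts with an illustrative 'label' field."""
    by_class = defaultdict(list)
    for s in samples:
        by_class[s["label"]].append(s)
    balanced = list(samples)
    for label, group in by_class.items():
        shortfall = min_per_class - len(group)
        if shortfall > 0:  # fill by repeated sampling with replacement
            balanced.extend(random.choices(group, k=shortfall))
    return balanced
```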
The step 4 in the scheme is specifically as follows:
(1) on the attention feature map output by the last depthwise separable convolution, taking each pixel as an anchor point, and adopting 3 ratios (1:1, 1:2, 2:1) and 3 sizes (4, 8, 16) to generate 9 anchors on the original traffic scene;
(2) removing anchor boxes that exceed the boundary of the original input image;
(3) removing heavily overlapping anchor boxes by non-maximum suppression, with a threshold of 0.7;
(4) determining positive and negative samples from the intersection-over-union IoU of the anchor box with the true target in the sample, where IoU > 0.7 gives a positive sample, IoU < 0.3 gives a negative sample, and anchor boxes with IoU between 0.3 and 0.7 are likewise removed; IoU is calculated as follows (see the sketch after this list):
IoU = area(A ∩ B) / area(A ∪ B)
where A is the anchor box and B is the real target box;
(5) according to translation invariance, each anchor box corresponds to a region suggestion box on the fused feature map;
(6) after all the region suggestion boxes pass through the fully-connected layer of the region generation network RPN, obtaining the target candidate regions.
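A sketch of the IoU computation and the anchor labelling rule in (4); the (x1, y1, x2, y2) box layout is an assumption:

```python
import torch

def iou_matrix(anchors: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """IoU = area(A ∩ B) / area(A ∪ B) for every anchor/ground-truth pair."""
    lt = torch.max(anchors[:, None, :2], gt[None, :, :2])  # intersection top-left
    rb = torch.min(anchors[:, None, 2:], gt[None, :, 2:])  # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """1 = positive (IoU > 0.7), 0 = negative (IoU < 0.3), -1 = removed (in between)."""
    best = iou_matrix(anchors, gt).max(dim=1).values
    labels = torch.full((len(anchors),), -1, dtype=torch.long)
    labels[best > 0.7] = 1
    labels[best < 0.3] = 0
    return labels
```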
Advantageous effects:
1. The invention mainly adopts deep learning technology to solve the remote detection and identification of road traffic signs. The method can run directly on a vehicle-mounted system with low power consumption and hardware performance; test results show that it effectively saves computing resources, the model size is only 76 Mb, and the identification accuracy reaches 92%, meeting the remote traffic sign detection and identification requirements of vehicle-mounted systems.
2. The invention establishes a lightweight attention-based convolutional neural network; the reasons for and advantages of establishing it are as follows:
Reason analysis: existing traffic sign detection models extract image features directly with a convolutional neural network. Although feature learning and extraction can thus be completed automatically from a massive data set, the exported model parameters of such a network occupy a large storage space (for example, the VGG-16 network reaches 527 Mb) and require a large memory and power consumption at run time. The vehicle-mounted system is an embedded low-power system with low hardware performance, so these huge models cannot be run directly to detect traffic signs. In the prior art, the detection and identification model is run on a remote server: the vehicle-mounted camera photographs the traffic scene, transmits it to the server through the network, and then receives the detection and identification result from the server through the network.
Advantage analysis: aiming at the above problems, the invention designs a lightweight attention-based convolutional neural network (FL-CNN). The model has the following advantages:
(1) in the convolutional feature extraction process, the FL-CNN replaces the original standard convolution of each convolutional layer with depthwise separable convolution, greatly reducing the number of model parameters and compressing the model storage space to 76 Mb, making it suitable for a vehicle-mounted system;
(2) in the detection stage of the FL-CNN, a channel-space attention module is devised: the original input feature map of the depthwise separable convolutional layer is compressed, masked, and expanded to construct an attention feature map with feature suppression or enhancement, so that the model can quickly and accurately detect targets in a complex background. The attention feature map reinforces the salient features and the region position information of the image through the channel and spatial dimensions respectively, so the model not only attends to this information during computation, saving computing resources, but also effectively improves detection speed and accuracy through the attended channel features and region reinforcement;
(3) in the identification stage, the FL-CNN devises an attention context mechanism. The target region extracted in the detection stage often covers the sign region incompletely, losing part of the sign's critical information. The attention context mechanism introduces local information of the traffic sign through the context of the target region, enhancing the classification features of the sign and effectively compensating for incomplete traffic sign regions from the detection stage. Meanwhile, pointwise convolution dynamically learns the non-linear relation between the target and the context region: if background interference is introduced, it is suppressed by reducing the pointwise convolution parameter values, and if local information of the sign is introduced, it is strengthened by increasing them. The attention context mechanism therefore plays an important role in the classification of traffic signs.
3. The invention establishes the attention loss function; the reasons for and advantages of establishing it are as follows:
Reason analysis: during model training and learning, as the number of iterations increases, more and more samples come to be detected and identified correctly. Although the original cross-entropy loss function suppresses the loss of easily-classified samples to some degree, their large number still dominates the overall training loss, so the training of the remaining hard samples is submerged and the generalization ability of the model suffers severely; that is, although the training accuracy is very high, the detection and identification ability of the model in practical application is weak.
Advantage analysis: aiming at this problem, the invention proposes an attention loss function to ensure that the model is fully trained and to improve its generalization ability. The advantages are as follows:
(1) the function uses a loss adjustment coefficient to enhance or suppress the training loss of a sample, effectively distinguishing hard samples from easy ones; by suppressing the loss of easy samples, the proportion of hard-sample loss in the total training loss is greatly increased, so the training process of the model shifts from treating all samples equally to concentrating on hard samples;
(2) the model focuses on training the hard samples and, on the basis of ensuring that easy samples are trained correctly, effectively prevents hard samples from being submerged by the large number of easy samples during training. Meanwhile, when a hard sample is trained correctly in some iteration, the change of the loss adjustment coefficient converts it into an easy sample, and if an easy sample is trained incorrectly it is converted back into a hard sample, so the loss adjustment process is dynamic;
(3) because the model pays more attention to hard samples in the training stage, it can fully explore the data feature rules hidden in the sample set during training; the trained model has strong generalization ability and high detection and identification accuracy in specific applications.
4. The invention adopts the Tsinghua-Tencent 100K large-scale traffic sign data set jointly issued by Tsinghua University and Tencent. The advantages are as follows:
(1) the data set is drawn from real street-view photographs and comprises 100,000 scene pictures and 30,000 traffic sign instances covering the 3 general categories of prohibition, warning, and indication signs, so its variety is complete. Each traffic scene picture has a resolution of 2048 × 2048, and traffic signs of (0, 32] and (32, 96] pixels account for 41.6% and 49.1% of the data set respectively, which makes it very suitable for training a remote traffic sign detection and recognition model; (2) the data set covers most variations of illumination, weather, and the like, so a target detection model trained on it can adapt to the detection and recognition of complex and changeable remote traffic signs, providing more safety and equipment response time for intelligent navigation equipment in unmanned and assisted driving.
5. Based on the VGG-16 convolutional neural network, the invention adopts depthwise separable convolution in place of standard convolution, changing the convolution calculation from direct channel expansion to a combined calculation (single-channel depthwise convolution plus cross-channel pointwise convolution), realizing model parameter compression, constructing the lightweight convolutional neural network, saving computing resources, and reducing hardware storage.
Fourth, description of the drawings:
FIG. 1 is an internal structure diagram of the FL-CNN target detection recognition model of the present invention.
Fig. 2 is a channel-space attention module of the present invention.
FIG. 3 is a flow chart of the method of the present invention.
Fig. 4 shows the 44 common traffic signs for remote detection and identification in the present invention, divided into three categories: indication, warning, and prohibition; in the figure, a letter prefix represents a class of signs, where il: il100, il60, il80; ph: ph4, ph4.5, ph5; pm: pm20, pm30, pm55; pl: pl5, pl20, pl30, pl40, pl50, pl60, pl70, pl80, pl100, pl120.
Fig. 5 is a statistical chart comparing the detection and identification accuracy of the present invention and other methods on the 44 common traffic signs.
Fifth, specific embodiments:
the invention is further described below with reference to the accompanying drawings:
the method for detecting and identifying long-distance traffic signs suitable for a vehicle-mounted system effectively overcomes the defects of low identification accuracy, large models, and short detection distance, and provides a new target detection framework, named the attention-based lightweight convolutional neural network (FL-CNN). First, in the detection stage, the FL-CNN adopts depthwise separable convolution to reduce the number of model parameters, effectively realizing model compression; meanwhile, a channel-space attention module is devised, in which the original input feature map of the depthwise separable convolutional layer is compressed, masked, and expanded to construct an attention feature map with feature suppression or enhancement, improving the model's small-target detection capability. Second, in the identification stage of the FL-CNN, an attention context region mechanism is devised, which introduces local information of the traffic sign, enhances the classification features of the sign, and compensates for incomplete traffic sign regions from the detection stage. Third, in the training stage of the FL-CNN, the attention loss function distinguishes hard samples from easy ones, improving the training and generalization ability of the model. Finally, the model is trained on the remote traffic sign detection data set Tsinghua-Tencent 100K, jointly published by Tsinghua University and Tencent.
The method comprises the following specific steps:
step 1, preprocessing a traffic sign image sample set;
(1) adopting the Tsinghua-Tencent 100K data set jointly published by Tsinghua University and Tencent, and selecting 44 types of common traffic signs as the remote detection and identification objects;
(2) dividing the Tsinghua-Tencent 100K data set into a training set and a test set at a ratio of 1:2;
(3) to ensure sample balance during model training, each type of traffic sign must have more than 100 scene instances in the training set; if a certain type of sign has fewer than 100 scene instances, it is filled by a repeated sampling method.
Step 2, constructing a lightweight convolutional neural network to complete the convolutional characteristic extraction of the traffic sign;
(1) this step plays an important role in compressing the sign detection and identification model: the joint channel-space mapping of the standard convolution in the original VGG-16 is separated into independent channel and spatial mappings by depthwise separable convolution, effectively reducing the number of model parameters and the hard disk storage. The network comprises 5 convolutional layers in total, each consisting of a depthwise convolution part and a pointwise convolution part, with ReLU as the activation function;
(2) in the lightweight convolutional neural network, the computation of the depthwise separable convolution compares with that of the original standard convolution as follows:
let the convolution kernel be (D_K, D_K, C), where D_K is the width and height of the convolution kernel and C is the number of channels of the kernel. In the convolution calculation, depthwise separable convolution converts the N standard convolutions (D_K, D_K, M) of the original cross-channel calculation into M depthwise convolutions (D_K, D_K, 1) and N cross-channel pointwise convolutions (1, 1, M), where the depthwise convolution is a single-channel computation and the pointwise convolution is a cross-channel computation. The input feature map is recorded as {D_F, D_F, M} and the output feature map as {D_F, D_F, N}, where D_F represents the width and height of the feature map. The amount of computation for each convolution is as follows:
① the computation of the standard convolution is: Count_s = D_K × D_K × M × N × D_F × D_F;
② the computation of the depthwise convolution is: Count_d = D_K × D_K × M × D_F × D_F;
③ the computation of the pointwise convolution is: Count_p = M × N × D_F × D_F.
The comparison between the computation of depthwise separable convolution and standard convolution is therefore
(Count_d + Count_p) / Count_s = 1/N + 1/D_K²
that is, each use of depthwise separable convolution reduces the computation to (1/N + 1/D_K²) times that of the original standard convolution;
step 3, constructing an attention feature map through the channel-space attention module embedded in the lightweight convolutional neural network;
the main purpose of this step is to design the attention feature map, which imitates the human attention mechanism, enhances the convolutional features of small traffic signs in the scene, suppresses the features of irrelevant background information, saves computing resources, and improves detection accuracy. The invention therefore provides a channel-space attention module that can be embedded into the depthwise separable convolutional layers, applying feature attention (suppression or enhancement) along the channel and spatial dimensions to the output feature map of each layer. Channel attention uses the interrelation and importance of the channels to attend to "what" is most meaningful in the image; spatial attention, by contrast, attends to the location features of the target in the image, i.e. "where", which is more effective for image detection and identification.
Denote the output feature map of a depthwise separable convolutional layer as U = [u_1, u_2, …, u_C], u_i ∈ R^(H×W), where C is the number of channels and H and W are the height and width of the feature map. After passing through the channel-space attention module, the constructed attention feature map is Y = [y_1, y_2, …, y_C], y_i ∈ R^(H×W). The specific calculation is described in steps 3.1, 3.2 and 3.3.
Step 3.1 Channel attention
To compute channel attention, the invention first compresses the spatial dimensions of each channel of the feature map along the channel direction, aggregating the spatial information with three global pooling modes (max, average, and stochastic), where max and average pooling preserve the texture and background features of the image respectively, and stochastic pooling lies between the two.
First, by global max pooling, each u_i is compressed to a channel attention mask component s_max^i, defined as:
s_max^i = max_{1≤h≤H, 1≤w≤W} u_i(h, w)
Similarly, each u_i is compressed by global average pooling and global stochastic pooling to the channel attention mask components S_mean and S_sto, defined as:
s_mean^i = (1 / (H × W)) Σ_{h=1..H} Σ_{w=1..W} u_i(h, w)
with s_sto^i obtained by sampling an activation of u_i with probability proportional to its value, where S_max, S_mean, S_sto ∈ R^C.
Secondly, the three channel attention mask components constructed by pooling compression serve as inputs of the multilayer perceptron model; aggregation is completed through point-by-point multiplication of the weight parameters with the mask components, accumulation, and activation functions, further increasing the non-linearity. The channel attention mask S = [s_1, s_2, …, s_C] of the feature map U is defined as follows:
S = σ(W_1 δ(W_0 S_max) + W_1 δ(W_0 S_mean) + W_1 δ(W_0 S_sto))
where σ is the sigmoid function, δ is the ReLU function, and W_0 and W_1 are the weights of the multilayer perceptron model, shared by the three channel attention mask components.
Finally, the channel attention mask is expanded onto the original input feature map U, and the weight of each channel of U is recalibrated according to the mask. The new feature map after channel attention mapping is recorded as X = [x_1, x_2, …, x_C], x_i ∈ R^(H×W), specifically defined as:
x_i = s_i · u_i
step 3.2 spatial attention
In order to calculate the spatial attention mapping and construct the characteristic relationship between pixels or regions in the characteristic diagram, the channel number of the characteristic diagram is firstly compressed, and a group of pointwise convolutions are adopted to carry out the original input characteristic diagram of the depth separation convolution layer
Figure GDA0002114862600000142
Cross-channel aggregation, the aggregated feature map is recorded as
Figure GDA0002114862600000143
Secondly, because the characteristics of the feature maps of different layers are greatly different, the resolution of the shallow feature map is higher, and the deep feature map is opposite to the shallow feature map and contains more abstract semantic features. Therefore, when the spatial attention mask is constructed, in order to reduce parameters and calculation amount, the spatial attention mask is respectively carried out according to regions and pixels in the shallow layer characteristic diagram and the deep layer characteristic diagram. For convenience of description, the spatial attention mask is recorded as
Figure GDA0002114862600000144
It is defined as follows:
N=Softmax(Conv(M,o,k,s,p))
where Conv (·) represents a convolution operation, the output channel o is 1, the convolution kernel size k of the shallow convolution layer is 1, and the convolution kernel size k of the deep convolution layer is 3. s-1 and p-0 are the step size and padding of the convolution, respectively. In addition, in order to eliminate the influence of different feature map scales, the spatial attention mask is normalized by using a Softmax function.
Step 3.3 Attention feature map
On the basis of the feature map X after channel attention mapping, the expansion of the spatial attention mask N is performed again: the spatial attention of each channel of X is calibrated according to the mask, finally generating the output feature map Y of the depthwise separable convolutional layer, which serves as the input of the next depthwise separable convolutional layer. It is defined as:
y_i = N ⊙ x_i
where ⊙ denotes point-by-point multiplication.
Step 4, on the basis of the attention feature map, generating a candidate region of the target by adopting a region generation network RPN
The goal of this step is to locate the regions of the traffic scene where traffic signs may appear; the FL-CNN then classifies the signs according to these regions. The relevant details are as follows:
(1) on the attention feature map output by the last depthwise separable convolution, taking each pixel as an anchor point, and adopting 3 sizes (4, 8, 16) and 3 ratios (1:1, 1:2, 2:1) to generate 9 anchors on the original traffic scene;
(2) removing anchor boxes that exceed the boundary of the original input image;
(3) removing heavily overlapping anchor boxes by non-maximum suppression, with a threshold of 0.7;
(4) determining positive and negative samples from the intersection-over-union IoU of the anchor box with the true target in the sample, where IoU > 0.7 gives a positive sample, IoU < 0.3 gives a negative sample, and anchor boxes with IoU between 0.3 and 0.7 are likewise removed; IoU is calculated as follows:
IoU = area(A ∩ B) / area(A ∪ B)
where A is the anchor box and B is the real target box;
(5) according to translation invariance, each anchor box corresponds to a region suggestion box on the fused feature map;
(6) after all the region suggestion boxes pass through the fully-connected layer of the region generation network RPN, the target candidate regions are obtained.
Step 5, introducing context area information to a target candidate area generated by the RPN, and enhancing the classification characteristics of the marks;
the target candidate regions given in step 4 may contain only partial features of a traffic sign, so the invention introduces context region information, using the spatially neighboring features of each candidate region to enhance the classification features of the sign. The specific steps are as follows:
(1) for convenience of description, a target candidate region is denoted as p = (p_x, p_y, p_w, p_h), where (p_x, p_y) is the center of the region and (p_w, p_h) represents the width and height of the region; on the attention feature map output by the last depthwise separable convolution, scale factors λ_1 and λ_2 are used to create context regions whose center coordinates are the same as those of the corresponding target candidate region. The relationship between a context region and its candidate region can be described as follows, where i is the serial number of the context region:
g_i = (p_x, p_y, λ_i · p_w, λ_i · p_h), i = 1, 2;
(2) each target candidate region and its context regions are divided into 7 parts in the horizontal and vertical directions by RoI-Pooling, and max-pooling downsampling is performed on each part, so that even regions of different sizes yield outputs of consistent dimension, generating 3 feature vectors of fixed size 7 × 7 × 512.
(3) Along the spatial dimension, the feature vectors are concatenated into a 3 × 7 × 7 × 512 feature vector;
(4) finally, a 1 × 1 convolution compresses the feature vector formed in (3) to 7 × 7 × 512. This step makes the dimension of the context-enhanced feature vector meet the node requirements of the fully-connected layer. It is noted that the 1 × 1 convolution can learn the non-linear relation between background and target: when the introduced context region contains complex background, the convolution parameters suppress the background; conversely, if local features of the target are introduced, the convolution parameters strengthen those features.
Step 6, sending the feature vectors into a full-connection layer for classification and regression, and outputting the classes and positions of the traffic signs;
two fully-connected (FC) networks are adopted: the first FC network classifies the traffic sign, with a hidden layer of 4096 nodes and 44 output nodes, each output node representing one traffic sign with a value in the range (0, 1); at classification time the node with the largest output is taken as the traffic sign category. The second FC network performs position regression of the traffic sign, with a hidden layer of 4096 nodes and 4 output nodes representing the center-point coordinates, width, and height of the traffic sign.
Step 7, establishing a attention loss function and training the FL-CNN model
To ensure that the model is fully trained and to improve its generalization ability, the invention provides an attention loss function, which effectively distinguishes hard samples from easy ones, suppresses the loss of easily-classified samples, and enhances the loss of hard samples. The training of the FL-CNN model comprises two parts, the RPN and the fully-connected layer network, where the loss of the RPN consists of a binary classification loss and a regression loss, and the loss of the fully-connected network consists of a multi-classification loss and a regression loss.
(1) The attention loss function of the RPN network is as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_att(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*)
where p_i indicates the predicted probability that the i-th anchor is the target object and p_i* represents the real label of the object; t_i is a vector containing the center coordinates and the width and height information of the prediction box, and t_i* is the information vector of the real box; N_cls denotes the total number of anchors, N_reg denotes the size of the feature map, and λ is an adjustment coefficient; L_reg represents the regression loss of all the bounding boxes, and L_att is the attention binary classification loss.
The attention binary classification loss provided by the invention is specifically defined as follows:
L_att(x) = −p* · σ(−Kx) · log σ(x) − (1 − p*) · σ(Kx) · log σ(−x)
where σ is the sigmoid function, x is the predicted logit, −log σ(x) is the loss of a foreground sample, −log σ(−x) is the loss of a background sample, and K is a constant. The loss function has the following characteristic: if a sample is easily classified, −log σ(x) → 0 or −log σ(−x) → 0, that is, σ(x) → 1 or σ(−x) → 1; then, when K takes a large value, the loss adjustment coefficient σ(−Kx) → 0 in the foreground sample loss and the loss adjustment coefficient σ(Kx) → 0 in the background sample loss. If the sample is hard to classify, the loss adjustment coefficients of the foreground and background samples are σ(−Kx) → 1 and σ(Kx) → 1 respectively. The attention loss function provided by the invention therefore effectively distinguishes hard samples from easy ones and, by suppressing the loss of easy samples, makes RPN learning and training pay more attention to hard samples, effectively ensuring that the RPN network is sufficiently trained.
(2) The attention loss function of the fully-connected layer network is as follows:
L = (1/N_cls) Σ_k L_mcls(x_k) + λ · (1/N_reg) Σ L_reg
where δ is the softmax function. The loss function is similar to that of the RPN and comprises a multi-classification loss and a regression loss, where L_mcls(x_k) = −δ(−Kx_k) · log δ(x_k) is the attention multi-classification loss proposed by the invention. Its functional property is consistent with the attention binary classification loss: when the loss −log δ(x_k) of a sample tends to 0, its weight δ(−Kx_k) → 0; conversely, when −log δ(x_k) is large, the weight δ(−Kx_k) → 1.
Step 8, repeating steps 2 to 7 to complete the sample training of the FL-CNN model;
and step 9, starting the color camera to photograph the actual traffic scene, preprocessing the scene before inputting it to the model by setting the resolution to 2048 × 2048, inputting it into the FL-CNN model, and repeating steps 2 to 6 to complete the detection and identification of the traffic signs of the actual scene.
In the aspect of the model's learning algorithm, a new attention loss function is devised so that the model obtains full learning from the sample set. The structure of the model mainly comprises the following 5 components, of which (1), (2) and (4) differ markedly from other models and were each devised independently, aiming at the low power consumption and weak hardware performance of the vehicle-mounted system and at improving detection and identification accuracy:
(1) Lightweight convolutional neural network: a convolutional neural network constructed with depthwise separable convolution, which reduces the number of model parameters, realizes model compression, and automatically extracts the convolutional features of the traffic signs layer by layer from the traffic scene;
(2) channel-space attention module: a technology specific to long-distance traffic sign detection and an important component of the model; it can be embedded into each convolutional layer of the lightweight convolutional neural network to construct the attention feature map, realizing suppression or enhancement of information in the original feature map, effectively saving the model's computing resources and improving the detection capability of the method;
(3) region generation network (RPN): generates a certain number of target suggestion regions from the attention feature map output by the last depthwise separable convolutional layer;
(4) context region pooling layer: a technology that effectively enhances the information capacity of the target region and an important component of the model; it establishes the classification features of the attended context region information through region pooling, regularization, concatenation, and compression;
(5) fully-connected network (FC): mainly responsible for the specific classification and position calculation on the compressed attention context region features.
Example:
the resolution of each traffic scene in the model training data set Tsinghua-Tencent 100K is 2048 × 2048 pixels, and traffic signs of (0, 32] and (32, 96] pixels account for 41.6% and 49.1% of the data set respectively; that is, 90.7% of the traffic signs occupy less than 1% of the traffic scene, which belongs to the situation of long-distance traffic sign detection and identification.
(1) Data set processing: in the training of the FL-CNN model adopted by the invention, to keep the sample set balanced, a resampling method is applied to any class of traffic sign with fewer than 100 traffic scenes. The ratio of the training set to the test set is 1:2;
(2) comparison index: in the test, the accuracy measure F1-measure, commonly used to evaluate precision, is taken as the detection and identification index; the higher its value, the higher the detection and identification accuracy;
(3) comparison models: to verify the effectiveness of the FL-CNN, its detection accuracy is compared with Fast R-CNN and Faster R-CNN, currently the most common target detection frameworks;
(4) in addition, to verify in detail the detection accuracy for traffic signs at different distances, the detection accuracy of the invention is compared for traffic signs of different sizes (different distances), covering three size ranges, (0, 32], (32, 96] and (96, 200] pixels, where (0, 32] and (32, 96] pixel signs belong to long distance and (96, 200] pixel signs to medium distance.
FL-CNN model training and field test conditions (the test verifies the feasibility of steps 1 to 8 of the method):
Firstly, the hard disk storage occupied by each model after export is compared. Fast R-CNN and Faster R-CNN are both based on the VGG-16 network, and their exported parameters occupy 558 Mb and 582 Mb respectively, while the FL-CNN model of the invention occupies only 76 Mb. Compared with Fast R-CNN and Faster R-CNN, the storage of the FL-CNN model is reduced by about 80%, so it can run directly on a vehicle-mounted system with low power consumption and hardware performance. The depthwise separable convolution adopted by the invention thus effectively reduces the number of model parameters and the hard disk storage. Meanwhile, the channel-space attention module and the attention context region of the invention have little influence on the storage of the model and effectively save computing resources.
Second, the traffic sign detection and recognition accuracy of FL-CNN is compared with the common target detection frameworks Fast R-CNN and Faster R-CNN. As Table 1 shows, the detection and recognition precision at all three distances is clearly improved: compared with Fast R-CNN, the precision on (0, 32] pixel signs increases by 55 percentage points, the precision on (32, 96] and (96, 200] pixel signs increases by 20 and 9 percentage points respectively, and the overall average increases by 28 percentage points, which fully illustrates the effectiveness of the three attention mechanisms (channel-spatial attention, context attention and loss-function attention).
Then, to further verify the influence of the three attention mechanisms on the detection and recognition accuracy of the FL-CNN model, Table 1 also compares the accuracy after removing each attention mechanism, where "-" indicates that the mechanism is removed from the model. The results show that removing any of the attention mechanisms degrades the detection and recognition accuracy, with the channel-spatial attention having the largest influence. Moreover, once all three mechanisms are removed, the FL-CNN model degenerates into the Faster R-CNN model.
TABLE 1 comparison of detection and recognition accuracy (%)
(Table 1 appears as an image in the original publication: per-distance detection and recognition accuracy of each framework and of each FL-CNN ablation.)
Finally, FIG. 5 shows a statistical comparison of the recognition accuracy of the 44 traffic sign classes under the different detection and recognition frameworks; the detection and recognition accuracy of FL-CNN is higher than that of Fast R-CNN and Faster R-CNN on every traffic sign class.

Claims (3)

1. A remote traffic sign detection and identification method suitable for a vehicle-mounted system is characterized by comprising the following steps:
firstly, preprocessing a traffic sign image sample set;
constructing a lightweight convolutional neural network to complete the convolutional feature extraction of the traffic sign;
(1) depthwise separable convolution is used to decompose the joint channel-spatial mapping of the original VGG-16 standard convolution into separate channel and spatial mappings, reducing the number of model parameters and the hard disk storage; the lightweight convolutional neural network contains 5 convolutional layers in total, each consisting of a depthwise convolution and a pointwise convolution, with ReLU as the activation function;
(2) in the lightweight convolutional neural network, the computation amount of the deep separation convolution and the original standard convolution is compared as follows:
let the convolution kernel be (D_K, D_K, C), where D_K is the width and height of the kernel and C is its number of channels; in the convolution calculation, depthwise separable convolution converts the N cross-channel standard convolutions of size (D_K, D_K, M) into M depthwise convolutions of size (D_K, D_K, 1) and N cross-channel pointwise convolutions of size (1, 1, M), where a depthwise convolution operates on a single channel and a pointwise convolution operates across channels; denote the input feature map as {D_F, D_F, M} and the output feature map as {D_F, D_F, N}, where D_F is the width and height of the feature map; the computation of each convolution is then:
the computation of the standard convolution: Count_s = D_K × D_K × M × N × D_F × D_F;
the computation of the depthwise convolution: Count_d = D_K × D_K × M × D_F × D_F;
the computation of the pointwise convolution: Count_p = M × N × D_F × D_F;
therefore, the ratio of the computation of depthwise separable convolution to that of standard convolution is
(Count_d + Count_p) / Count_s = 1/N + 1/D_K²,
i.e., each use of depthwise separable convolution reduces the computation to 1/N + 1/D_K² of the original standard convolution; a sketch of such a layer follows;
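As a concrete illustration, a depthwise separable layer can be written in a few lines of PyTorch; this is a minimal sketch (the kernel size, stride and the placement of ReLU after both sub-convolutions are assumptions beyond what the claim states):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # Replaces a standard KxK convolution (M -> N channels) with a depthwise KxK
    # convolution (groups = M) followed by a 1x1 pointwise convolution, cutting
    # the computation to roughly 1/N + 1/K^2 of the standard convolution.
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, stride, padding=k // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.pointwise(self.relu(self.depthwise(x))))
```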
constructing an attention feature map through the channel-spatial attention module embedded into the lightweight convolutional neural network;
the channel-spatial attention module is embedded into the depthwise separable convolutional layers and applies feature attention in both the channel and spatial dimensions to the output feature map of each depthwise separable convolutional layer; channel attention exploits the interrelation and relative importance of the channels and is most meaningful for attending to "what" is in an image, while spatial attention focuses on the location features of the target in the image, i.e. "where", which is more effective for image detection and recognition;
denote the output feature map of a depthwise separable convolutional layer as U = [u_1, u_2, …, u_C], u_i ∈ ℝ^{H×W}, where C is the number of channels and H and W are the height and width of the feature map; after the channel-spatial attention module, the constructed attention feature map is Y = [y_1, y_2, …, y_C], y_i ∈ ℝ^{H×W}; the calculation proceeds as follows:
step 3.1 channel attention
firstly, the spatial dimension of each channel of the feature map is compressed along the channel direction, and the spatial information is aggregated with three global pooling modes: maximum, average and stochastic; maximum and average pooling retain the texture and background features of the image respectively, while stochastic pooling lies between the two;
first, according to global maximum pooling, each u_i is compressed to the channel attention mask component s_max^i ∈ ℝ, defined as:
s_max^i = max_{1≤h≤H, 1≤w≤W} u_i(h, w);
similarly, each u_i is compressed to the channel attention mask components s_mean^i and s_sto^i according to global average pooling and global stochastic pooling, defined as:
s_mean^i = (1 / (H × W)) Σ_{h=1}^{H} Σ_{w=1}^{W} u_i(h, w),
s_sto^i = u_i(h', w'), where the location (h', w') is sampled according to the probabilities r_{h,w} = u_i(h, w) / Σ_{h=1}^{H} Σ_{w=1}^{W} u_i(h, w);
secondly, the three channel attention mask components constructed by pooling compression are used as inputs of a multilayer perceptron model, and aggregation is completed through point-wise multiplication of the weight parameters with the mask components, accumulation and activation functions, which further increases the nonlinearity; the channel attention mask S = [s_1, s_2, …, s_C] of the feature map U is defined as:
S = σ(W_1 δ(W_0 S_max) + W_1 δ(W_0 S_mean) + W_1 δ(W_0 S_sto)),
where σ is the sigmoid function, δ is the ReLU function, and W_0 and W_1 are the weights of the multilayer perceptron model, shared by the three channel attention mask components;
finally, the channel attention mask is expanded to the original input feature map U, and the weight of each channel of U is recalibrated according to the mask; the new feature map after channel attention mapping is denoted X = [x_1, x_2, …, x_C], x_i ∈ ℝ^{H×W}, and is defined as:
x_i = s_i · u_i;
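A PyTorch-style sketch of this channel attention step is given below; the MLP reduction ratio and the sampling-based implementation of stochastic pooling are assumptions, since the text does not fix them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    # Compresses each channel by max, average and stochastic global pooling,
    # passes the three mask components through a shared two-layer MLP (W0, W1),
    # sums them and applies a sigmoid to obtain the channel mask S (step 3.1).
    def __init__(self, channels, reduction=16):  # the reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    @staticmethod
    def _stochastic_pool(u):
        # Sample one location per channel with probability proportional to activation.
        b, c, h, w = u.shape
        flat = u.flatten(2).clamp(min=0) + 1e-12           # (B, C, H*W), non-negative
        probs = flat / flat.sum(dim=2, keepdim=True)
        idx = torch.multinomial(probs.view(b * c, -1), 1)  # sampled location per channel
        return flat.view(b * c, -1).gather(1, idx).view(b, c)

    def forward(self, u):
        s_max = self.mlp(F.adaptive_max_pool2d(u, 1).flatten(1))
        s_mean = self.mlp(F.adaptive_avg_pool2d(u, 1).flatten(1))
        s_sto = self.mlp(self._stochastic_pool(u))
        s = torch.sigmoid(s_max + s_mean + s_sto)          # channel attention mask S
        return u * s.unsqueeze(-1).unsqueeze(-1)           # recalibrated feature map X
```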
step 3.2 spatial attention
in order to compute the spatial attention mapping and construct the feature relationships between pixels or regions in the feature map, the number of channels of the feature map is first compressed: a group of pointwise convolutions aggregates the original input feature map of the depthwise separable convolutional layer, U ∈ ℝ^{C×H×W}, across channels, and the aggregated feature map is denoted M ∈ ℝ^{1×H×W};
secondly, because the characteristics of feature maps at different layers differ greatly (shallow feature maps have higher resolution, while deep feature maps, on the contrary, contain more abstract semantic features), the spatial attention mask is computed by region in the shallow feature maps and by pixel in the deep feature maps in order to reduce parameters and computation; the spatial attention mask N ∈ ℝ^{1×H×W} is defined as:
N = Softmax(Conv(M, o, k, s, p)),
where Conv(·) denotes a convolution operation with output channel o = 1; the convolution kernel size is k = 1 for the shallow convolutional layers and k = 3 for the deep convolutional layers; s = 1 and p = 0 are the stride and padding of the convolution; furthermore, to eliminate the influence of different feature map scales, the spatial attention mask is normalized with the Softmax function;
step 3.3 attention feature map
on the basis of the feature map X ∈ ℝ^{C×H×W} obtained after channel attention mapping, the spatial attention mask N ∈ ℝ^{1×H×W} is applied again: each channel of the feature map X is recalibrated according to the spatial attention mask, finally generating the output feature map Y ∈ ℝ^{C×H×W} of the depthwise separable convolutional layer, which serves as the input of the next depthwise separable convolutional layer; it is defined as:
y_i = N ⊗ x_i,
where ⊗ denotes point-wise multiplication;
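The spatial branch and the final recalibration can likewise be sketched in PyTorch; the padding used here is an assumption (the text states p = 0, but the sketch pads so the mask keeps the input's spatial size):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # A pointwise convolution aggregates channels into M; a small convolution plus
    # a Softmax over all positions produces the spatial mask N; the channel-attended
    # map X is then recalibrated by broadcasting N over its channels (step 3.3).
    def __init__(self, channels, k=3):  # k = 1 for shallow layers, k = 3 for deep layers
        super().__init__()
        self.aggregate = nn.Conv2d(channels, 1, kernel_size=1)
        self.conv = nn.Conv2d(1, 1, kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        m = self.aggregate(x)                                     # M: (B, 1, H, W)
        n = self.conv(m)                                          # spatial logits
        b, _, h, w = n.shape
        n = torch.softmax(n.view(b, -1), dim=1).view(b, 1, h, w)  # normalized mask N
        return x * n                                              # Y: broadcast over channels
```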
fourthly, on the basis of the attention feature map, a region proposal network RPN is adopted to generate candidate regions of the target,
locating the areas of the traffic scene where traffic signs may appear; the FL-CNN then classifies the signs according to these areas;
introducing context region information into the target candidate regions generated by the RPN, and enhancing the sign classification features;
since a target candidate region given in step four may contain only part of the features of the traffic sign, the spatially neighboring features of the candidate region are introduced to enhance the classification features, specifically as follows:
(1) for convenience of description, a target candidate region is denoted p = (p_x, p_y, p_w, p_h), where (p_x, p_y) is the center of the region and (p_w, p_h) are its width and height; on the attention feature map output by the last depthwise separable convolution, scale factors λ_1 and λ_2 are used to create context regions q^i = (q_x^i, q_y^i, q_w^i, q_h^i) whose center coordinates are the same as those of the corresponding target candidate region; the relationship between a context region and its candidate region can be described as follows, where i is the index of the context region:
q_x^i = p_x,  q_y^i = p_y,  q_w^i = λ_i · p_w,  q_h^i = λ_i · p_h;
(2) each target candidate region and its context regions are divided into 7 × 7 parts in the horizontal and vertical directions by RoI-Pooling, and each part is down-sampled by max pooling, so that even regions of different sizes yield outputs of consistent dimension, generating 3 feature vectors of fixed size 7 × 7 × 512;
(3) the 3 feature vectors are concatenated along the channel dimension into a 7 × 7 × (3 × 512) feature;
(4) a 1 × 1 convolution compresses the concatenated feature of (3) back to 7 × 7 × 512, so that the dimension of the feature vector with the introduced context regions meets the node requirements of the fully connected layer; the 1 × 1 convolution also learns the nonlinear relationship between background and target: when an introduced context region contains complex background, the convolution parameters suppress the background, and conversely, if local features of the target are introduced, the convolution parameters strengthen those features; a sketch of this pooling scheme is given below;
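A sketch of the context-region pooling under stated assumptions: the scale factors λ_i are not given numerically in the text, so 1.5 and 2.0 below are placeholders, torchvision's roi_pool stands in for the RoI-Pooling step, and the spatial_scale default is an assumption about the backbone stride:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class ContextRegionPooling(nn.Module):
    # Pools the proposal itself (scale 1.0) plus two enlarged context regions to
    # 7x7, concatenates the three results along channels, and compresses back to
    # 512 channels with a 1x1 convolution.
    def __init__(self, channels=512, scales=(1.0, 1.5, 2.0)):
        super().__init__()
        self.scales = scales
        self.compress = nn.Conv2d(channels * len(scales), channels, kernel_size=1)

    def forward(self, feat, rois, spatial_scale=1.0 / 16):
        # rois: (R, 5) tensor of [batch_index, x1, y1, x2, y2] in image coordinates.
        pooled = []
        for s in self.scales:
            cx = (rois[:, 1] + rois[:, 3]) / 2
            cy = (rois[:, 2] + rois[:, 4]) / 2
            w = (rois[:, 3] - rois[:, 1]) * s
            h = (rois[:, 4] - rois[:, 2]) * s
            boxes = torch.stack(
                [rois[:, 0], cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
            pooled.append(roi_pool(feat, boxes, output_size=(7, 7),
                                   spatial_scale=spatial_scale))
        return self.compress(torch.cat(pooled, dim=1))  # (R, 512, 7, 7)
```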
step six, sending the feature vector into the fully connected layers for classification and regression, and outputting the class and position of the traffic sign;
two fully connected (FC) networks are adopted: the first FC network classifies the traffic sign, with a hidden layer of 4096 nodes and 44 output nodes, each output node representing one traffic sign class with a value in (0, 1), and the node with the largest output giving the traffic sign class; the second FC network performs position regression of the traffic sign, with a hidden layer of 4096 nodes and 4 output nodes representing the center coordinates, width and height of the traffic sign; a sketch of these two heads follows;
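A minimal sketch of the two FC heads; the input dimension follows the 7 × 7 × 512 feature above, and the node counts mirror those stated in the claim:

```python
import torch.nn as nn

class DetectionHeads(nn.Module):
    # First FC network: 44-way sign classifier; second FC network: 4-value box
    # regressor (center coordinates, width, height); both with 4096 hidden nodes.
    def __init__(self, in_dim=7 * 7 * 512, hidden=4096, num_classes=44):
        super().__init__()
        self.cls = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, hidden),
                                 nn.ReLU(inplace=True), nn.Linear(hidden, num_classes))
        self.reg = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, hidden),
                                 nn.ReLU(inplace=True), nn.Linear(hidden, 4))

    def forward(self, x):
        return self.cls(x), self.reg(x)
```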
step seven, establishing an attention loss function and training the FL-CNN model
to ensure the model is fully trained and to improve its generalization ability, an attention loss function is established that effectively distinguishes hard samples from easy ones, suppressing the loss of easily separable samples and enhancing the loss of hard samples; FL-CNN training involves two parts, the RPN network and the fully connected network: the RPN loss consists of a binary classification loss and a regression loss, and the fully connected network loss consists of a multi-class classification loss and a regression loss;
(1) the attention loss function of the RPN network is as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_att(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*),
where p_i denotes the predicted probability that the i-th anchor is the target object, p_i* is the real label of the object, t_i is a vector containing the center coordinates, width and height of the prediction box, t_i* is the information vector of the real box, N_cls denotes the total number of anchors, N_reg is the size of the feature map, λ is a balancing coefficient, and L_reg denotes the regression loss of all the bounding boxes; L_att(·) is the attention binary classification loss, specifically defined as:
L_att(x) = -σ(-Kx) · p* · log σ(x) - σ(Kx) · (1 - p*) · log σ(-x),
where σ is the sigmoid function, the loss of a foreground sample is -log σ(x), the loss of a background sample is -log σ(-x), and K is a constant; the loss function has the following property: if a sample is easily separable, σ(x) → 1 for a foreground sample or σ(-x) → 1 for a background sample, so that for a sufficiently large K the modulating coefficient σ(-Kx) → 0 in the foreground sample loss and the modulating coefficient σ(Kx) → 0 in the background sample loss; if a sample is hard to separate, the modulating coefficients of the foreground and background samples satisfy σ(-Kx) → 1 and σ(Kx) → 1 respectively; the attention loss function therefore effectively distinguishes hard and easy samples, and by suppressing the loss of easy samples it makes the RPN learning and training pay more attention to hard samples, ensuring the RPN network is fully trained;
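The binary attention loss can be written directly from the definition above; the value of K is not specified in the text, so K = 10.0 below is only a placeholder:

```python
import torch

def attention_binary_loss(x, target, K=10.0):
    # x: raw logits of the anchors; target: 1.0 for foreground, 0.0 for background.
    # sigma(-Kx) and sigma(Kx) act as modulating coefficients: near 0 for easy
    # samples (suppressed loss), near 1 for hard samples (enhanced loss).
    sig = torch.sigmoid
    eps = 1e-12
    fg = -sig(-K * x) * torch.log(sig(x) + eps)   # foreground term
    bg = -sig(K * x) * torch.log(sig(-x) + eps)   # background term
    return (target * fg + (1.0 - target) * bg).mean()
```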
(2) the attention loss function of the fully connected layer network is as follows:
L = (1/N_cls) Σ L_att_m(x) + λ · (1/N_reg) Σ L_reg,
where δ is the softmax function; the loss function is similar to that of the RPN, comprising a multi-class classification loss and a regression loss, in which the attention multi-class loss
L_att_m(x) = -δ(-Kx)_k · log δ(x)_k
(k being the index of the true class) behaves consistently with the attention binary classification loss: when a sample's prediction loss -log δ(x)_k → 0 (an easy sample), its weight δ(-Kx)_k → 0, and conversely, for a hard sample with a large prediction loss, the weight δ(-Kx)_k → 1;
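A matching sketch of the multi-class variant, with the same caveat that K is an assumed placeholder:

```python
import torch
import torch.nn.functional as F

def attention_multiclass_loss(logits, target, K=10.0):
    # logits: (B, C) class scores; target: (B,) true class indices.
    # softmax(-K * logits) at the true class plays the role of the modulating
    # weight: near 0 for confidently correct (easy) samples, near 1 for hard ones.
    log_p = F.log_softmax(logits, dim=1)
    weight = F.softmax(-K * logits, dim=1)
    idx = target.unsqueeze(1)
    return (-weight.gather(1, idx) * log_p.gather(1, idx)).mean()
```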
step eight, repeating steps two to seven to complete the sample training of the FL-CNN model;
step nine, starting the color camera to photograph the actual traffic scene; the scene is preprocessed before being input to the model, its resolution set to 2048 × 2048, and then input into the FL-CNN model, repeating steps two to six to complete the detection and recognition of traffic signs in the actual scene.
2. The method for detecting and identifying remote traffic signs suitable for use in vehicle-mounted systems according to claim 1, wherein: the first step is specifically as follows:
(1) adopting the Tsinghua-Tencent 100K data set jointly published by Tsinghua University and Tencent, and selecting 44 classes of common traffic signs as the long-distance detection and recognition objects;
(2) dividing the Tsinghua-Tencent 100K data set into a training set and a testing set in the proportion of 1:2;
(3) to ensure sample balance during FL-CNN model training, the scene instances of each traffic sign class in the training set must exceed 100; if the scene instances of a sign class are fewer than 100, they are filled by a repeated sampling method.
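A small sketch of this repeated-sampling step, assuming a hypothetical record format of (scene_id, sign_class) pairs:

```python
import random
from collections import defaultdict

def oversample_scenes(scenes, min_count=100):
    # scenes: iterable of (scene_id, sign_class) records (a hypothetical format).
    # Classes observed in fewer than min_count scenes are padded by repeated
    # sampling of their existing scenes, as required by the claim.
    by_class = defaultdict(list)
    for scene in scenes:
        by_class[scene[1]].append(scene)
    balanced = list(scenes)
    for cls_items in by_class.values():
        shortfall = min_count - len(cls_items)
        if shortfall > 0:
            balanced.extend(random.choices(cls_items, k=shortfall))
    return balanced
```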
3. The method for detecting and identifying remote traffic signs suitable for use in vehicle-mounted systems according to claim 2, wherein: the fourth step is specifically as follows:
(1) on the attention feature map output by the last depthwise separable convolution, each pixel is taken as an anchor point, and 9 anchors are generated on the original traffic scene using 3 aspect ratios (1:1, 1:2 and 2:1) and 3 scales (4, 8 and 16);
(2) removing anchor boxes that exceed the boundaries of the original input image;
(3) removing heavily overlapping anchor boxes by non-maximum suppression with a threshold of 0.7;
(4) determining positive and negative samples from the intersection-over-union IoU between each anchor box and the true target in the sample, where anchors with IoU > 0.7 are positive samples, anchors with IoU < 0.3 are negative samples, and anchors with IoU between 0.3 and 0.7 are removed; IoU is calculated as follows (see the sketch after this list):
IoU(A, B) = area(A ∩ B) / area(A ∪ B);
(5) by translation invariance, each anchor box corresponds to a region proposal box on the attention feature map;
(6) all region proposal boxes pass through the fully connected layer of the region generation network RPN to obtain the target candidate regions.
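Step (4) can be illustrated with a plain-Python sketch of the IoU computation and the anchor labeling rule:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); IoU = area(A ∩ B) / area(A ∪ B).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes):
    # IoU > 0.7 with any ground truth -> positive (1); IoU < 0.3 with all
    # ground truths -> negative (0); otherwise the anchor is removed (-1).
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > 0.7:
        return 1
    if best < 0.3:
        return 0
    return -1
```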