CN110188705B - Remote traffic sign detection and identification method suitable for vehicle-mounted system - Google Patents


Info

Publication number
CN110188705B
CN110188705B (application CN201910474059.7A)
Authority
CN
China
Prior art keywords
attention
convolution
channel
loss
sample
Prior art date
Legal status
Active
Application number
CN201910474059.7A
Other languages
Chinese (zh)
Other versions
CN110188705A (en)
Inventor
刘志刚
杜娟
田枫
韩玉祥
高雅田
张可佳
Current Assignee
Northeast Petroleum University
Original Assignee
Northeast Petroleum University
Priority date
Filing date
Publication date
Application filed by Northeast Petroleum University
Priority to CN201910474059.7A
Publication of CN110188705A
Application granted
Publication of CN110188705B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of traffic signs

Abstract

The invention relates to a remote traffic sign detection and identification method suitable for a vehicle-mounted system, which comprises the following steps: 1. preprocess the traffic sign image sample set; 2. construct a lightweight convolutional neural network to complete the convolutional feature extraction of the traffic signs; 3. construct an attention feature map through a channel-space attention module embedded in the lightweight convolutional neural network; 4. generate candidate regions of the target with the region generation network RPN; 5. introduce context region information into the target candidate regions generated by the RPN to enhance the sign classification features; 6. send the feature vectors into the fully-connected layer and output the classes and positions of the traffic signs; 7. establish an attention loss function and train the FL-CNN model; 8. repeat steps 2 to 7 to complete the sample training of the FL-CNN model; 9. repeat steps 2 to 6 to complete the detection and identification of traffic signs in the actual scene. The invention realizes detection and identification of long-distance traffic signs with an accuracy reaching 92%.

Description

Remote traffic sign detection and identification method suitable for vehicle-mounted system
First, technical field:
the invention relates to the field of intelligent transportation oriented to unmanned and assisted driving, provides a long-distance detection and identification method for road traffic signs, and particularly relates to a long-distance traffic sign detection and identification method suitable for a vehicle-mounted system.
Second, background art:
in the field of intelligent transportation, traffic sign detection and identification is an important research problem for unmanned driving, assisted driving, and other systems. Much research has been carried out at home and abroad, but the methods still have great shortcomings and cannot be applied in practice, for the following reasons: (1) traditional detection and identification methods designed around features such as color and shape have poor robustness in the face of sign deformation, motion blur, weather, and other conditions of actual traffic scenes, and are difficult to apply in practice; (2) in existing detection and identification methods based on deep learning, the parameter files exported from the model are huge and require large memory and hard disk storage at run time, so these methods cannot run directly on a vehicle-mounted system with low power consumption and hardware performance, and their practicability is poor; (3) in terms of data sets, some methods directly use data sets shot by the authors themselves, in which the amount of data and the variation of the signs are small, so the models are difficult to apply in practice; other methods are based on the published German traffic sign data sets GTSRB and GTSDB, but the signs in these data sets are large and the number of detected sign types is small. Since the data set has a great influence on the performance of the model, all of these methods currently belong to short-distance detection and identification.
The vehicle-mounted system in unmanned and assisted driving is an embedded system, so its power consumption and hardware performance are low and a huge deep learning model cannot be run on it directly. Meanwhile, detecting and identifying signs at long distance provides more response time for driving and plays an important role in improving the safety of intelligent driving. Technically, long-distance traffic sign detection and identification is a problem of small-target recognition against a complex background, which is currently a difficult problem in the field of computer vision, and existing methods find it difficult to obtain high detection and identification accuracy.
Third, summary of the invention:
the invention aims to provide a long-distance traffic sign detection and identification method suitable for a vehicle-mounted system, to solve the problem of low accuracy when existing short-distance detection and identification methods are applied to long-distance traffic signs.
The technical scheme adopted by the invention for solving the technical problems is as follows: the remote traffic sign detection and identification method suitable for the vehicle-mounted system comprises the following steps:
step 1, preprocessing a traffic sign image sample set;
step 2, constructing a lightweight convolutional neural network to complete the convolutional feature extraction of the traffic sign;
(1) the joint channel-space mapping of the original VGG-16 standard convolution is separated into independent channel and spatial mappings by depthwise separable convolution, reducing the number of model parameters and the hard disk storage; the lightweight convolutional neural network comprises 5 convolutional layers in total, each convolutional layer comprising a depthwise convolution part and a pointwise convolution part, with ReLU as the activation function;
(2) in the lightweight convolutional neural network, the computation of the depthwise separable convolution compares with that of the original standard convolution as follows:
let the convolution kernel be (D_K, D_K, C), where D_K is the width and height of the convolution kernel and C is the number of channels of the kernel; in the convolution calculation, depthwise separable convolution converts the N standard convolutions (D_K, D_K, M) of the original cross-channel calculation into M depthwise convolutions (D_K, D_K, 1) and N cross-channel pointwise convolutions (1, 1, M), where the depthwise convolution is a single-channel computation and the pointwise convolution is a cross-channel computation; the input feature map is recorded as {D_F, D_F, M} and the output feature map as {D_F, D_F, N}, where D_F represents the width and height of the feature map; the amount of computation for each convolution is as follows:
① the computation of the standard convolution is: Count_s = D_K × D_K × M × N × D_F × D_F;
② the computation of the depthwise convolution is: Count_d = D_K × D_K × M × D_F × D_F;
③ the computation of the pointwise convolution is: Count_p = M × N × D_F × D_F.
The comparison between the computation of depthwise separable convolution and standard convolution is therefore
(Count_d + Count_p) / Count_s = 1/N + 1/D_K²
that is, each use of depthwise separable convolution reduces the computation to (1/N + 1/D_K²) times that of the original standard convolution;
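To make the computation comparison concrete, the following sketch (in PyTorch; layer sizes and names are illustrative assumptions, not taken from the patent) builds one depthwise separable convolution layer, a depthwise convolution followed by a pointwise convolution with ReLU activations, and counts its parameters against a standard convolution:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """One convolutional layer of the lightweight network: a depthwise convolution
    followed by a pointwise convolution, each activated by ReLU."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # M depthwise kernels (k, k, 1): groups=in_ch makes each a single-channel computation
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
        # N pointwise kernels (1, 1, M): the cross-channel computation
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.pointwise(self.relu(self.depthwise(x))))

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv2d(128, 256, 3, padding=1)        # D_K = 3, M = 128, N = 256
separable = DepthwiseSeparableConv(128, 256, 3)
print(count_params(standard), count_params(separable))  # 295168 vs 34304 parameters
# Predicted computation ratio: 1/N + 1/D_K^2 = 1/256 + 1/9 ≈ 0.115
```

With M = 128, N = 256 and D_K = 3, the predicted ratio 1/N + 1/D_K² ≈ 0.115 matches the roughly 8.6× parameter reduction seen above.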
step 3, constructing an attention feature map through the channel-space attention module embedded in the lightweight convolutional neural network;
the channel-space attention module is embedded into the depthwise separable convolutional layers, and feature attention along two dimensions, channel and spatial, is applied to the output feature map of each depthwise separable convolutional layer; channel attention uses the interrelation and relative importance of the channels to attend to "what" is most meaningful in the image, while spatial attention attends to the location features of the target in the image, i.e. "where", which is more effective for image detection and identification;
denote the output feature map of a depthwise separable convolutional layer as U = [u_1, u_2, …, u_C], u_i ∈ R^(H×W), where C is the number of channels and H and W are the height and width of the feature map. After passing through the channel-space attention module, the constructed attention feature map is Y = [y_1, y_2, …, y_C], y_i ∈ R^(H×W). The calculation proceeds as follows:
step 3.1 channel attention
For calculating the channel attention, firstly, compressing the spatial dimension of each channel in a feature map along the channel direction, and aggregating spatial information by using three global pooling modes of maximum, average and random respectively, wherein the maximum and average pooling respectively retains the texture and background features of an image, and the random pooling is between the maximum and average pooling;
first, according to the global maximum pooling, each u is respectivelyiCompression to channel attention mask component
Figure GDA0002114862600000034
It is defined as:
Figure GDA0002114862600000035
compressing each u according to global average pooling and global random poolingiTo channel attention force mask component SmeanAnd SstoWherein it is defined as:
Figure GDA0002114862600000036
Figure GDA0002114862600000037
wherein
Figure GDA0002114862600000038
Secondly, three channel attention force mask components constructed by pooling compression are respectively used as the input of a multilayer perceptron model, and through point-by-point phase of weight parameters and the mask componentsThe multiply, accumulate, and activate functions complete the aggregation, further increasing the non-linearity. Channel attention mask S ═ S for feature graph U1,s2,…,sC]Is defined as follows:
S=σ(W1δ(W0Smax)+W1δ(W0Smean)+W1δ(W0Ssto))
where σ is the sigmoid function and δ is the ReLu function. W0And W1These parameters are shared for the three channel attention mask components, which are weights of the multi-layered perceptron model;
finally, the channel attention mask is expanded onto the original input feature map U, and the weight of each channel of U is recalibrated according to the mask. The new feature map after channel attention mapping is recorded as X = [x_1, x_2, …, x_C], x_i ∈ R^(H×W), and is specifically defined as:
x_i = s_i · u_i
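A minimal sketch of the channel attention described above, assuming a PyTorch implementation; the reduction ratio of the shared MLP and the exact stochastic-pooling sampling are assumptions, since the patent does not fix them in this text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """S = sigmoid(W1·relu(W0·S_max) + W1·relu(W0·S_mean) + W1·relu(W0·S_sto)),
    with the MLP weights W0, W1 shared across the three pooled components."""
    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio is an assumption
        super().__init__()
        self.w0 = nn.Linear(channels, channels // reduction)  # W0
        self.w1 = nn.Linear(channels // reduction, channels)  # W1

    def _stochastic_pool(self, u: torch.Tensor) -> torch.Tensor:
        # Sample one activation per channel, with probability proportional to its value
        b, c, _, _ = u.shape
        flat = u.flatten(2).clamp_min(1e-6)                   # (B, C, H*W); ReLU outputs are >= 0
        probs = flat / flat.sum(dim=2, keepdim=True)
        idx = torch.multinomial(probs.view(b * c, -1), 1).view(b, c, 1)
        return flat.gather(2, idx).squeeze(2)                 # (B, C)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        s_max = F.adaptive_max_pool2d(u, 1).flatten(1)        # S_max: global max pooling
        s_mean = F.adaptive_avg_pool2d(u, 1).flatten(1)       # S_mean: global average pooling
        s_sto = self._stochastic_pool(u)                      # S_sto: global stochastic pooling
        mlp = lambda s: self.w1(F.relu(self.w0(s)))           # W1 · delta(W0 · s)
        s = torch.sigmoid(mlp(s_max) + mlp(s_mean) + mlp(s_sto))
        return u * s.unsqueeze(-1).unsqueeze(-1)              # X: each channel u_i scaled by s_i
```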
step 3.2 spatial attention
to compute the spatial attention mapping and construct the feature relations between pixels or regions of the feature map, the number of channels of the feature map is first compressed: a set of pointwise convolutions aggregates the original input feature map of the depthwise separable convolutional layer across channels, and the aggregated feature map is recorded as M ∈ R^(H×W);
secondly, because feature maps of different layers differ greatly in character (shallow feature maps have higher resolution, while deep feature maps, in contrast, contain more abstract semantic features), the spatial attention mask is computed on the shallow and deep feature maps by region and by pixel respectively, in order to reduce parameters and computation; the spatial attention mask N ∈ R^(H×W) is defined as follows:
N = Softmax(Conv(M, o, k, s, p))
where Conv(·) denotes a convolution operation with output channel o = 1, convolution kernel size k = 1 for the shallow convolutional layers and k = 3 for the deep ones, and s = 1 and p = 0 are the stride and padding of the convolution respectively; furthermore, to eliminate the influence of different feature map scales, the spatial attention mask is normalized by the Softmax function;
step 3.3 attention feature map
on the basis of the feature map X after channel attention mapping, the expansion of the spatial attention mask N is performed again: the spatial attention of each channel of X is calibrated according to the mask, finally generating the output feature map Y of the depthwise separable convolutional layer, which serves as the input of the next depthwise separable convolutional layer. It is defined as:
y_i = N ⊙ x_i
where ⊙ denotes point-by-point multiplication;
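Continuing the sketch under the same assumptions, the spatial attention mask and the final attention feature map Y = N ⊙ X might look as follows; the interpolation used to restore the mask size when k = 3 with p = 0 shrinks it is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Spatial mask N = Softmax(Conv(M, o=1, k, s=1, p=0)), broadcast over the channels of X."""
    def __init__(self, channels: int, k: int = 1):  # k = 1 for shallow layers, 3 for deep ones
        super().__init__()
        self.aggregate = nn.Conv2d(channels, 1, kernel_size=1)           # pointwise aggregation -> M
        self.conv = nn.Conv2d(1, 1, kernel_size=k, stride=1, padding=0)  # o = 1, s = 1, p = 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m = self.aggregate(x)                                # (B, 1, H, W)
        n = self.conv(m)                                     # (B, 1, H', W')
        b, _, h, w = n.shape
        n = F.softmax(n.flatten(1), dim=1).view(b, 1, h, w)  # normalize over all positions
        # With k = 3 and p = 0 the mask is 2 pixels smaller than X, so it is resized
        # back here; how the patent handles this is not stated (an assumption).
        n = F.interpolate(n, size=x.shape[2:], mode="nearest")
        return x * n                                         # Y = N ⊙ X

# Illustrative use of one attention-augmented layer:
# y = SpatialAttention(256)(ChannelAttention(256)(u))
```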
step 4, on the basis of the attention feature map, generating a candidate region of a target by using a region generation network RPN;
the RPN locates the regions of the traffic scene where traffic signs may appear, and the FL-CNN then classifies the signs according to these regions;
step 5, introducing context area information to a target candidate area generated by the RPN, and enhancing the classification characteristics of the marks;
the target candidate regions given in step 4 may contain only partial features of a traffic sign, so the spatially neighboring features of each candidate region are introduced to enhance the classification features of the sign; the specific steps are as follows:
(1) for convenience of description, a target candidate region is denoted as p = (p_x, p_y, p_w, p_h), where (p_x, p_y) is the center of the region and (p_w, p_h) represents the width and height of the region; on the attention feature map output by the last depthwise separable convolution, scale factors λ_1 and λ_2 are used to create context regions whose center coordinates are the same as those of the corresponding target candidate region; the relationship between a context region and its candidate region can be described as follows, where i is the serial number of the context region:
g_i = (p_x, p_y, λ_i · p_w, λ_i · p_h), i = 1, 2;
(2) each target candidate region and its context regions are divided into 7 parts in the horizontal and vertical directions by RoI-Pooling, and max-pooling downsampling is performed on each part, so that even regions of different sizes yield outputs of consistent dimension, generating 3 feature vectors of fixed size 7 × 7 × 512;
(3) along the spatial dimension, the feature vectors are concatenated into a 3 × 7 × 7 × 512 feature vector;
(4) a 1 × 1 convolution compresses the feature vector formed in (3) to 7 × 7 × 512, so that the dimension of the context-enhanced feature vector meets the node requirements of the fully-connected layer; the 1 × 1 convolution also learns the non-linear relation between background and target: when the introduced context region contains complex background, the convolution parameters suppress the background, and conversely, if local features of the target are introduced, the convolution parameters strengthen those features, as sketched below;
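A sketch of the context region pooling in steps (1) to (4), with torchvision's roi_align standing in for RoI-Pooling; the scale factor values λ are illustrative assumptions, since the patent's figures for them are not reproduced in this text:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ContextRegionPooling(nn.Module):
    """Pool each candidate box plus two concentric context boxes to 7x7x512,
    concatenate along channels, and compress back with a 1x1 convolution."""
    def __init__(self, channels: int = 512, scales=(1.0, 1.5, 2.0)):  # context scales are assumptions
        super().__init__()
        self.scales = scales
        self.compress = nn.Conv2d(len(scales) * channels, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor, boxes: torch.Tensor, stride: float) -> torch.Tensor:
        # feat: (1, 512, H, W) attention feature map; boxes: (R, 4) as (x1, y1, x2, y2)
        cx = (boxes[:, 0] + boxes[:, 2]) / 2
        cy = (boxes[:, 1] + boxes[:, 3]) / 2
        w = boxes[:, 2] - boxes[:, 0]
        h = boxes[:, 3] - boxes[:, 1]
        pooled = []
        for lam in self.scales:  # lam = 1.0 is the candidate region itself
            scaled = torch.stack([cx - lam * w / 2, cy - lam * h / 2,
                                  cx + lam * w / 2, cy + lam * h / 2], dim=1)
            rois = torch.cat([scaled.new_zeros(scaled.size(0), 1), scaled], dim=1)  # batch index 0
            pooled.append(roi_align(feat, rois, output_size=(7, 7), spatial_scale=1.0 / stride))
        stacked = torch.cat(pooled, dim=1)  # (R, 3*512, 7, 7)
        return self.compress(stacked)       # (R, 512, 7, 7), ready for the FC layers
```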
step 6, sending the feature vectors into a full-connection layer for classification and regression, and outputting the classes and positions of the traffic signs;
two fully-connected (FC) networks are adopted: the first FC network classifies the traffic sign, with a hidden layer of 4096 nodes and 44 output nodes, each output node representing one traffic sign with a value in the range (0, 1); at classification time the node with the largest output is taken as the traffic sign category; the second FC network performs position regression of the traffic sign, with a hidden layer of 4096 nodes and 4 output nodes representing the center-point coordinates, width, and height of the traffic sign;
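The two FC heads can be sketched as follows (a direct reading of the node counts above; treating the 4 regression outputs as center coordinates plus width and height is an assumption):

```python
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Two FC networks over the 7x7x512 context-enhanced feature vector."""
    def __init__(self, num_classes: int = 44):
        super().__init__()
        in_dim = 7 * 7 * 512
        self.cls_head = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 4096),
                                      nn.ReLU(inplace=True), nn.Linear(4096, num_classes))
        self.reg_head = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 4096),
                                      nn.ReLU(inplace=True), nn.Linear(4096, 4))

    def forward(self, x):
        scores = self.cls_head(x).softmax(dim=1)  # each node in (0, 1); argmax gives the sign class
        box = self.reg_head(x)                    # center x, center y, width, height (assumed layout)
        return scores, box
```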
step 7, establishing an attention loss function and training the FL-CNN model;
to ensure that the model is fully trained and to improve its generalization ability, an attention loss function is established that effectively distinguishes hard samples from easy ones, suppressing the loss of easily-classified samples and enhancing the loss of hard samples; FL-CNN training comprises two parts, the RPN and the fully-connected layer network, where the loss of the RPN consists of a binary classification loss and a regression loss, and the loss of the fully-connected network consists of a multi-classification loss and a regression loss;
(1) the attention loss function of the RPN network is as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_att(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*)
where p_i indicates the predicted probability that the i-th anchor is the target object and p_i* represents the real label of the object; t_i is a vector containing the center coordinates and the width and height information of the prediction box, and t_i* is the information vector of the real box; N_cls denotes the total number of anchors, N_reg denotes the size of the feature map, and λ is an adjustment coefficient; L_reg represents the regression loss of all the bounding boxes, and L_att is the attention binary classification loss;
the attention binary classification loss provided by the invention is specifically defined as follows:
L_att(x) = −p* · σ(−Kx) · log σ(x) − (1 − p*) · σ(Kx) · log σ(−x)
where σ is the sigmoid function, x is the predicted logit, −log σ(x) is the loss of a foreground sample, −log σ(−x) is the loss of a background sample, and K is a constant. The loss function has the following characteristic: if a sample is easily classified, −log σ(x) → 0 or −log σ(−x) → 0, that is, σ(x) → 1 or σ(−x) → 1; then, when K takes a large value, the loss adjustment coefficient σ(−Kx) → 0 in the foreground sample loss and the loss adjustment coefficient σ(Kx) → 0 in the background sample loss; if the sample is hard to classify, the loss adjustment coefficients of the foreground and background samples are σ(−Kx) → 1 and σ(Kx) → 1 respectively; the attention loss function therefore effectively distinguishes hard samples from easy ones and, by suppressing the loss of easy samples, makes RPN learning and training pay more attention to hard samples, ensuring that the RPN network is fully trained;
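A sketch of the attention binary classification loss as reconstructed above, with the modulation coefficients σ(−Kx) and σ(Kx) multiplying the foreground and background cross-entropy terms; the default value of K is an assumption:

```python
import torch
import torch.nn.functional as F

def attention_bce_loss(logits: torch.Tensor, labels: torch.Tensor, k: float = 2.0) -> torch.Tensor:
    """Foreground: -sigmoid(-Kx)·log sigmoid(x); background: -sigmoid(Kx)·log sigmoid(-x).
    Easy samples get a modulation coefficient near 0, hard samples near 1."""
    log_pos = F.logsigmoid(logits)       # log sigmoid(x)
    log_neg = F.logsigmoid(-logits)      # log sigmoid(-x)
    w_pos = torch.sigmoid(-k * logits)   # -> 0 for easy foreground, -> 1 for hard
    w_neg = torch.sigmoid(k * logits)    # -> 0 for easy background, -> 1 for hard
    loss = -(labels * w_pos * log_pos + (1 - labels) * w_neg * log_neg)
    return loss.mean()

# An easy foreground (x = 4) contributes almost nothing; a hard one (x = -4) keeps its loss:
# attention_bce_loss(torch.tensor([4.0, -4.0]), torch.tensor([1.0, 1.0]))
```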
(2) the attention loss function of the fully-connected layer network is as follows:
L = (1/N_cls) Σ_k L_mcls(x_k) + λ · (1/N_reg) Σ L_reg
where δ is the softmax function; similar to the RPN loss, it comprises a multi-classification loss and a regression loss, where L_mcls(x_k) = −δ(−Kx_k) · log δ(x_k) is the attention multi-classification loss, whose functional property is consistent with the attention binary classification loss: when the loss −log δ(x_k) of a sample tends to 0, its weight δ(−Kx_k) → 0, and conversely, when −log δ(x_k) is large, the weight δ(−Kx_k) → 1;
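A corresponding sketch of the attention multi-classification loss; reading δ(−Kx_k) as the k-th component of a softmax over the negated, scaled logits is an assumption drawn from the notation:

```python
import torch
import torch.nn.functional as F

def attention_ce_loss(logits: torch.Tensor, target: torch.Tensor, k: float = 2.0) -> torch.Tensor:
    """-delta(-K·x_k) · log delta(x_k) for the true class k: easy samples are down-weighted."""
    log_p = F.log_softmax(logits, dim=1).gather(1, target.unsqueeze(1)).squeeze(1)  # log delta(x_k)
    w = F.softmax(-k * logits, dim=1).gather(1, target.unsqueeze(1)).squeeze(1)     # delta(-K·x_k)
    return -(w.detach() * log_p).mean()  # weight detached so it only modulates (an assumption)
```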
Step 8, repeating steps 2 to 7 to finish the sample training of the FL-CNN model;
and step 9, starting the color camera to photograph the actual traffic scene, preprocessing the scene before inputting it to the model by setting the resolution to 2048 × 2048, inputting the scene into the FL-CNN model, and repeating steps 2 to 6 to finish the detection and identification of the traffic signs of the actual scene.
The step 1 in the scheme is specifically as follows:
(1) adopting the Tsinghua-Tencent 100K data set jointly published by Tsinghua University and Tencent, and selecting 44 types of common traffic signs as the remote detection and identification objects;
(2) dividing the Tsinghua-Tencent 100K data set into a training set and a test set at a ratio of 1:2;
(3) to ensure sample balance during FL-CNN model training, ensuring that each type of traffic sign has more than 100 scene instances in the training set; if a certain type of sign has fewer than 100 scene instances, filling it by a repeated sampling method, as sketched below.
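The class-balancing rule in (3) can be sketched in plain Python as follows; the sample representation and field name are illustrative:

```python
import random
from collections import defaultdict

def balance_by_oversampling(samples, min_per_class: int = 100):
    """Repeat-sample scene instances of any sign class with fewer than 100 examples.
    `samples` is a list of dicts with an illustrative 'label' field."""
    by_class = defaultdict(list)
    for s in samples:
        by_class[s["label"]].append(s)
    balanced = list(samples)
    for label, group in by_class.items():
        shortfall = min_per_class - len(group)
        if shortfall > 0:  # fill by repeated sampling with replacement
            balanced.extend(random.choices(group, k=shortfall))
    return balanced
```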
The step 4 in the scheme is specifically as follows:
(1) on the attention feature map output by the last depthwise separable convolution, taking each pixel as an anchor point, and adopting 3 ratios (1:1, 1:2, 2:1) and 3 sizes (4, 8, 16) to generate 9 anchors on the original traffic scene;
(2) removing anchor boxes that exceed the boundary of the original input image;
(3) removing heavily overlapping anchor boxes by non-maximum suppression, with a threshold of 0.7;
(4) determining positive and negative samples from the intersection-over-union IoU of the anchor box with the true target in the sample, where IoU > 0.7 gives a positive sample, IoU < 0.3 gives a negative sample, and anchor boxes with IoU between 0.3 and 0.7 are likewise removed; IoU is calculated as follows (see the sketch after this list):
IoU = area(A ∩ B) / area(A ∪ B)
where A is the anchor box and B is the real target box;
(5) according to translation invariance, each anchor box corresponds to a region suggestion box on the fused feature map;
(6) after all the region suggestion boxes pass through the fully-connected layer of the region generation network RPN, obtaining the target candidate regions.
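A sketch of the IoU computation and the anchor labelling rule in (4); the (x1, y1, x2, y2) box layout is an assumption:

```python
import torch

def iou_matrix(anchors: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """IoU = area(A ∩ B) / area(A ∪ B) for every anchor/ground-truth pair."""
    lt = torch.max(anchors[:, None, :2], gt[None, :, :2])  # intersection top-left
    rb = torch.min(anchors[:, None, 2:], gt[None, :, 2:])  # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """1 = positive (IoU > 0.7), 0 = negative (IoU < 0.3), -1 = removed (in between)."""
    best = iou_matrix(anchors, gt).max(dim=1).values
    labels = torch.full((len(anchors),), -1, dtype=torch.long)
    labels[best > 0.7] = 1
    labels[best < 0.3] = 0
    return labels
```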
Advantageous effects:
1. The invention mainly adopts deep learning technology to solve the remote detection and identification of road traffic signs. The method can run directly on a vehicle-mounted system with low power consumption and hardware performance; test results show that it effectively saves computing resources, the model size is only 76 Mb, and the identification accuracy reaches 92%, meeting the remote traffic sign detection and identification requirements of vehicle-mounted systems.
2. The invention establishes a lightweight attention-based convolutional neural network; the reasons for and advantages of establishing it are as follows:
Reason analysis: existing traffic sign detection models extract image features directly with a convolutional neural network. Although feature learning and extraction can thus be completed automatically from a massive data set, the exported model parameters of such a network occupy a large storage space (for example, the VGG-16 network reaches 527 Mb) and require a large memory and power consumption at run time. The vehicle-mounted system is an embedded low-power system with low hardware performance, so these huge models cannot be run directly to detect traffic signs. In the prior art, the detection and identification model is run on a remote server: the vehicle-mounted camera photographs the traffic scene, transmits it to the server through the network, and then receives the detection and identification result from the server through the network.
Advantage analysis: aiming at the above problems, the invention designs a lightweight attention-based convolutional neural network (FL-CNN). The model has the following advantages:
(1) in the convolutional feature extraction process, the FL-CNN replaces the original standard convolution of each convolutional layer with depthwise separable convolution, greatly reducing the number of model parameters and compressing the model storage space to 76 Mb, making it suitable for a vehicle-mounted system;
(2) in the detection stage of the FL-CNN, a channel-space attention module is devised: the original input feature map of the depthwise separable convolutional layer is compressed, masked, and expanded to construct an attention feature map with feature suppression or enhancement, so that the model can quickly and accurately detect targets in a complex background. The attention feature map reinforces the salient features and the region position information of the image through the channel and spatial dimensions respectively, so the model not only attends to this information during computation, saving computing resources, but also effectively improves detection speed and accuracy through the attended channel features and region reinforcement;
(3) in the identification stage, the FL-CNN devises an attention context mechanism. The target region extracted in the detection stage often covers the sign region incompletely, losing part of the sign's critical information. The attention context mechanism introduces local information of the traffic sign through the context of the target region, enhancing the classification features of the sign and effectively compensating for incomplete traffic sign regions from the detection stage. Meanwhile, pointwise convolution dynamically learns the non-linear relation between the target and the context region: if background interference is introduced, it is suppressed by reducing the pointwise convolution parameter values, and if local information of the sign is introduced, it is strengthened by increasing them. The attention context mechanism therefore plays an important role in the classification of traffic signs.
3. The invention establishes the attention loss function; the reasons for and advantages of establishing it are as follows:
Reason analysis: during model training and learning, as the number of iterations increases, more and more samples come to be detected and identified correctly. Although the original cross-entropy loss function suppresses the loss of easily-classified samples to some degree, their large number still dominates the overall training loss, so the training of the remaining hard samples is submerged and the generalization ability of the model suffers severely; that is, although the training accuracy is very high, the detection and identification ability of the model in practical application is weak.
Advantage analysis: aiming at this problem, the invention proposes an attention loss function to ensure that the model is fully trained and to improve its generalization ability. The advantages are as follows:
(1) the function uses a loss adjustment coefficient to enhance or suppress the training loss of a sample, effectively distinguishing hard samples from easy ones; by suppressing the loss of easy samples, the proportion of hard-sample loss in the total training loss is greatly increased, so the training process of the model shifts from treating all samples equally to concentrating on hard samples;
(2) the model focuses on training the hard samples and, on the basis of ensuring that easy samples are trained correctly, effectively prevents hard samples from being submerged by the large number of easy samples during training. Meanwhile, when a hard sample is trained correctly in some iteration, the change of the loss adjustment coefficient converts it into an easy sample, and if an easy sample is trained incorrectly it is converted back into a hard sample, so the loss adjustment process is dynamic;
(3) because the model pays more attention to hard samples in the training stage, it can fully explore the data feature rules hidden in the sample set during training; the trained model has strong generalization ability and high detection and identification accuracy in specific applications.
4. The invention adopts the Tsinghua-Tencent 100K large-scale traffic sign data set jointly issued by Tsinghua University and Tencent. The advantages are as follows:
(1) the data set is drawn from real street-view photographs and comprises 100,000 scene pictures and 30,000 traffic sign instances covering the 3 general categories of prohibition, warning, and indication signs, so its variety is complete. Each traffic scene picture has a resolution of 2048 × 2048, and traffic signs of (0, 32] and (32, 96] pixels account for 41.6% and 49.1% of the data set respectively, which makes it very suitable for training a remote traffic sign detection and recognition model; (2) the data set covers most variations of illumination, weather, and the like, so a target detection model trained on it can adapt to the detection and recognition of complex and changeable remote traffic signs, providing more safety and equipment response time for intelligent navigation equipment in unmanned and assisted driving.
5. Based on the VGG-16 convolutional neural network, the invention adopts depthwise separable convolution in place of standard convolution, changing the convolution calculation from direct channel expansion to a combined calculation (single-channel depthwise convolution plus cross-channel pointwise convolution), realizing model parameter compression, constructing the lightweight convolutional neural network, saving computing resources, and reducing hardware storage.
Fourth, description of the drawings:
FIG. 1 is an internal structure diagram of the FL-CNN target detection recognition model of the present invention.
Fig. 2 is a channel-space attention module of the present invention.
FIG. 3 is a flow chart of the method of the present invention.
Fig. 4 shows the 44 common traffic signs for remote detection and identification in the present invention, divided into three categories: indication, warning, and prohibition; in the figure, a letter prefix represents a class of signs, where il: il100, il60, il80; ph: ph4, ph4.5, ph5; pm: pm20, pm30, pm55; pl: pl5, pl20, pl30, pl40, pl50, pl60, pl70, pl80, pl100, pl120.
Fig. 5 is a statistical chart comparing the detection and identification accuracy of the present invention and other methods on the 44 common traffic signs.
Fifth, specific embodiments:
the invention is further described below with reference to the accompanying drawings:
the method for detecting and identifying long-distance traffic signs suitable for a vehicle-mounted system effectively overcomes the defects of low identification accuracy, large models, and short detection distance, and provides a new target detection framework, named the attention-based lightweight convolutional neural network (FL-CNN). First, in the detection stage, the FL-CNN adopts depthwise separable convolution to reduce the number of model parameters, effectively realizing model compression; meanwhile, a channel-space attention module is devised, in which the original input feature map of the depthwise separable convolutional layer is compressed, masked, and expanded to construct an attention feature map with feature suppression or enhancement, improving the model's small-target detection capability. Second, in the identification stage of the FL-CNN, an attention context region mechanism is devised, which introduces local information of the traffic sign, enhances the classification features of the sign, and compensates for incomplete traffic sign regions from the detection stage. Third, in the training stage of the FL-CNN, the attention loss function distinguishes hard samples from easy ones, improving the training and generalization ability of the model. Finally, the model is trained on the remote traffic sign detection data set Tsinghua-Tencent 100K, jointly published by Tsinghua University and Tencent.
The method comprises the following specific steps:
step 1, preprocessing a traffic sign image sample set;
(1) adopting the Tsinghua-Tencent 100K data set jointly published by Tsinghua University and Tencent, and selecting 44 types of common traffic signs as the remote detection and identification objects;
(2) dividing the Tsinghua-Tencent 100K data set into a training set and a test set at a ratio of 1:2;
(3) to ensure sample balance during model training, each type of traffic sign must have more than 100 scene instances in the training set; if a certain type of sign has fewer than 100 scene instances, it is filled by a repeated sampling method.
Step 2, constructing a lightweight convolutional neural network to complete the convolutional characteristic extraction of the traffic sign;
(1) this step plays an important role in compressing the sign detection and identification model: the joint channel-space mapping of the standard convolution in the original VGG-16 is separated into independent channel and spatial mappings by depthwise separable convolution, effectively reducing the number of model parameters and the hard disk storage. The network comprises 5 convolutional layers in total, each consisting of a depthwise convolution part and a pointwise convolution part, with ReLU as the activation function;
(2) in the lightweight convolutional neural network, the computation of the depthwise separable convolution compares with that of the original standard convolution as follows:
let the convolution kernel be (D_K, D_K, C), where D_K is the width and height of the convolution kernel and C is the number of channels of the kernel. In the convolution calculation, depthwise separable convolution converts the N standard convolutions (D_K, D_K, M) of the original cross-channel calculation into M depthwise convolutions (D_K, D_K, 1) and N cross-channel pointwise convolutions (1, 1, M), where the depthwise convolution is a single-channel computation and the pointwise convolution is a cross-channel computation. The input feature map is recorded as {D_F, D_F, M} and the output feature map as {D_F, D_F, N}, where D_F represents the width and height of the feature map. The amount of computation for each convolution is as follows:
① the computation of the standard convolution is: Count_s = D_K × D_K × M × N × D_F × D_F;
② the computation of the depthwise convolution is: Count_d = D_K × D_K × M × D_F × D_F;
③ the computation of the pointwise convolution is: Count_p = M × N × D_F × D_F.
The comparison between the computation of depthwise separable convolution and standard convolution is therefore
(Count_d + Count_p) / Count_s = 1/N + 1/D_K²
that is, each use of depthwise separable convolution reduces the computation to (1/N + 1/D_K²) times that of the original standard convolution;
step 3, constructing an attention feature map through the channel-space attention module embedded in the lightweight convolutional neural network;
the main purpose of this step is to design the attention feature map, which imitates the human attention mechanism, enhances the convolutional features of small traffic signs in the scene, suppresses the features of irrelevant background information, saves computing resources, and improves detection accuracy. The invention therefore provides a channel-space attention module that can be embedded into the depthwise separable convolutional layers, applying feature attention (suppression or enhancement) along the channel and spatial dimensions to the output feature map of each layer. Channel attention uses the interrelation and importance of the channels to attend to "what" is most meaningful in the image; spatial attention, by contrast, attends to the location features of the target in the image, i.e. "where", which is more effective for image detection and identification.
Denote the output feature map of a depthwise separable convolutional layer as U = [u_1, u_2, …, u_C], u_i ∈ R^(H×W), where C is the number of channels and H and W are the height and width of the feature map. After passing through the channel-space attention module, the constructed attention feature map is Y = [y_1, y_2, …, y_C], y_i ∈ R^(H×W). The specific calculation is described in steps 3.1, 3.2 and 3.3.
Step 3.1 Channel attention
To compute channel attention, the invention first compresses the spatial dimensions of each channel of the feature map along the channel direction, aggregating the spatial information with three global pooling modes (max, average, and stochastic), where max and average pooling preserve the texture and background features of the image respectively, and stochastic pooling lies between the two.
First, by global max pooling, each u_i is compressed to a channel attention mask component s_max^i, defined as:
s_max^i = max_{1≤h≤H, 1≤w≤W} u_i(h, w)
Similarly, each u_i is compressed by global average pooling and global stochastic pooling to the channel attention mask components S_mean and S_sto, defined as:
s_mean^i = (1 / (H × W)) Σ_{h=1..H} Σ_{w=1..W} u_i(h, w)
with s_sto^i obtained by sampling an activation of u_i with probability proportional to its value, where S_max, S_mean, S_sto ∈ R^C.
Secondly, the three channel attention mask components constructed by pooling compression serve as inputs of the multilayer perceptron model; aggregation is completed through point-by-point multiplication of the weight parameters with the mask components, accumulation, and activation functions, further increasing the non-linearity. The channel attention mask S = [s_1, s_2, …, s_C] of the feature map U is defined as follows:
S = σ(W_1 δ(W_0 S_max) + W_1 δ(W_0 S_mean) + W_1 δ(W_0 S_sto))
where σ is the sigmoid function, δ is the ReLU function, and W_0 and W_1 are the weights of the multilayer perceptron model, shared by the three channel attention mask components.
Finally, the channel attention mask is expanded onto the original input feature map U, and the weight of each channel of U is recalibrated according to the mask. The new feature map after channel attention mapping is recorded as X = [x_1, x_2, …, x_C], x_i ∈ R^(H×W), specifically defined as:
x_i = s_i · u_i
step 3.2 spatial attention
In order to calculate the spatial attention mapping and construct the characteristic relationship between pixels or regions in the characteristic diagram, the channel number of the characteristic diagram is firstly compressed, and a group of pointwise convolutions are adopted to carry out the original input characteristic diagram of the depth separation convolution layer
Figure GDA0002114862600000142
Cross-channel aggregation, the aggregated feature map is recorded as
Figure GDA0002114862600000143
Secondly, because the characteristics of the feature maps of different layers are greatly different, the resolution of the shallow feature map is higher, and the deep feature map is opposite to the shallow feature map and contains more abstract semantic features. Therefore, when the spatial attention mask is constructed, in order to reduce parameters and calculation amount, the spatial attention mask is respectively carried out according to regions and pixels in the shallow layer characteristic diagram and the deep layer characteristic diagram. For convenience of description, the spatial attention mask is recorded as
Figure GDA0002114862600000144
It is defined as follows:
N=Softmax(Conv(M,o,k,s,p))
where Conv (·) represents a convolution operation, the output channel o is 1, the convolution kernel size k of the shallow convolution layer is 1, and the convolution kernel size k of the deep convolution layer is 3. s-1 and p-0 are the step size and padding of the convolution, respectively. In addition, in order to eliminate the influence of different feature map scales, the spatial attention mask is normalized by using a Softmax function.
Step 3.3 Attention feature map
On the basis of the feature map X after channel attention mapping, the expansion of the spatial attention mask N is performed again: the spatial attention of each channel of X is calibrated according to the mask, finally generating the output feature map Y of the depthwise separable convolutional layer, which serves as the input of the next depthwise separable convolutional layer. It is defined as:
y_i = N ⊙ x_i
where ⊙ denotes point-by-point multiplication.
Step 4, on the basis of the attention feature map, generating a candidate region of the target by adopting a region generation network RPN
The goal of this step is to locate the regions of the traffic scene where traffic signs may appear; the FL-CNN then classifies the signs according to these regions. The relevant details are as follows:
(1) on the attention feature map output by the last depthwise separable convolution, taking each pixel as an anchor point, and adopting 3 sizes (4, 8, 16) and 3 ratios (1:1, 1:2, 2:1) to generate 9 anchors on the original traffic scene;
(2) removing anchor boxes that exceed the boundary of the original input image;
(3) removing heavily overlapping anchor boxes by non-maximum suppression, with a threshold of 0.7;
(4) determining positive and negative samples from the intersection-over-union IoU of the anchor box with the true target in the sample, where IoU > 0.7 gives a positive sample, IoU < 0.3 gives a negative sample, and anchor boxes with IoU between 0.3 and 0.7 are likewise removed; IoU is calculated as follows:
IoU = area(A ∩ B) / area(A ∪ B)
where A is the anchor box and B is the real target box;
(5) according to translation invariance, each anchor box corresponds to a region suggestion box on the fused feature map;
(6) after all the region suggestion boxes pass through the fully-connected layer of the region generation network RPN, the target candidate regions are obtained.
Step 5, introducing context area information to a target candidate area generated by the RPN, and enhancing the classification characteristics of the marks;
the target candidate regions given in step 4 may contain only partial features of a traffic sign, so the invention introduces context region information, using the spatially neighboring features of each candidate region to enhance the classification features of the sign. The specific steps are as follows:
(1) for convenience of description, a target candidate region is denoted as p = (p_x, p_y, p_w, p_h), where (p_x, p_y) is the center of the region and (p_w, p_h) represents the width and height of the region; on the attention feature map output by the last depthwise separable convolution, scale factors λ_1 and λ_2 are used to create context regions whose center coordinates are the same as those of the corresponding target candidate region. The relationship between a context region and its candidate region can be described as follows, where i is the serial number of the context region:
g_i = (p_x, p_y, λ_i · p_w, λ_i · p_h), i = 1, 2;
(2) each target candidate region and its context regions are divided into 7 parts in the horizontal and vertical directions by RoI-Pooling, and max-pooling downsampling is performed on each part, so that even regions of different sizes yield outputs of consistent dimension, generating 3 feature vectors of fixed size 7 × 7 × 512.
(3) Along the spatial dimension, the feature vectors are concatenated into a 3 × 7 × 7 × 512 feature vector;
(4) finally, a 1 × 1 convolution compresses the feature vector formed in (3) to 7 × 7 × 512. This step makes the dimension of the context-enhanced feature vector meet the node requirements of the fully-connected layer. It is noted that the 1 × 1 convolution can learn the non-linear relation between background and target: when the introduced context region contains complex background, the convolution parameters suppress the background; conversely, if local features of the target are introduced, the convolution parameters strengthen those features.
Step 6, sending the feature vectors into a full-connection layer for classification and regression, and outputting the classes and positions of the traffic signs;
two fully-connected (FC) networks are adopted: the first FC network classifies the traffic sign, with a hidden layer of 4096 nodes and 44 output nodes, each output node representing one traffic sign with a value in the range (0, 1); at classification time the node with the largest output is taken as the traffic sign category. The second FC network performs position regression of the traffic sign, with a hidden layer of 4096 nodes and 4 output nodes representing the center-point coordinates, width, and height of the traffic sign.
Step 7, establishing a attention loss function and training the FL-CNN model
To ensure that the model is fully trained and to improve its generalization ability, the invention provides an attention loss function, which effectively distinguishes hard samples from easy ones, suppresses the loss of easily-classified samples, and enhances the loss of hard samples. The training of the FL-CNN model comprises two parts, the RPN and the fully-connected layer network, where the loss of the RPN consists of a binary classification loss and a regression loss, and the loss of the fully-connected network consists of a multi-classification loss and a regression loss.
(1) The attention loss function of the RPN network is as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_att(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*)
where p_i indicates the predicted probability that the i-th anchor is the target object and p_i* represents the real label of the object; t_i is a vector containing the center coordinates and the width and height information of the prediction box, and t_i* is the information vector of the real box; N_cls denotes the total number of anchors, N_reg denotes the size of the feature map, and λ is an adjustment coefficient; L_reg represents the regression loss of all the bounding boxes, and L_att is the attention binary classification loss.
The attention binary classification loss provided by the invention is specifically defined as follows:
L_att(x) = −p* · σ(−Kx) · log σ(x) − (1 − p*) · σ(Kx) · log σ(−x)
where σ is the sigmoid function, x is the predicted logit, −log σ(x) is the loss of a foreground sample, −log σ(−x) is the loss of a background sample, and K is a constant. The loss function has the following characteristic: if a sample is easily classified, −log σ(x) → 0 or −log σ(−x) → 0, that is, σ(x) → 1 or σ(−x) → 1; then, when K takes a large value, the loss adjustment coefficient σ(−Kx) → 0 in the foreground sample loss and the loss adjustment coefficient σ(Kx) → 0 in the background sample loss. If the sample is hard to classify, the loss adjustment coefficients of the foreground and background samples are σ(−Kx) → 1 and σ(Kx) → 1 respectively. The attention loss function provided by the invention therefore effectively distinguishes hard samples from easy ones and, by suppressing the loss of easy samples, makes RPN learning and training pay more attention to hard samples, effectively ensuring that the RPN network is sufficiently trained.
(2) The attention loss function of the fully-connected layer network is as follows:
L = (1/N_cls) Σ_k L_mcls(x_k) + λ · (1/N_reg) Σ L_reg
where δ is the softmax function. The loss function is similar to that of the RPN and comprises a multi-classification loss and a regression loss, where L_mcls(x_k) = −δ(−Kx_k) · log δ(x_k) is the attention multi-classification loss proposed by the invention. Its functional property is consistent with the attention binary classification loss: when the loss −log δ(x_k) of a sample tends to 0, its weight δ(−Kx_k) → 0; conversely, when −log δ(x_k) is large, the weight δ(−Kx_k) → 1.
Step 8, repeating steps 2 to 7 to complete the sample training of the FL-CNN model;
and step 9, starting the color camera to photograph the actual traffic scene, preprocessing the scene before inputting it to the model by setting the resolution to 2048 × 2048, inputting it into the FL-CNN model, and repeating steps 2 to 6 to complete the detection and identification of the traffic signs of the actual scene.
In the aspect of the model's learning algorithm, a new attention loss function is devised so that the model obtains full learning from the sample set. The structure of the model mainly comprises the following 5 components, of which (1), (2) and (4) differ markedly from other models and were each devised independently, aiming at the low power consumption and weak hardware performance of the vehicle-mounted system and at improving detection and identification accuracy:
(1) Lightweight convolutional neural network: a convolutional neural network constructed with depthwise separable convolution, which reduces the number of model parameters, realizes model compression, and automatically extracts the convolutional features of the traffic signs layer by layer from the traffic scene;
(2) channel-space attention module: a technology specific to long-distance traffic sign detection and an important component of the model; it can be embedded into each convolutional layer of the lightweight convolutional neural network to construct the attention feature map, realizing suppression or enhancement of information in the original feature map, effectively saving the model's computing resources and improving the detection capability of the method;
(3) region generation network (RPN): generates a certain number of target suggestion regions from the attention feature map output by the last depthwise separable convolutional layer;
(4) context region pooling layer: a technology that effectively enhances the information capacity of the target region and an important component of the model; it establishes the classification features of the attended context region information through region pooling, regularization, concatenation, and compression;
(5) fully-connected network (FC): mainly responsible for the specific classification and position calculation on the compressed attention context region features.
Example:
the resolution of each traffic scene in the model training data set Tsinghua-Tencent 100K is 2048 × 2048 pixels, and traffic signs of (0, 32] and (32, 96] pixels account for 41.6% and 49.1% of the data set respectively; that is, 90.7% of the traffic signs occupy less than 1% of the traffic scene, which belongs to the situation of long-distance traffic sign detection and identification.
(1) Data set processing: in the training of the FL-CNN model adopted by the invention, to keep the sample set balanced, a resampling method is applied to any class of traffic sign with fewer than 100 traffic scenes. The ratio of the training set to the test set is 1:2;
(2) comparison index: in the test, the accuracy measure F1-measure, commonly used to evaluate precision, is taken as the detection and identification index; the higher its value, the higher the detection and identification accuracy;
(3) comparison models: to verify the effectiveness of the FL-CNN, its detection accuracy is compared with Fast R-CNN and Faster R-CNN, currently the most common target detection frameworks;
(4) in addition, to verify in detail the detection accuracy for traffic signs at different distances, the detection accuracy of the invention is compared for traffic signs of different sizes (different distances), covering three size ranges, (0, 32], (32, 96] and (96, 200] pixels, where (0, 32] and (32, 96] pixel signs belong to long distance and (96, 200] pixel signs to medium distance.
FL-CNN model training and field test conditions (the test verifies the feasibility of steps 1 to 8 of the method):
Firstly, the hard disk storage occupied by each model after export is compared. Fast R-CNN and Faster R-CNN are both based on the VGG-16 network, and their exported parameters occupy 558 Mb and 582 Mb respectively, while the FL-CNN model of the invention occupies only 76 Mb. Compared with Fast R-CNN and Faster R-CNN, the storage of the FL-CNN model is reduced by about 80%, so it can run directly on a vehicle-mounted system with low power consumption and hardware performance. The depthwise separable convolution adopted by the invention thus effectively reduces the number of model parameters and the hard disk storage. Meanwhile, the channel-space attention module and the attention context region of the invention have little influence on the storage of the model and effectively save computing resources.
Second, the traffic sign detection and recognition accuracy of FL-CNN is compared with the common target detection frameworks Fast R-CNN and Faster R-CNN. As Table 1 shows, the detection and recognition precision at all three distances is clearly improved: compared with Fast R-CNN, the precision on (0, 32] pixel signs increases by 55 percentage points, the precision on (32, 96] and (96, 200] pixel signs increases by 20 and 9 percentage points respectively, and the overall average increases by 28 percentage points, which fully illustrates the effectiveness of the three attention mechanisms (channel-spatial attention, context attention and loss-function attention).
Then, to further verify the influence of the three attention mechanisms on the detection and recognition accuracy of the FL-CNN model, Table 1 also compares the accuracy after removing each attention mechanism, where "-" indicates that the mechanism is removed from the model. The results show that removing any of the attention mechanisms degrades the detection and recognition accuracy, with the channel-spatial attention having the largest influence. Moreover, once all three mechanisms are removed, the FL-CNN model degenerates into the Faster R-CNN model.
TABLE 1 comparison of detection and recognition accuracy (%)
(Table 1 appears as an image in the original publication: per-distance detection and recognition accuracy of each framework and of each FL-CNN ablation.)
Finally, FIG. 5 shows a statistical comparison of the recognition accuracy of the 44 traffic sign classes under the different detection and recognition frameworks; the detection and recognition accuracy of FL-CNN is higher than that of Fast R-CNN and Faster R-CNN on every traffic sign class.

Claims (3)

1. A remote traffic sign detection and identification method suitable for a vehicle-mounted system is characterized by comprising the following steps:
firstly, preprocessing a traffic sign image sample set;
constructing a lightweight convolutional neural network to complete the convolutional feature extraction of the traffic sign;
(1) depthwise separable convolution is used to decompose the joint channel-spatial mapping of the original VGG-16 standard convolution into separate channel and spatial mappings, reducing the number of model parameters and the hard disk storage; the lightweight convolutional neural network contains 5 convolutional layers in total, each consisting of a depthwise convolution and a pointwise convolution, with ReLU as the activation function;
(2) in the lightweight convolutional neural network, the computation amount of the deep separation convolution and the original standard convolution is compared as follows:
let the convolution kernel be (D_K, D_K, C), where D_K is the width and height of the kernel and C is its number of channels; in the convolution calculation, depthwise separable convolution converts the N cross-channel standard convolutions of size (D_K, D_K, M) into M depthwise convolutions of size (D_K, D_K, 1) and N cross-channel pointwise convolutions of size (1, 1, M), where a depthwise convolution operates on a single channel and a pointwise convolution operates across channels; denote the input feature map as {D_F, D_F, M} and the output feature map as {D_F, D_F, N}, where D_F is the width and height of the feature map; the computation of each convolution is then:
the computation of the standard convolution: Count_s = D_K × D_K × M × N × D_F × D_F;
the computation of the depthwise convolution: Count_d = D_K × D_K × M × D_F × D_F;
the computation of the pointwise convolution: Count_p = M × N × D_F × D_F;
therefore, the ratio of the computation of depthwise separable convolution to that of standard convolution is
(Count_d + Count_p) / Count_s = 1/N + 1/D_K²,
i.e., each use of depthwise separable convolution reduces the computation to 1/N + 1/D_K² of the original standard convolution; a sketch of such a layer follows;
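As a concrete illustration, a depthwise separable layer can be written in a few lines of PyTorch; this is a minimal sketch (the kernel size, stride and the placement of ReLU after both sub-convolutions are assumptions beyond what the claim states):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # Replaces a standard KxK convolution (M -> N channels) with a depthwise KxK
    # convolution (groups = M) followed by a 1x1 pointwise convolution, cutting
    # the computation to roughly 1/N + 1/K^2 of the standard convolution.
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, stride, padding=k // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.pointwise(self.relu(self.depthwise(x))))
```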
constructing an attention feature map through the channel-spatial attention module embedded into the lightweight convolutional neural network;
the channel-spatial attention module is embedded into the depthwise separable convolutional layers and applies feature attention in both the channel and spatial dimensions to the output feature map of each depthwise separable convolutional layer; channel attention exploits the interrelation and relative importance of the channels and is most meaningful for attending to "what" is in an image, while spatial attention focuses on the location features of the target in the image, i.e. "where", which is more effective for image detection and recognition;
denote the output feature map of a depthwise separable convolutional layer as U = [u_1, u_2, …, u_C], u_i ∈ ℝ^{H×W}, where C is the number of channels and H and W are the height and width of the feature map; after the channel-spatial attention module, the constructed attention feature map is Y = [y_1, y_2, …, y_C], y_i ∈ ℝ^{H×W}; the calculation proceeds as follows:
step 3.1 channel attention
firstly, the spatial dimension of each channel of the feature map is compressed along the channel direction, and the spatial information is aggregated with three global pooling modes: maximum, average and stochastic; maximum and average pooling retain the texture and background features of the image respectively, while stochastic pooling lies between the two;
first, according to global maximum pooling, each u_i is compressed to the channel attention mask component s_max^i ∈ ℝ, defined as:
s_max^i = max_{1≤h≤H, 1≤w≤W} u_i(h, w);
similarly, each u_i is compressed to the channel attention mask components s_mean^i and s_sto^i according to global average pooling and global stochastic pooling, defined as:
s_mean^i = (1 / (H × W)) Σ_{h=1}^{H} Σ_{w=1}^{W} u_i(h, w),
s_sto^i = u_i(h', w'), where the location (h', w') is sampled according to the probabilities r_{h,w} = u_i(h, w) / Σ_{h=1}^{H} Σ_{w=1}^{W} u_i(h, w);
secondly, the three channel attention mask components constructed by pooling compression are used as inputs of a multilayer perceptron model, and aggregation is completed through point-wise multiplication of the weight parameters with the mask components, accumulation and activation functions, which further increases the nonlinearity; the channel attention mask S = [s_1, s_2, …, s_C] of the feature map U is defined as:
S = σ(W_1 δ(W_0 S_max) + W_1 δ(W_0 S_mean) + W_1 δ(W_0 S_sto)),
where σ is the sigmoid function, δ is the ReLU function, and W_0 and W_1 are the weights of the multilayer perceptron model, shared by the three channel attention mask components;
finally, the channel attention mask is expanded to the original input feature map U, and the weight of each channel of U is recalibrated according to the mask; the new feature map after channel attention mapping is denoted X = [x_1, x_2, …, x_C], x_i ∈ ℝ^{H×W}, and is defined as:
x_i = s_i · u_i;
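A PyTorch-style sketch of this channel attention step is given below; the MLP reduction ratio and the sampling-based implementation of stochastic pooling are assumptions, since the text does not fix them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    # Compresses each channel by max, average and stochastic global pooling,
    # passes the three mask components through a shared two-layer MLP (W0, W1),
    # sums them and applies a sigmoid to obtain the channel mask S (step 3.1).
    def __init__(self, channels, reduction=16):  # the reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    @staticmethod
    def _stochastic_pool(u):
        # Sample one location per channel with probability proportional to activation.
        b, c, h, w = u.shape
        flat = u.flatten(2).clamp(min=0) + 1e-12           # (B, C, H*W), non-negative
        probs = flat / flat.sum(dim=2, keepdim=True)
        idx = torch.multinomial(probs.view(b * c, -1), 1)  # sampled location per channel
        return flat.view(b * c, -1).gather(1, idx).view(b, c)

    def forward(self, u):
        s_max = self.mlp(F.adaptive_max_pool2d(u, 1).flatten(1))
        s_mean = self.mlp(F.adaptive_avg_pool2d(u, 1).flatten(1))
        s_sto = self.mlp(self._stochastic_pool(u))
        s = torch.sigmoid(s_max + s_mean + s_sto)          # channel attention mask S
        return u * s.unsqueeze(-1).unsqueeze(-1)           # recalibrated feature map X
```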
step 3.2 spatial attention
in order to compute the spatial attention mapping and construct the feature relationships between pixels or regions in the feature map, the number of channels of the feature map is first compressed: a group of pointwise convolutions aggregates the original input feature map of the depthwise separable convolutional layer, U ∈ ℝ^{C×H×W}, across channels, and the aggregated feature map is denoted M ∈ ℝ^{1×H×W};
secondly, because the characteristics of feature maps at different layers differ greatly (shallow feature maps have higher resolution, while deep feature maps, on the contrary, contain more abstract semantic features), the spatial attention mask is computed by region in the shallow feature maps and by pixel in the deep feature maps in order to reduce parameters and computation; the spatial attention mask N ∈ ℝ^{1×H×W} is defined as:
N = Softmax(Conv(M, o, k, s, p)),
where Conv(·) denotes a convolution operation with output channel o = 1; the convolution kernel size is k = 1 for the shallow convolutional layers and k = 3 for the deep convolutional layers; s = 1 and p = 0 are the stride and padding of the convolution; furthermore, to eliminate the influence of different feature map scales, the spatial attention mask is normalized with the Softmax function;
step 3.3 attention feature map
on the basis of the feature map X ∈ ℝ^{C×H×W} obtained after channel attention mapping, the spatial attention mask N ∈ ℝ^{1×H×W} is applied again: each channel of the feature map X is recalibrated according to the spatial attention mask, finally generating the output feature map Y ∈ ℝ^{C×H×W} of the depthwise separable convolutional layer, which serves as the input of the next depthwise separable convolutional layer; it is defined as:
y_i = N ⊗ x_i,
where ⊗ denotes point-wise multiplication;
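The spatial branch and the final recalibration can likewise be sketched in PyTorch; the padding used here is an assumption (the text states p = 0, but the sketch pads so the mask keeps the input's spatial size):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # A pointwise convolution aggregates channels into M; a small convolution plus
    # a Softmax over all positions produces the spatial mask N; the channel-attended
    # map X is then recalibrated by broadcasting N over its channels (step 3.3).
    def __init__(self, channels, k=3):  # k = 1 for shallow layers, k = 3 for deep layers
        super().__init__()
        self.aggregate = nn.Conv2d(channels, 1, kernel_size=1)
        self.conv = nn.Conv2d(1, 1, kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        m = self.aggregate(x)                                     # M: (B, 1, H, W)
        n = self.conv(m)                                          # spatial logits
        b, _, h, w = n.shape
        n = torch.softmax(n.view(b, -1), dim=1).view(b, 1, h, w)  # normalized mask N
        return x * n                                              # Y: broadcast over channels
```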
fourthly, on the basis of the attention feature map, a region proposal network RPN is adopted to generate candidate regions of the target,
locating the areas of the traffic scene where traffic signs may appear; the FL-CNN then classifies the signs according to these areas;
introducing context region information into the target candidate regions generated by the RPN, and enhancing the sign classification features;
since a target candidate region given in step four may contain only part of the features of the traffic sign, the spatially neighboring features of the candidate region are introduced to enhance the classification features, specifically as follows:
(1) for convenience of description, a target candidate region is denoted p = (p_x, p_y, p_w, p_h), where (p_x, p_y) is the center of the region and (p_w, p_h) are its width and height; on the attention feature map output by the last depthwise separable convolution, scale factors λ_1 and λ_2 are used to create context regions q^i = (q_x^i, q_y^i, q_w^i, q_h^i) whose center coordinates are the same as those of the corresponding target candidate region; the relationship between a context region and its candidate region can be described as follows, where i is the index of the context region:
q_x^i = p_x,  q_y^i = p_y,  q_w^i = λ_i · p_w,  q_h^i = λ_i · p_h;
(2) each target candidate region and its context regions are divided into 7 × 7 parts in the horizontal and vertical directions by RoI-Pooling, and each part is down-sampled by max pooling, so that even regions of different sizes yield outputs of consistent dimension, generating 3 feature vectors of fixed size 7 × 7 × 512;
(3) the 3 feature vectors are concatenated along the channel dimension into a 7 × 7 × (3 × 512) feature;
(4) a 1 × 1 convolution compresses the concatenated feature of (3) back to 7 × 7 × 512, so that the dimension of the feature vector with the introduced context regions meets the node requirements of the fully connected layer; the 1 × 1 convolution also learns the nonlinear relationship between background and target: when an introduced context region contains complex background, the convolution parameters suppress the background, and conversely, if local features of the target are introduced, the convolution parameters strengthen those features; a sketch of this pooling scheme is given below;
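A sketch of the context-region pooling under stated assumptions: the scale factors λ_i are not given numerically in the text, so 1.5 and 2.0 below are placeholders, torchvision's roi_pool stands in for the RoI-Pooling step, and the spatial_scale default is an assumption about the backbone stride:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class ContextRegionPooling(nn.Module):
    # Pools the proposal itself (scale 1.0) plus two enlarged context regions to
    # 7x7, concatenates the three results along channels, and compresses back to
    # 512 channels with a 1x1 convolution.
    def __init__(self, channels=512, scales=(1.0, 1.5, 2.0)):
        super().__init__()
        self.scales = scales
        self.compress = nn.Conv2d(channels * len(scales), channels, kernel_size=1)

    def forward(self, feat, rois, spatial_scale=1.0 / 16):
        # rois: (R, 5) tensor of [batch_index, x1, y1, x2, y2] in image coordinates.
        pooled = []
        for s in self.scales:
            cx = (rois[:, 1] + rois[:, 3]) / 2
            cy = (rois[:, 2] + rois[:, 4]) / 2
            w = (rois[:, 3] - rois[:, 1]) * s
            h = (rois[:, 4] - rois[:, 2]) * s
            boxes = torch.stack(
                [rois[:, 0], cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
            pooled.append(roi_pool(feat, boxes, output_size=(7, 7),
                                   spatial_scale=spatial_scale))
        return self.compress(torch.cat(pooled, dim=1))  # (R, 512, 7, 7)
```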
step six, sending the feature vector into the fully connected layers for classification and regression, and outputting the class and position of the traffic sign;
two fully connected (FC) networks are adopted: the first FC network classifies the traffic sign, with a hidden layer of 4096 nodes and 44 output nodes, each output node representing one traffic sign class with a value in (0, 1), and the node with the largest output giving the traffic sign class; the second FC network performs position regression of the traffic sign, with a hidden layer of 4096 nodes and 4 output nodes representing the center coordinates, width and height of the traffic sign; a sketch of these two heads follows;
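A minimal sketch of the two FC heads; the input dimension follows the 7 × 7 × 512 feature above, and the node counts mirror those stated in the claim:

```python
import torch.nn as nn

class DetectionHeads(nn.Module):
    # First FC network: 44-way sign classifier; second FC network: 4-value box
    # regressor (center coordinates, width, height); both with 4096 hidden nodes.
    def __init__(self, in_dim=7 * 7 * 512, hidden=4096, num_classes=44):
        super().__init__()
        self.cls = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, hidden),
                                 nn.ReLU(inplace=True), nn.Linear(hidden, num_classes))
        self.reg = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, hidden),
                                 nn.ReLU(inplace=True), nn.Linear(hidden, 4))

    def forward(self, x):
        return self.cls(x), self.reg(x)
```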
step seven, establishing an attention loss function and training the FL-CNN model
to ensure the model is fully trained and to improve its generalization ability, an attention loss function is established that effectively distinguishes hard samples from easy ones, suppressing the loss of easily separable samples and enhancing the loss of hard samples; FL-CNN training involves two parts, the RPN network and the fully connected network: the RPN loss consists of a binary classification loss and a regression loss, and the fully connected network loss consists of a multi-class classification loss and a regression loss;
(1) the attention loss function of the RPN network is as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_att(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*),
where p_i denotes the predicted probability that the i-th anchor is the target object, p_i* is the real label of the object, t_i is a vector containing the center coordinates, width and height of the prediction box, t_i* is the information vector of the real box, N_cls denotes the total number of anchors, N_reg is the size of the feature map, λ is a balancing coefficient, and L_reg denotes the regression loss of all the bounding boxes; L_att(·) is the attention binary classification loss, specifically defined as:
L_att(x) = -σ(-Kx) · p* · log σ(x) - σ(Kx) · (1 - p*) · log σ(-x),
where σ is the sigmoid function, the loss of a foreground sample is -log σ(x), the loss of a background sample is -log σ(-x), and K is a constant; the loss function has the following property: if a sample is easily separable, σ(x) → 1 for a foreground sample or σ(-x) → 1 for a background sample, so that for a sufficiently large K the modulating coefficient σ(-Kx) → 0 in the foreground sample loss and the modulating coefficient σ(Kx) → 0 in the background sample loss; if a sample is hard to separate, the modulating coefficients of the foreground and background samples satisfy σ(-Kx) → 1 and σ(Kx) → 1 respectively; the attention loss function therefore effectively distinguishes hard and easy samples, and by suppressing the loss of easy samples it makes the RPN learning and training pay more attention to hard samples, ensuring the RPN network is fully trained;
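The binary attention loss can be written directly from the definition above; the value of K is not specified in the text, so K = 10.0 below is only a placeholder:

```python
import torch

def attention_binary_loss(x, target, K=10.0):
    # x: raw logits of the anchors; target: 1.0 for foreground, 0.0 for background.
    # sigma(-Kx) and sigma(Kx) act as modulating coefficients: near 0 for easy
    # samples (suppressed loss), near 1 for hard samples (enhanced loss).
    sig = torch.sigmoid
    eps = 1e-12
    fg = -sig(-K * x) * torch.log(sig(x) + eps)   # foreground term
    bg = -sig(K * x) * torch.log(sig(-x) + eps)   # background term
    return (target * fg + (1.0 - target) * bg).mean()
```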
(2) the attention loss function of the fully connected layer network is as follows:
L = (1/N_cls) Σ L_att_m(x) + λ · (1/N_reg) Σ L_reg,
where δ is the softmax function; the loss function is similar to that of the RPN, comprising a multi-class classification loss and a regression loss, in which the attention multi-class loss
L_att_m(x) = -δ(-Kx)_k · log δ(x)_k
(k being the index of the true class) behaves consistently with the attention binary classification loss: when a sample's prediction loss -log δ(x)_k → 0 (an easy sample), its weight δ(-Kx)_k → 0, and conversely, for a hard sample with a large prediction loss, the weight δ(-Kx)_k → 1;
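A matching sketch of the multi-class variant, with the same caveat that K is an assumed placeholder:

```python
import torch
import torch.nn.functional as F

def attention_multiclass_loss(logits, target, K=10.0):
    # logits: (B, C) class scores; target: (B,) true class indices.
    # softmax(-K * logits) at the true class plays the role of the modulating
    # weight: near 0 for confidently correct (easy) samples, near 1 for hard ones.
    log_p = F.log_softmax(logits, dim=1)
    weight = F.softmax(-K * logits, dim=1)
    idx = target.unsqueeze(1)
    return (-weight.gather(1, idx) * log_p.gather(1, idx)).mean()
```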
step eight, repeating steps two to seven to complete the sample training of the FL-CNN model;
step nine, starting the color camera to photograph the actual traffic scene; the scene is preprocessed before being input to the model, its resolution set to 2048 × 2048, and then input into the FL-CNN model, repeating steps two to six to complete the detection and recognition of traffic signs in the actual scene.
2. The method for detecting and identifying remote traffic signs suitable for use in vehicle-mounted systems according to claim 1, wherein: the first step is specifically as follows:
(1) adopting the Tsinghua-Tencent 100K data set jointly published by Tsinghua University and Tencent, and selecting 44 classes of common traffic signs as the long-distance detection and recognition objects;
(2) dividing the Tsinghua-Tencent 100K data set into a training set and a testing set in the proportion of 1:2;
(3) to ensure sample balance during FL-CNN model training, the scene instances of each traffic sign class in the training set must exceed 100; if the scene instances of a sign class are fewer than 100, they are filled by a repeated sampling method.
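A small sketch of this repeated-sampling step, assuming a hypothetical record format of (scene_id, sign_class) pairs:

```python
import random
from collections import defaultdict

def oversample_scenes(scenes, min_count=100):
    # scenes: iterable of (scene_id, sign_class) records (a hypothetical format).
    # Classes observed in fewer than min_count scenes are padded by repeated
    # sampling of their existing scenes, as required by the claim.
    by_class = defaultdict(list)
    for scene in scenes:
        by_class[scene[1]].append(scene)
    balanced = list(scenes)
    for cls_items in by_class.values():
        shortfall = min_count - len(cls_items)
        if shortfall > 0:
            balanced.extend(random.choices(cls_items, k=shortfall))
    return balanced
```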
3. The method for detecting and identifying remote traffic signs suitable for use in vehicle-mounted systems according to claim 2, wherein: the fourth step is specifically as follows:
(1) on the attention feature map output by the last depthwise separable convolution, each pixel is taken as an anchor point, and 9 anchors are generated on the original traffic scene using 3 aspect ratios (1:1, 1:2 and 2:1) and 3 scales (4, 8 and 16);
(2) removing anchor boxes that exceed the boundaries of the original input image;
(3) removing heavily overlapping anchor boxes by non-maximum suppression with a threshold of 0.7;
(4) determining positive and negative samples from the intersection-over-union IoU between each anchor box and the true target in the sample, where anchors with IoU > 0.7 are positive samples, anchors with IoU < 0.3 are negative samples, and anchors with IoU between 0.3 and 0.7 are removed; IoU is calculated as follows (see the sketch after this list):
IoU(A, B) = area(A ∩ B) / area(A ∪ B);
(5) by translation invariance, each anchor box corresponds to a region proposal box on the attention feature map;
(6) all region proposal boxes pass through the fully connected layer of the region generation network RPN to obtain the target candidate regions.
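Step (4) can be illustrated with a plain-Python sketch of the IoU computation and the anchor labeling rule:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); IoU = area(A ∩ B) / area(A ∪ B).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes):
    # IoU > 0.7 with any ground truth -> positive (1); IoU < 0.3 with all
    # ground truths -> negative (0); otherwise the anchor is removed (-1).
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > 0.7:
        return 1
    if best < 0.3:
        return 0
    return -1
```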