CN113780187A - Traffic sign recognition model training method, traffic sign recognition method and device - Google Patents

Traffic sign recognition model training method, traffic sign recognition method and device Download PDF

Info

Publication number
CN113780187A
CN113780187A (application CN202111071360.7A)
Authority
CN
China
Prior art keywords
traffic sign
sign recognition
loss
recognition model
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111071360.7A
Other languages
Chinese (zh)
Inventor
陈哲
程艳云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202111071360.7A priority Critical patent/CN113780187A/en
Publication of CN113780187A publication Critical patent/CN113780187A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods

Abstract

The invention provides a traffic sign recognition model training method, a traffic sign recognition method and a device. The traffic sign recognition method greatly reduces detection time by exploiting the recognition characteristics of a single-stage detection algorithm in a deep-learning neural network. The training method is based on the FCOS algorithm: it reduces the imbalance between positive and negative samples by introducing the attention mechanism CBAM; it introduces the swish function to avoid a degree of feature loss; and it preserves features at different scales through multi-scale feature fusion. The traffic sign recognition method achieves real-time detection and recognition of traffic signs in real road scenes, and addresses problems in detection such as small targets, natural-environment interference and real-time recognition.

Description

Traffic sign recognition model training method, traffic sign recognition method and device
Technical Field
The invention belongs to the technical field of artificial intelligence, and mainly relates to a traffic sign recognition model training method, a traffic sign recognition method and a traffic sign recognition device.
Background
At present, automatic driving is one of the most promising ways to address the traffic industry's social problems. Traffic sign recognition, an important part of automatic driving, aims mainly to locate and classify the traffic signs encountered while driving and to provide real-time decision support for intelligent traffic systems. Although generic object recognition has achieved good results on the PASCAL VOC and COCO data sets in recent years, traffic signs are smaller than the objects in typical natural scenes, so generic object recognizers can rarely be applied directly to the task of traffic sign recognition. The main problems that current traffic sign recognition technology must overcome are geometric distortion caused by the shooting angle, occlusion, traffic sign deformation, and motion blur caused by high-speed vehicle motion.
In recent years, most traffic sign recognition schemes have been based on convolutional neural networks (CNNs). Unlike hand-crafted features, CNNs can learn general features from a large number of samples without preprocessing, which makes such schemes more robust; the present work is likewise based on convolutional neural networks.
For the problem of detecting and classifying traffic signs in street scenes, a detection framework based on an attention model has been proposed. That method designs the attention model to produce candidate regions of appropriate proportion so that small targets can be better located and classified; although it achieves an F1-measure of 86.8%, its detection speed is only 3.85 FPS. To address the loss of small-object features, some researchers have proposed a cascade-mask-based small-object detection framework, which first discards background regions, from low resolution to high, across multiple resolution images in a cascaded manner, and then detects objects in the foreground regions with Fast R-CNN; its detection speed is only 9.6 FPS.
The above recognition algorithms all adopt a two-stage method. Current CNN-based target recognition divides into two-stage and single-stage target recognition; the main difference between them is the generation of proposal boxes. Because the region proposal network (RPN) performs a screening role, the two-stage method has high accuracy but low speed, cannot satisfy real-time recognition scenarios, and is therefore poorly suited to target recognition in the field of automatic driving.
Nowadays, with feature fusion and continuous optimization of the loss function, single-stage detection can reach accuracy comparable to two-stage methods. Some scholars have applied the single-stage detection algorithms SSD and YOLOv3 to traffic sign detection, reaching F1-measures of 62.8% and 70.7% and detection speeds of 19.23 FPS and 22.22 FPS, respectively: a great speed improvement over two-stage detection, but with limited accuracy. In 2020, an enhanced YOLOv3 (MSA_YOLOv3) was used for traffic sign detection. That method introduces a multi-scale spatial pyramid pooling block and designs a bottom-up enhancement path, achieving accurate target localization by effectively using fine-grained lower-layer features; its F1-measure is 81.8% and its detection speed reaches 23.81 FPS. Although its accuracy improves, it greatly increases the computational complexity of the model.
Disclosure of Invention
The invention aims to provide a traffic sign recognition model training method, a traffic sign recognition method and a device, so as to improve both the accuracy and the speed of traffic sign recognition.
In the traffic sign recognition model training method of the invention, the design emphasizes not only recognition accuracy but also recognition speed. Recognition speed is achieved by effectively streamlining the original algorithm, simplifying the analysis process where necessary, and introducing better-optimized algorithms, all while preserving the correctness of the algorithm; recognition accuracy is achieved by introducing some lightweight modules on top of the original algorithm, improving accuracy without affecting recognition speed.
Compared with generic object recognition, traffic sign recognition involves a large number of small targets, so the FCOS algorithm alone does not detect traffic signs well. The detection method therefore adopts an anchor-free single-stage detection algorithm based on a fully convolutional network (FCN): the algorithm first feeds preprocessed pictures into a backbone network for feature extraction to obtain feature maps, then performs pixel-level regression on the feature maps to train the network and obtain a network model, and finally tests images with the trained network model to obtain the final prediction result.
To reach a recognition level close to that of generic object recognition, the invention makes the following improvements based on the FCOS algorithm: the attention mechanism CBAM (Convolutional Block Attention Module) is introduced into the feature extraction network ResNet-50 to make greater use of effective features and reduce the imbalance between positive and negative samples; the swish function is introduced to avoid a degree of feature loss and improve recognition efficiency; in addition, lightweight multi-scale feature fusion is added to the feature enhancement network to preserve features of different layers, which greatly helps improve detection accuracy and speed. The original FCOS algorithm processes fuzzy samples because the target boxes of the data set it uses overlap; the data set used here does not produce such fuzzy samples, so this processing is deleted, reducing unnecessary computation.
In a first aspect, the invention provides a traffic sign recognition model training method. The model is based on the FCOS algorithm: the attention mechanism CBAM is introduced into the ResNet-50 network to enhance the initial extraction of features; the swish function, which reduces vanishing gradients, is added to the feature network; the initially extracted features then undergo feature enhancement and feature fusion at different scales; finally the features are regressed and classified to obtain the trained model. The training method specifically comprises the following steps:
Step one: input a sample data set; set the positive sample threshold, the number of training epochs and the batch size; initialize with ImageNet pre-training weights; and set the input picture size, where ImageNet is a large visual database for visual object recognition software research;
Step two: use ResNet-50 as the backbone network to extract features from the pictures in the traffic sign data set, preferably the Tsinghua traffic sign database TT100K; add CBAM modules to the first and last layers of the ResNet-50 network and adjust the dimensions of the feature maps to obtain the C3, C4 and C5 feature maps. Because the FCOS algorithm performs single-stage (one-stage) target detection and natural-scene objects in the data set cause interference, severe class imbalance can occur during feature extraction; the attention mechanism module CBAM is therefore introduced into the ResNet-50 network to suppress it. CBAM consists of a channel attention module and a spatial attention module, whose main function is to focus on effective features and suppress ineffective ones; the best performance is achieved by combining the two. In addition, the dimensions of the feature maps are adjusted through the convolution kernel size and stride of each layer, which eases the feature enhancement of the next step; after these operations, the C3, C4 and C5 feature maps are finally obtained;
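To make the CBAM description above concrete, here is a minimal NumPy sketch of sequential channel and spatial attention. The weight matrices, the reduction ratio r, and the use of a plain sigmoid over pooled maps in place of CBAM's 7 × 7 convolution are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """feat: (C, H, W). Shared two-layer MLP on avg- and max-pooled descriptors."""
    avg = feat.mean(axis=(1, 2))          # (C,)
    mx = feat.max(axis=(1, 2))            # (C,)
    # shared MLP: C -> C//r -> C, ReLU in the middle, summed then squashed
    scale = sigmoid(w2 @ np.maximum(w1 @ avg, 0) + w2 @ np.maximum(w1 @ mx, 0))
    return feat * scale[:, None, None]

def spatial_attention(feat):
    """Mean and max over channels; a plain sigmoid stands in for CBAM's 7x7 conv."""
    avg = feat.mean(axis=0)               # (H, W)
    mx = feat.max(axis=0)                 # (H, W)
    attn = sigmoid((avg + mx) / 2.0)
    return feat * attn[None, :, :]

def cbam(feat, w1, w2):
    # channel attention first, then spatial attention, as in CBAM
    return spatial_attention(channel_attention(feat, w1, w2))

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1   # hypothetical weights
w2 = rng.standard_normal((C, C // r)) * 0.1
out = cbam(feat, w1, w2)
print(out.shape)  # (8, 4, 4)
```

In the real module the channel MLP and the spatial convolution are learned jointly with the backbone; the point of the sketch is the order of operations: channel attention first, then spatial attention, each producing a multiplicative gate in (0, 1).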
In the improved ResNet-50 network, the swish activation function replaces the ReLU activation function. The swish function is unbounded above, bounded below, smooth and non-monotonic; it helps mitigate gradient explosion and vanishing gradients, and outperforms the ReLU function in model effect;
Step three: after each layer of features is extracted by ResNet-50, feed the features into the feature pyramid network (FPN) for feature fusion. In the FPN, P3, P4 and P5 are connected top-down through the effective feature layers obtained from the C3, C4 and C5 feature maps via 1 × 1 convolution; the P6 and P7 layers are high-semantic feature layers, generated by applying a convolution layer with stride 2 to P5 and P6 respectively; together these form the feature pyramid network FPN;
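The P3-P7 construction in step three can be sketched as follows. The channel widths, the nearest-neighbour upsampling, and the 2 × 2 average pooling standing in for the stride-2 convolutions are assumptions for illustration:

```python
import numpy as np

def lateral(feat, w):
    """1x1 conv is per-pixel channel mixing: (Cin, H, W) -> (Cout, H, W)."""
    return np.einsum('oc,chw->ohw', w, feat)

def upsample2x(feat):
    return feat.repeat(2, axis=1).repeat(2, axis=2)   # nearest neighbour

def downsample2x(feat):
    # stand-in for a stride-2 conv: 2x2 average pooling
    C, H, W = feat.shape
    return feat.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

rng = np.random.default_rng(1)
D = 8  # FPN channel width (256 in FCOS; small here for the sketch)
# hypothetical backbone maps at strides 8 / 16 / 32
C3, C4, C5 = (rng.standard_normal((16, s, s)) for s in (32, 16, 8))
w = {k: rng.standard_normal((D, 16)) * 0.1 for k in ('c3', 'c4', 'c5')}

P5 = lateral(C5, w['c5'])
P4 = lateral(C4, w['c4']) + upsample2x(P5)   # top-down pathway
P3 = lateral(C3, w['c3']) + upsample2x(P4)
P6 = downsample2x(P5)                        # stride-2 conv on P5 in the paper
P7 = downsample2x(P6)                        # stride-2 conv on P6 in the paper
print([p.shape for p in (P3, P4, P5, P6, P7)])
```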
Step four: perform multi-level feature fusion on the network of step three, where Pi' (i = 3, 4, 5, 6, 7) is obtained by adjusting the number of channels through 1 × 1 convolution and then up-sampling and down-sampling each layer;
Step five: generate samples from the feature information obtained in step four, and classify and regress the samples.
After the operations of the first step to the fifth step, the traffic sign recognition model based on the FCOS algorithm can be obtained.
Further, FCOS performs a regression operation on the feature points of each feature map, predicting four values (l, t, r, b) that respectively represent the distances from a feature point to the left, top, right and bottom boundaries of a GT (Ground Truth) box. If, after a feature point is mapped back to the original image, it corresponds to several GT boxes, its category cannot be judged accurately; such a feature point belongs to the fuzzy samples. Because low-quality prediction boxes may appear far from the target center point, the model also introduces center-ness to suppress them and enhance the accuracy of center-point selection. The loss function consists of the classification loss (focal loss), the regression loss (IoU loss) and the center-ness loss (BCE). In post-processing, low-score bounding boxes are finally filtered out by non-maximum suppression (NMS).
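The per-location regression targets (l, t, r, b) described above reduce to simple coordinate differences; the helper below is a hypothetical illustration of that mapping and of the positive-sample test:

```python
def fcos_targets(px, py, box):
    """Distances from a feature-map location (px, py), mapped to image
    coordinates, to the four sides of a ground-truth box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    l, t = px - x1, py - y1
    r, b = x2 - px, y2 - py
    return l, t, r, b

def is_positive(targets):
    # a location is a positive sample only if it falls inside the box,
    # i.e. all four distances are positive
    return min(targets) > 0

l, t, r, b = fcos_targets(60, 40, (20, 10, 100, 90))
print(l, t, r, b)  # 40 30 40 50
```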
In a second aspect, the present invention provides a traffic sign recognition method, in which the acquired road traffic picture is fed into the trained model for prediction to obtain the final recognition result.
In a third aspect, the present invention further provides a traffic sign recognition apparatus, including:
a target image acquisition unit for acquiring a target image;
and the detection unit is used for processing the target image based on the trained model to obtain a traffic sign recognition result of the target image.
Advantageous effects:
the scheme of the invention provides a traffic sign identification method based on an FCOS algorithm, which aims to solve the problems of small identification object target, natural environment interference, identification instantaneity and the like existing in the identification of traffic signs in roads and improve the identification accuracy and instantaneity simultaneously, and the model and the method mainly have the following advantages:
1. In terms of real-time performance, the model adopts single-stage detection; compared with two-stage recognition, single-stage detection has no candidate-region feature extraction process and extracts features only once, so recognition is fast;
2. In terms of accuracy, the CBAM attention mechanism is introduced to attend to effective features and suppress ineffective ones, addressing the large-scale class imbalance in recognition; multi-scale feature fusion is added to enhance high-level semantic features and reduce the loss of feature information;
3. In terms of feasibility, the method puts the picture into a convolutional network, extracts and fuses features to obtain feature layers of different scales, then performs classification and regression on the obtained features to produce the recognition result. The experimental environment required by the recognizer is two NVIDIA GeForce RTX 2080 Ti graphics cards running Ubuntu 18.04, together with CUDA 10.1 and cuDNN 7.6.3, which is readily attainable.
Drawings
FIG. 1 is a flow chart of a method for training a traffic sign recognition model in an embodiment of the present invention;
FIG. 2 is a diagram of a traffic sign recognition model architecture in accordance with an embodiment of the present invention;
fig. 3 is a diagram illustrating the recognition effect of a traffic sign according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example one
A real-time traffic sign recognition model training method based on FCOS algorithm improvement is shown in figure 1, and an overall model network structure is shown in figure 2, and specifically comprises the following steps:
Step one: initialization sets the positive sample threshold to 0.3, the number of training epochs to 100 and the batch size to 8, initializes with ImageNet pre-training weights, and sets the short edge of the input picture to 800 pixels with the long edge at most 1333 pixels.
Step two: the ResNet-50 is used as a main network to extract the characteristics of pictures in the TT100K data set (the Qinghua traffic sign database), CABM modules are added into the first layer and the last layer of the ResNet-50 network, and after the operations, characteristic graphs of C3, C4 and C5 are finally obtained;
In the improved ResNet-50 network, the ReLU activation function is replaced with the swish activation function, which is shown in formula (1).
f(x)=x·σ(βx) (1)
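Formula (1) can be checked numerically with a one-line implementation (β = 1 assumed, the common default):

```python
import math

def swish(x, beta=1.0):
    """f(x) = x * sigmoid(beta * x); smooth, non-monotonic, bounded below."""
    return x / (1.0 + math.exp(-beta * x))

print(round(swish(1.0), 4))   # 0.7311
print(swish(0.0))             # 0.0
```

Unlike ReLU, swish passes a small negative signal for negative inputs (e.g. swish(-1) is about -0.27) rather than zeroing it, which is the feature-loss point made above.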
Step three: feature enhancement network. After each layer of features is extracted by ResNet-50, the features are fed into the feature pyramid network (FPN) for feature fusion: P3, P4 and P5 in the FPN are connected top-down through the effective feature layers obtained from the C3, C4 and C5 feature maps via 1 × 1 convolution, while the P6 and P7 layers are high-semantic feature layers generated by applying a convolution layer with stride 2 to P5 and P6 respectively, forming the feature pyramid network FPN.
Step four: perform multi-level feature fusion on the network of step three, where Pi' (i = 3, 4, 5, 6, 7) is obtained by applying a 1 × 1 convolution to Pi (i = 3, 4, 5, 6, 7) to adjust the number of channels and then up-sampling and down-sampling each layer, as in formulas (2)-(5):
[Formulas (2)-(5), which define the up-sampling and down-sampling fusion of each Pi' level, appear only as images in the source document.]
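Formulas (2)-(5) appear only as images in the source, so their exact form cannot be reproduced here. The following is a hedged sketch of the general idea of fusing each pyramid level with its resampled neighbouring levels, assuming elementwise addition and nearest-neighbour resizing (both assumptions of this sketch, not the patent's formulas):

```python
import numpy as np

def resize_to(feat, H, W):
    """Nearest-neighbour resize of a (C, h, w) map to (C, H, W)."""
    C, h, w = feat.shape
    ri = np.arange(H) * h // H
    ci = np.arange(W) * w // W
    return feat[:, ri][:, :, ci]

def fuse_levels(pyramid):
    """Each output level = its own map plus the resized adjacent levels."""
    fused = []
    for i, p in enumerate(pyramid):
        C, H, W = p.shape
        acc = p.copy()
        for j in (i - 1, i + 1):          # fuse with neighbours only
            if 0 <= j < len(pyramid):
                acc += resize_to(pyramid[j], H, W)
        fused.append(acc)
    return fused

rng = np.random.default_rng(2)
pyr = [rng.standard_normal((8, s, s)) for s in (32, 16, 8, 4, 2)]  # P3'..P7'
out = fuse_levels(pyr)
print([o.shape for o in out])
```

The sketch preserves each level's spatial size, which matches the role stated in step four: enriching every Pi' with information from other scales without changing the pyramid's shape.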
Step five: classification and regression. Samples are generated from the feature information obtained in step four and then classified and regressed, with the loss function given by formula (6):
L({p_{x,y}}, {t_{x,y}}) = (1/N_pos) Σ_{x,y} L_cls(p_{x,y}, c*_{x,y}) + (λ/N_pos) Σ_{x,y} 1_{c*_{x,y} > 0} L_reg(t_{x,y}, t*_{x,y})    (6)
wherein the classification loss L_cls is the focal loss and the regression loss L_reg is the IoU loss of UnitBox (an advanced object detection network). N_pos represents the number of positive samples; λ is the balance weight between the classification and regression losses and is set to 1; the summation is performed over the entire feature map; and 1_{c*_{x,y} > 0} is the indicator function, equal to 1 if c*_{x,y} > 0 and 0 otherwise.
Among the loss functions, the focal loss is obtained by adding a modulation factor (1 - p_t)^γ to the standard cross-entropy loss, whose formula is as follows (7):

CE(p, y) = -log(p) if y = 1; -log(1 - p) otherwise    (7)
wherein y represents the label of the sample (y = 1 for a positive sample and y = 0 for a negative sample) and p ∈ [0,1] represents the predicted probability of the class y = 1, with p_t = p when y = 1 and p_t = 1 - p otherwise. The focal loss function is therefore as in formula (8):
FL(p_t) = -α_t (1 - p_t)^γ log(p_t)    (8)
where γ ≥ 0 is the modulation parameter and α is the balance variable; here γ = 2 and α = 0.25.
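The focal loss of formula (8) can be sketched for a single prediction as follows; the convention α_t = α for positives and 1 - α for negatives follows the original focal loss formulation:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t), single prediction."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p, y):
    return -math.log(p if y == 1 else 1.0 - p)

# a well-classified easy positive contributes far less than under plain CE,
# which is how the focal loss counteracts class imbalance
print(focal_loss(0.9, 1), cross_entropy(0.9, 1))
```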
To suppress low-quality samples, a center-ness term, ranging over [0,1], is added to the loss, as shown in formula (9):

centerness* = sqrt( (min(l*, r*) / max(l*, r*)) × (min(t*, b*) / max(t*, b*)) )    (9)
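Formula (9) translates directly into code; the argument order (l, t, r, b) is an assumption of this sketch:

```python
import math

def centerness(l, t, r, b):
    """sqrt( min(l,r)/max(l,r) * min(t,b)/max(t,b) ), always in [0, 1]."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

print(centerness(40, 50, 40, 50))  # 1.0 at the exact centre of the box
print(round(centerness(10, 30, 70, 20), 4))
```

The value is 1 only when the location sits at the box centre and decays toward 0 near the edges, which is why multiplying it into the score suppresses low-quality boxes far from the centre.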
Low-score bounding boxes are finally filtered out by non-maximum suppression (NMS); after the above five steps, the trained model is obtained.
Example two
A traffic sign recognition method: the pictures to be recognized are fed into the model trained above for prediction to obtain the final recognition result.
In order to verify the prediction effect of the method provided by the invention, four different road scene pictures are randomly selected for identification, the identification effect picture is shown in fig. 3, and the identification time of each picture is 41 ms.
Experiments show that, without affecting recognition speed, traffic sign recognition with the model of the invention compares favorably with the existing ResNet-50 network model taken as a comparative example, specifically the model in He K, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition [C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778. Compared with this comparative example, the accuracy of the model improves by 2.3 F1-measure while the number of frames transmitted per second (FPS) is the same, as shown in the following table:
[Table: accuracy (F1-measure) and FPS comparison with the ResNet-50 baseline; rendered only as an image in the source document.]
to verify the recognition effect on speed, the present invention performs the same data set Detection as the existing fast-RCNN and SSD, and fast-RCNN (ref: Ren S, He K, Girshick R, et al. fast R-CNN: todards read-Time Object Detection with Region pro-active Networks [ J ]. IEEE Trans Pattern Engine inner, 2017,39(6):1137 and 1149.) and SSD (ref: Liu W, Anguelov D, et al. SSD: Single housing detector [ C ]// European con on computer vision. Springer, Cham,2016:21-37.) reach 8.33FPS and 19.23FPS, respectively, and FPS 19.63.63-3583% F, respectively, on speed, 84% and 84% respectively.
EXAMPLE III
The invention also provides a traffic sign recognition device, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein when the processor executes the program, the steps of the method are realized.
The specific examples described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made or substituted in a similar manner to the specific embodiments described herein by those skilled in the art without departing from the spirit of the invention or exceeding the scope thereof as defined in the appended claims.
It should be appreciated by those skilled in the art that the embodiments of the present invention may be provided as a system or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The description has been presented with reference to flowchart illustrations and/or block diagrams of systems, apparatuses (systems), and computer program products according to embodiments of the description. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present specification have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all changes and modifications that fall within the scope of the specification.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present specification without departing from the spirit and scope of the specification. Thus, if such modifications and variations of the present specification fall within the scope of the claims of the present specification and their equivalents, the specification is intended to include such modifications and variations.

Claims (8)

1. A traffic sign recognition model training method is characterized by comprising the following steps:
Step one: acquiring a sample data set, wherein the sample data set comprises a plurality of training images, and the training images are marked with actual bounding boxes and actual categories;
step two: based on the FCOS algorithm, adopting ResNet-50 as the feature extraction backbone network, adding CBAM modules in the first layer and the last layer of the ResNet-50 network, adopting the swish activation function to form an improved ResNet-50 network, and using the improved ResNet-50 network to extract features from the pictures in the data set;
step three: putting each layer of features extracted by ResNet-50 in the second step into a feature pyramid network FPN for feature fusion reinforcement;
step four: performing multi-level feature fusion on the features subjected to the feature fusion reinforcement in the third step;
step five: generating a sample through the feature information obtained in the step four, and finally performing regression and classification on the features to obtain a training model.
2. The training method of traffic sign recognition model according to claim 1, wherein the second step is preceded by an initialization operation, the initialization operation comprising: the positive sample threshold, the trained epoch, and the batch size are set and initialized with ImageNet pre-training weights.
3. The method for training a traffic sign recognition model according to claim 1, wherein in the fifth step, the loss function used in the classification and regression is:
L({p_{x,y}}, {t_{x,y}}) = (1/N_pos) Σ_{x,y} L_cls(p_{x,y}, c*_{x,y}) + (λ/N_pos) Σ_{x,y} 1_{c*_{x,y} > 0} L_reg(t_{x,y}, t*_{x,y})

wherein the classification loss L_cls is the focal loss and the regression loss L_reg is the IoU loss of UnitBox; N_pos represents the number of positive samples; λ is the balance weight, used to balance the classification and regression losses, set to 1; and 1_{c*_{x,y} > 0} represents the indicator function.
4. The method of claim 3, wherein the focal loss function is obtained by adding a modulation factor (1 - p_t)^γ to the standard cross-entropy loss function, wherein the standard cross-entropy loss formula is as follows:

CE(p, y) = -log(p) if y = 1; -log(1 - p) otherwise
wherein y represents the label of the sample and p ∈ [0,1] represents the predicted probability of the class y = 1, with p_t = p when y = 1 and p_t = 1 - p otherwise; the formula of the focal loss function is as follows:
FL(p_t) = -α_t (1 - p_t)^γ log(p_t)
wherein γ is the modulation parameter, γ ≥ 0, and α is the balance variable.
5. The method of claim 4, wherein γ is 2 and α is 0.25.
6. The method of claim 4, wherein a center-ness term, in the range [0,1], is added to the focal loss, with the formula:

centerness* = sqrt( (min(l*, r*) / max(l*, r*)) × (min(t*, b*) / max(t*, b*)) )
7. a traffic sign recognition method, comprising:
acquiring a target image by using a camera, and processing the target image based on the recognition model obtained by the traffic sign recognition model training method according to any one of claims 1 to 6 to obtain a traffic sign recognition result of the target image.
8. A traffic sign recognition apparatus, comprising:
a target image acquisition unit for acquiring a target image;
a detecting unit, configured to process the target image based on the recognition model obtained by the traffic sign recognition model training method according to any one of claims 1 to 6, so as to obtain a traffic sign recognition result of the target image.
CN202111071360.7A 2021-09-13 2021-09-13 Traffic sign recognition model training method, traffic sign recognition method and device Pending CN113780187A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111071360.7A CN113780187A (en) 2021-09-13 2021-09-13 Traffic sign recognition model training method, traffic sign recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111071360.7A CN113780187A (en) 2021-09-13 2021-09-13 Traffic sign recognition model training method, traffic sign recognition method and device

Publications (1)

Publication Number Publication Date
CN113780187A true CN113780187A (en) 2021-12-10

Family

ID=78843214


Country Status (1)

Country Link
CN (1) CN113780187A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463772A (en) * 2022-01-13 2022-05-10 苏州大学 Deep learning-based traffic sign detection and identification method and system
CN115546768A (en) * 2022-12-01 2022-12-30 四川蜀道新能源科技发展有限公司 Pavement marking identification method and system based on multi-scale mechanism and attention mechanism

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814704A (en) * 2020-07-14 2020-10-23 陕西师范大学 Full convolution examination room target detection method based on cascade attention and point supervision mechanism
CN112395951A (en) * 2020-10-23 2021-02-23 中国地质大学(武汉) Complex scene-oriented domain-adaptive traffic target detection and identification method
CN112686304A (en) * 2020-12-29 2021-04-20 山东大学 Target detection method and device based on attention mechanism and multi-scale feature fusion and storage medium
CN112837330A (en) * 2021-03-02 2021-05-25 中国农业大学 Leaf segmentation method based on multi-scale double attention mechanism and full convolution neural network
WO2021139069A1 (en) * 2020-01-09 2021-07-15 南京信息工程大学 General target detection method for adaptive attention guidance mechanism


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
言有三: "Deep Learning Face Image Processing: Core Algorithms and Practical Cases", 31 July 2020, China Machine Press, pages 34-36 *


Similar Documents

Publication Publication Date Title
CN111814621B (en) Attention mechanism-based multi-scale vehicle pedestrian detection method and device
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN113780187A (en) Traffic sign recognition model training method, traffic sign recognition method and device
CN113569882A (en) Knowledge distillation-based rapid pedestrian detection method
CN114092917B (en) MR-SSD-based shielded traffic sign detection method and system
CN111626200A (en) Multi-scale target detection network and traffic identification detection method based on Libra R-CNN
CN114202743A (en) Improved fast-RCNN-based small target detection method in automatic driving scene
CN115272987A MSA-YOLOv5-based vehicle detection method and device in severe weather
CN113297959A (en) Target tracking method and system based on corner attention twin network
CN111931572B (en) Target detection method for remote sensing image
CN112347967B (en) Pedestrian detection method fusing motion information in complex scene
CN113901924A (en) Document table detection method and device
CN111339950A (en) Remote sensing image target detection method
CN116597411A (en) Method and system for identifying traffic sign by unmanned vehicle in extreme weather
CN114092818B (en) Semantic segmentation method and device, electronic equipment and storage medium
CN116309270A (en) Binocular image-based transmission line typical defect identification method
CN114332754A (en) Cascade R-CNN pedestrian detection method based on multi-metric detector
CN114972711A (en) Improved weak supervision target detection method based on semantic information candidate box
CN112802026A (en) Deep learning-based real-time traffic scene semantic segmentation method
CN113343749A (en) Fruit identification method and system based on D2Det model
Hong et al. Improved SSD model for pedestrian detection in natural scene
Brillantes et al. Philippine license plate detection and classification using faster R-CNN and feature pyramid network
Liu et al. The Research on Traffic Sign Recognition Algorithm Based on Improved YOLOv5 Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination