CN116597411A - Method and system for identifying traffic sign by unmanned vehicle in extreme weather - Google Patents

Method and system for identifying traffic sign by unmanned vehicle in extreme weather

Info

Publication number
CN116597411A
Authority
CN
China
Prior art keywords
layer
output end
traffic sign
input end
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310453945.8A
Other languages
Chinese (zh)
Inventor
吴晓明
尹训嘉
刘祥志
邱文科
裴加彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Shanke Intelligent Technology Co ltd
Qilu University of Technology
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Shanke Intelligent Technology Co ltd
Qilu University of Technology
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Shanke Intelligent Technology Co ltd, Qilu University of Technology, Shandong Computer Science Center National Super Computing Center in Jinan filed Critical Shandong Shanke Intelligent Technology Co ltd
Priority to CN202310453945.8A priority Critical patent/CN116597411A/en
Publication of CN116597411A publication Critical patent/CN116597411A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of traffic signs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a method and a system for identifying traffic signs by unmanned vehicles in extreme weather. The method comprises: in extreme weather, acquiring a traffic sign image to be identified, where extreme weather refers to rain/snow weather or haze weather; for rain and snow weather, performing rain-and-snow-removal preprocessing on the traffic sign image to be identified; for haze weather, performing defogging preprocessing on the traffic sign image to be identified; and identifying the preprocessed image with a trained traffic sign recognition model to obtain the recognized traffic sign.

Description

Method and system for identifying traffic sign by unmanned vehicle in extreme weather
Technical Field
The invention relates to the technical field of image recognition, in particular to a method and a system for recognizing traffic signs by unmanned vehicles in extreme weather.
Background
The statements in this section merely relate to the background of the present disclosure and may not necessarily constitute prior art.
With the intelligent development of urban traffic, unmanned intelligent vehicles have gradually become a major target of development and research. Their progress depends on rapid advances in traffic sign detection and recognition, which will allow intelligent vehicles to become widespread in the near future. Unmanned traffic sign recognition means obtaining road scene images with a vehicle-mounted camera and recognizing the road signs and their semantics in those images; it is a key basis for an unmanned vehicle's judgment of current road indications. How to make a vehicle recognize road signs automatically and accurately is therefore of significant research value. In image processing, image quality is one of the key influencing factors, yet during acquisition the sensing device is inevitably occluded to varying degrees under extreme weather interference, so the acquired images often fall short of requirements and cause many difficulties for subsequent data analysis.
Chinese patent document CN114140766A discloses a method and apparatus for improved YOLOv3 traffic sign recognition, which identifies traffic signs using a trained YOLOv3 network model. Data enhancement of the images improves the model's generalization while effectively preventing overfitting, and a channel attention mechanism introduced into the deep network improves both the accuracy and efficiency of traffic sign recognition; however, its detection efficiency on real-environment roads remains low.
Chinese patent document CN114998866A discloses a traffic sign recognition method based on improved YOLOv4. The improved YOLOv4 network improves detection precision for small traffic sign targets while maintaining detection speed, but its recognition rate is low under extreme weather interference.
Chinese patent document CN114821519A discloses a traffic sign recognition method and system based on coordinate attention, which balances detection speed against recognition accuracy, improves the network's feature extraction ability, and improves detection of occluded and small targets, but adapts poorly to complex road scenes.
The methods in these patent documents place high demands on experimental data and share a common shortcoming: there is considerable room for improvement in handling traffic signs under extreme weather interference. Natural weather, illumination angle, the complex background of real roads, and similar factors all degrade the recognition accuracy of small traffic sign targets in object detection, leading to low-accuracy recognition results.
The inventors found that when current mainstream feature extractors downsample feature maps captured in real time for traffic sign recognition in extreme weather, they reduce spatial redundancy and learn features of high-dimensional traffic sign targets, but eliminate the representations of tiny traffic sign targets, so the recognition rate for such targets is low. Because the features of small traffic sign targets are easily disturbed by extreme weather, the detection network struggles to acquire key semantic information, and recognition accuracy suffers.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a method and a system for identifying traffic signs by unmanned vehicles in extreme weather. The invention addresses the technical problems of the prior art: in real road scenes the environment is complex, and illumination, occlusion, and small hard-to-detect targets lead to low recognition accuracy, while high driving speeds demand rapid recognition that existing methods cannot deliver. Through a traffic sign recognition model, the invention accurately recognizes small traffic sign targets, balances detection speed against recognition accuracy, improves the network's feature extraction ability, improves detection of occluded and small targets, and detects road traffic signs in real time in real scenes.
In a first aspect, the present invention provides a method for an unmanned vehicle to identify traffic signs in extreme weather;
a method for an unmanned vehicle to identify traffic signs in extreme weather, comprising:
in extreme weather, acquiring a traffic sign image to be identified, where extreme weather refers to rain/snow weather or haze weather;
for rain and snow weather, performing rain-and-snow-removal preprocessing on the traffic sign image to be identified; for haze weather, performing defogging preprocessing on the traffic sign image to be identified;
and identifying the preprocessed image with a trained traffic sign recognition model to obtain the recognized traffic sign.
In a second aspect, the present invention provides a system for identifying traffic signs for unmanned vehicles in extreme weather;
a system for identifying traffic signs for unmanned vehicles in extreme weather, comprising:
an acquisition module configured to: in extreme weather, acquire a traffic sign image to be identified, where extreme weather refers to rain/snow weather or haze weather;
a preprocessing module configured to: for rain and snow weather, perform rain-and-snow-removal preprocessing on the traffic sign image to be identified; for haze weather, perform defogging preprocessing on the traffic sign image to be identified;
an identification module configured to: identify the preprocessed image with a trained traffic sign recognition model to obtain the recognized traffic sign.
In a third aspect, the present invention also provides an electronic device, including:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer-readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of the first aspect described above.
In a fourth aspect, the invention also provides a storage medium storing non-transitory computer readable instructions, wherein the instructions of the method of the first aspect are performed when the non-transitory computer readable instructions are executed by a computer.
In a fifth aspect, the invention also provides a computer program product comprising a computer program for implementing the method of the first aspect described above when run on one or more processors.
Compared with the prior art, the invention has the beneficial effects that:
the invention adopts a coordinate attention mechanism, and can effectively capture the relation between the position information and the channel by adopting the coordinate attention, more accurately acquire the region of interest and weaken the interference of extreme weather environment, thereby improving the characteristic weight of the positive sample.
The invention adopts SPD-Conv, adds the new convolutional neural network module, well solves the problems of loss of fine granularity information and learning of low-efficiency characteristic representation caused by convolutional step length and each pooling layer.
According to the invention, a weighted bidirectional feature pyramid network is adopted, more features are fused under the condition of not increasing the cost, the feature extraction capability of the network is improved, and the detection effect of the small-target traffic sign under the influence of extreme weather is improved, so that the feature extraction capability of the network is improved, and the detection effect of the small-target traffic sign target is improved.
The invention uses the double decoupling heads to accelerate the network convergence speed and improve the precision of the micro traffic sign target in extreme weather.
The invention adopts the loss function SIoU, and improves the speed and the reasoning accuracy of the traffic sign recognition network training in extreme weather.
According to the invention, the C3 module is replaced by the light-weight C2f module, so that the light-weight C2f module is ensured, and meanwhile, richer gradient flow information is obtained, so that the traffic sign recognition model is lighter and easier to deploy on an unmanned vehicle.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flow chart of a method according to a first embodiment;
FIG. 2 is a traffic sign recognition model according to the first embodiment;
FIG. 3 is a diagram of SPD-Conv when scale is 2 according to the first embodiment;
FIG. 4 is a schematic diagram showing the internal structures of the CBS module and the C2f module according to the first embodiment;
FIG. 5 is a schematic diagram of an internal structure of an SPPF spatial pyramid pooling feature module according to the first embodiment;
FIG. 6 is a schematic diagram of an internal structure of a Coordinate Attention coordinate attention mechanism according to the first embodiment;
fig. 7 is a schematic diagram of the internal structure of the Decoupled Head dual-decoupling detection head according to the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for describing particular embodiments only and is not intended to limit exemplary embodiments according to the present invention. As used herein, unless the context clearly indicates otherwise, singular forms are intended to include plural forms. Furthermore, the terms "comprises" and "comprising" and any variations thereof are intended to cover non-exclusive inclusion: processes, methods, systems, products, or devices that comprise a series of steps or units are not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such processes, methods, products, or devices.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
All data acquisition in the embodiment is legal application of the data on the basis of meeting laws and regulations and agreements of users.
Example 1
The embodiment provides a method for identifying traffic signs by unmanned vehicles in extreme weather;
as shown in fig. 1, a method for identifying traffic signs by unmanned vehicles in extreme weather includes:
S101: in extreme weather, acquiring a traffic sign image to be identified; extreme weather refers to rain/snow weather or haze weather;
S102: for rain and snow weather, performing rain-and-snow-removal preprocessing on the traffic sign image to be identified; for haze weather, performing defogging preprocessing on the traffic sign image to be identified;
S103: identifying the preprocessed image with a trained traffic sign recognition model to obtain the recognized traffic sign.
Further, in step S101, the traffic sign image to be identified is acquired with the unmanned vehicle's high-definition camera.
Further, in step S102, the rain-and-snow-removal preprocessing of the traffic sign image to be identified is implemented with the Multi-Stage Progressive Image Restoration Network (MPR-Net).
It will be appreciated that using the multi-stage progressive restoration network for preprocessing rain and snow images lets the restoration balance spatial detail against high-level contextual information.
Further, in step S102, the defogging preprocessing of the traffic sign image to be identified is implemented with the Feature Fusion Attention Network (FFA-Net).
It will be appreciated that FFA-Net is used for preprocessing hazy images; it performs image defogging in an end-to-end manner.
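For illustration, the weather-conditional preprocessing step can be sketched in PyTorch as follows. This is a minimal sketch, assuming pretrained MPR-Net and FFA-Net restoration models are available as torch.nn.Module objects; the function and argument names are illustrative, not taken from the patent.

```python
import torch

def preprocess(image: torch.Tensor, weather: str,
               mprnet: torch.nn.Module, ffanet: torch.nn.Module) -> torch.Tensor:
    """Dispatch the captured frame to the matching restoration network.

    `image` is a (1, 3, H, W) tensor; `weather` is 'rain_snow' or 'haze'.
    `mprnet` / `ffanet` are assumed to be pretrained restoration models.
    """
    with torch.no_grad():
        if weather == "rain_snow":
            return mprnet(image)   # rain/snow removal (MPR-Net)
        elif weather == "haze":
            return ffanet(image)   # end-to-end defogging (FFA-Net)
        return image               # clear weather: pass through unchanged
```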
Further, in step S103, the preprocessed image is identified with a trained traffic sign recognition model to obtain the recognized traffic sign. The network structure of the trained traffic sign recognition model comprises:
a backbone layer, a neck layer, an attention layer, and a prediction layer, connected in sequence;
the backbone layer extracts features of the preprocessed image; the neck layer performs multi-scale feature fusion on the feature maps; the attention layer focuses on local information; and the prediction layer performs the final regression prediction.
Further, the backbone layer includes: the system comprises a Focus convolutional neural network layer, a CBS1 layer, a convolutional layer SPD-Conv1 layer, a C2f1 layer, a CBS2 layer, a convolutional layer SPD-Conv2 layer, a C2f2 layer, a CBS3 layer, a convolutional layer SPD-Conv3 layer, a C2f3 layer, a CBS4 layer, a convolutional layer SPD-Conv4 layer and a C2f4 layer which are sequentially connected.
The Focus convolutional neural network layer reduces a large input feature map to a small feature map to improve the model's speed and efficiency. It scales the feature map using strided convolution and dilated convolution.
The internal structure of the Focus convolutional neural network layer comprises two strided convolution layers and a dilated convolution layer connected in sequence. First, a strided convolution layer reduces the large input feature map to half size, and a second strided convolution layer further reduces it to one quarter. Finally, a dilated convolution layer extracts features from the reduced feature map.
The working principle of the Focus layer is that shrinking the large input feature map into a small one reduces the model's parameter count and computation, improving speed and efficiency. At the same time, the strided and dilated convolutions preserve the receptive field and the feature extraction ability on the shrunken map, so detection precision is not affected. The Focus layer is designed to accommodate many different input and target sizes, improving the model's generality and robustness.
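A minimal PyTorch sketch of a Focus layer matching the above description (two stride-2 convolutions followed by a dilated convolution) might look like this; channel widths and kernel sizes are illustrative assumptions, not the patent's settings.

```python
import torch.nn as nn

class Focus(nn.Module):
    """Focus layer as described above: two stride-2 convolutions shrink the
    feature map to 1/4 of its input size, then a dilated convolution keeps a
    large receptive field on the shrunken map."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.down1 = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)   # 1/2 size
        self.down2 = nn.Conv2d(c_out, c_out, 3, stride=2, padding=1)  # 1/4 size
        self.dilated = nn.Conv2d(c_out, c_out, 3, padding=2, dilation=2)

    def forward(self, x):
        return self.dilated(self.down2(self.down1(x)))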
The CBS layer is a convolution block consisting of compression and expansion convolutions; it compresses and expands the model's feature maps, improving the model's receptive field and feature expression ability.
Further, the convolution layers SPD-Conv1, SPD-Conv2, SPD-Conv3, and SPD-Conv4 share the same internal structure, each implemented as an SPD-Conv convolution layer.
Further, as shown in fig. 3, the convolutional layer SPD-Conv1 includes: space-to-depth (SPD) layers and non-convolution step size (non strided convolution, conv) layers.
Space-to-depth (SPD) layer: given any (original) feature map X, the slicing operation produces the sub-feature map sequence f_{x,y}, where each f_{x,y} is composed of all entries X(i, j) for which i + x and j + y are divisible by the scale, so each sub-map downsamples X by the scale factor. Fig. 3 gives an example: when scale = 2, we obtain four sub-maps f_{0,0}, f_{0,1}, f_{1,0}, f_{1,1}, each of shape (S/2, S/2, C), each downsampling X by a factor of 2. The sub-feature maps are then concatenated along the channel dimension to obtain a feature map X' whose spatial dimensions are reduced by the scale factor and whose channel dimension is increased by a factor of scale².
Further, the working process of the convolution layer SPD-Conv1 comprises: applying the space-to-depth (SPD) rearrangement to the input feature map, and then applying a non-strided convolution to the resulting feature map.
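The SPD-Conv operation described above can be sketched as follows; this is an illustrative PyTorch rendering, not the patent's code, with the non-strided convolution's kernel size assumed to be 3×3.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided convolution (SPD-Conv).

    The SPD step rearranges each scale x scale spatial block into the channel
    dimension, so downsampling loses no information; the 3x3 stride-1
    convolution then mixes the scale**2 * c_in channels down to c_out."""
    def __init__(self, c_in: int, c_out: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(c_in * scale * scale, c_out, 3, stride=1, padding=1)

    def forward(self, x):
        s = self.scale
        # gather the s*s sub-maps f_{x,y} and stack them along channels
        subs = [x[..., i::s, j::s] for i in range(s) for j in range(s)]
        return self.conv(torch.cat(subs, dim=1))
```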
Further, the C2f1 layer, the C2f2 layer, the C2f3 layer, and the C2f4 layer share the same internal structure.
Further, as shown in fig. 4, the C2f1 layer includes:
the CBS5 layer, the Split network layer, the first Bottleneck residual block structure layer, the series splicing layer L1 and the CBS6 layer are sequentially connected;
the output end of the first Bottleneck residual block structure layer is connected with the input end of the second Bottleneck residual block structure layer;
the output end of the second Bottleneck residual block structure layer is connected with the input end of the third Bottleneck residual block structure layer;
the output end of the second Bottleneck residual block structure layer and the output end of the third Bottleneck residual block structure layer are connected with the input end of the series splicing layer L1; the output end of the Split network layer is connected with the input end of the series splicing layer L1.
Further, the Split network layer divides the input feature map into two branches: one branch is output directly without any operation, and the other is output after transformation. This improves the model's diversity and robustness and alleviates the vanishing-gradient problem, improving training. At the same time, the Split network layer enables cross-layer feature fusion, so the context information of the object is captured better.
Further, the first, second, and third Bottleneck residual block structure layers share the same internal structure; the first Bottleneck residual block structure layer comprises a 1×1 convolution layer, a 3×3 convolution layer, and a 1×1 convolution layer connected in sequence.
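For illustration, the C2f layer described above can be sketched in PyTorch as below. The sketch follows the split-then-chained-Bottleneck layout with all intermediate outputs concatenated; the exact channel bookkeeping and the number of Bottleneck blocks are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the CBS block described in this embodiment."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Residual block of 1x1, 3x3, 1x1 convolutions with a skip connection."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(CBS(c, c, 1), CBS(c, c, 3), CBS(c, c, 1))

    def forward(self, x):
        return x + self.block(x)

class C2f(nn.Module):
    """C2f sketch: CBS -> Split -> chained Bottlenecks, with the split branch
    and all intermediate outputs concatenated before the final CBS."""
    def __init__(self, c_in, c_out, n=3):
        super().__init__()
        self.cv1 = CBS(c_in, c_out, 1)
        self.blocks = nn.ModuleList(Bottleneck(c_out // 2) for _ in range(n))
        self.cv2 = CBS((2 + n) * (c_out // 2), c_out, 1)

    def forward(self, x):
        a, b = self.cv1(x).chunk(2, dim=1)   # Split layer: two branches
        ys = [a, b]
        for m in self.blocks:
            ys.append(m(ys[-1]))             # chain the Bottleneck blocks
        return self.cv2(torch.cat(ys, dim=1))
```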
Further, as shown in fig. 5, the SPPF spatial pyramid pooling feature layer includes:
the CBS7 layer, the first maximum pooling layer, the second maximum pooling layer, the third maximum pooling layer, the series splicing layer L2 and the CBS8 layer are sequentially connected; the CBS7 layer is used as an input end of the SPPF layer, and the CBS8 layer is used as an output end of the SPPF space pyramid pooling feature layer;
the output end of the CBS7 layer, the output end of the first largest pooling layer and the output end of the second largest pooling layer are all connected with the input end of the series splicing layer.
Further, the internal structures of CBS1 layer, CBS2 layer, CBS3 layer, CBS4 layer, CBS5 layer, CBS6 layer, CBS7 layer and CBS8 layer are the same.
Further, as shown in fig. 4, the CBS1 layer includes: the convolution layer, the batch normalization layer and the activation function layer are sequentially connected.
Further, the working process of the SPPF spatial pyramid pooling feature layer comprises: dividing the input feature map into sub-maps of different sizes and applying max pooling to each sub-map to obtain feature vectors of a set size.
In the SPPF module, different sizes are chosen to fit target objects of different scales and to enlarge the model's receptive field, so that objects of different sizes are detected better; at the same time, the SPPF module reduces compression of the feature map size, avoiding information loss. Structurally, the input is passed through several max-pooling operations and the results are fused, which mitigates the multi-scale target problem to some extent.
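An illustrative PyTorch sketch of the SPPF layer described above follows; the 5×5 pooling kernel and channel widths are assumptions, and chaining three equal-size pools stands in for pooling at several sizes. The CBS block is re-sketched inline as a small helper.

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=1):
    # Conv + BatchNorm + SiLU, the CBS block described in this embodiment
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class SPPF(nn.Module):
    """SPPF sketch: input CBS, then three chained max-pools; the CBS output
    and all three pooled maps are concatenated and fused by an output CBS.
    Chaining equal 5x5 pools emulates parallel 5/9/13 pooling at lower cost."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = cbs(c_in, c_mid, 1)
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)
        self.cv2 = cbs(c_mid * 4, c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.cv2(torch.cat([x, p1, p2, p3], dim=1))
```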
It will be appreciated that the backbone network extracts the features of the image. The original C3 module is replaced with the lighter C2f module, designed by combining the ideas of the C3 module and the ELAN efficient layer aggregation network, so richer gradient flow information is obtained while light weight is preserved, making the traffic sign recognition model lighter and easier to deploy on an unmanned vehicle.
It will be appreciated that the backbone introduces SPD-Conv to better capture the feature information of small traffic sign targets: a space-to-depth (SPD) layer followed by a non-strided convolution layer. The SPD layer downsamples the feature map X but retains all information in the channel dimension, so no information is lost. The original image is scaled with image conversion techniques before being input into the neural network, and this generalizes to downsampling feature maps within and throughout the network. In addition, a non-strided convolution is added after each SPD layer to reduce the number of channels using learnable parameters.
Further, as shown in fig. 2, the neck layer includes: the system comprises a CBS9 layer, a first upsampling layer, a series splicing layer L3, a C2f5 layer, a CBS10 layer, a second upsampling layer and a series splicing layer L4 which are sequentially connected from bottom to top;
the neck layer further comprises: the C2f6 layer, the CBS11 layer, the series splicing layer L5, the C2f7 layer, the CBS12 layer, the series splicing layer L6 and the C2f7 layer are sequentially connected from top to bottom;
the input end of the CBS9 layer is connected with the output end of the SPPF space pyramid pooling feature layer;
the input end of the series splicing layer L3 is connected with the output end of the C2f3 layer;
the input end of the series spliced layer L4 is connected with the output end of the C2f2 layer;
the output end of the CBS9 layer is connected with the input end of the series splicing layer L6;
the output end of the CBS10 layer is connected with the input end of the series splicing layer L5;
and the output end of the series spliced layer L4 is connected with the input end of the C2f6 layer.
It should be understood that, to balance network accuracy and efficiency when the neck layer performs average-pooling decomposition on its feature source images, the original path aggregation network in the YOLOv5 network is replaced by Bi-FPN (a weighted bidirectional feature pyramid network). Bi-FPN uses efficient bidirectional cross-scale connections and weighted feature fusion, forming top-down and bottom-up feature pyramids; when deeper information is needed, extra edges can be added to the original structure, so more traffic sign features are fused at no extra cost. Because input features at different resolutions contribute differently to the output, adding weights lets the network better understand and identify how critical each input feature is, increasing efficiency.
Bi-FPN lets the network concentrate more on critical layers by assigning each layer a different weight, while also pruning unnecessary inter-layer connections.
To maximize efficiency, when original input and output nodes sit at the same level, new edges are added to integrate more information at reduced operating cost; by building several top-down and bottom-up feature network layers and iterating them continuously, the efficiency of feature fusion is effectively improved.
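The per-node weighted fusion that Bi-FPN applies can be sketched as follows; this is an illustrative rendering of fast normalized fusion, assuming the incoming feature maps have already been resized to a common shape.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion at one Bi-FPN node: every incoming feature map
    gets a learnable non-negative weight, and the weights are normalized so
    the network can learn which input scale matters most at this node."""
    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, feats):
        w = torch.relu(self.w)            # keep weights non-negative
        w = w / (w.sum() + self.eps)      # normalize to sum ~1
        return sum(wi * f for wi, f in zip(w, feats))
```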
Further, as shown in fig. 2, the attention layer includes three parallel coordinate attention modules: a first Coordinate Attention module, a second Coordinate Attention module, and a third Coordinate Attention module;
the input end of the first Coordinate Attention coordinate attention module is connected with the output end of the C2f6 layer;
the input end of the second Coordinate Attention coordinate attention module is connected with the output end of the C2f7 layer;
and the input end of the third Coordinate Attention coordinate attention module is connected with the output end of the C2f8 layer.
Further, the internal structures of the first, second, and third Coordinate Attention modules are identical.
Further, as shown in fig. 6, the first Coordinate Attention coordinate attention module includes:
the input end of the residual error module is used as the input end of the first Coordinate Attention coordinate attention module; the output end of the residual error module is connected with the input end of the weight updating module;
the output end of the residual error module is also respectively connected with the input end of the horizontal direction average pooling layer and the input end of the vertical direction average pooling layer; the horizontal direction average pooling layer and the vertical direction average pooling layer are realized through the average pooling layer;
the output end of the horizontal-direction average pooling layer and the output end of the vertical-direction average pooling layer are connected to the input end of the series splicing layer L7. The output end of the series splicing layer L7 is connected to the input end of the two-dimensional convolution layer E1; the output end of the two-dimensional convolution layer E1 is connected to the input end of the normalization layer; and the output end of the normalization layer is connected to the input end of the nonlinear activation function layer. The output end of the nonlinear activation function layer is connected to the input end of the two-dimensional convolution layer E2 and to the input end of the two-dimensional convolution layer E3; the output end of the two-dimensional convolution layer E2 is connected to the input end of the first activation function layer, and the output end of the two-dimensional convolution layer E3 is connected to the input end of the second activation function layer. The output ends of the first and second activation function layers are both connected to the input end of the weight updating module, whose output end serves as the output end of the first Coordinate Attention module.
Further, as shown in fig. 6, the working process of the first Coordinate Attention module includes:
respectively aggregating the input feature graphs along the vertical and horizontal directions into two feature graphs embedded with direction information by utilizing two one-dimensional global pooling operations;
then respectively encoding the two feature maps embedded with the direction information into two attention maps, wherein each attention map captures the remote dependency relationship of the input feature map along one spatial direction;
the location information is saved in the generated attention profile and then both attention profiles are applied to the input feature map by multiplication to emphasize the representation of the attention area.
Further, as shown in fig. 6, the processing steps of the first Coordinate Attention module are as follows:
(11): inputting a feature map;
(12): coordinate extraction: extracting the coordinate positions of the feature images to obtain two coordinate vectors, wherein the two coordinate vectors respectively represent the horizontal coordinates and the vertical coordinates in the feature images;
(13): coordinate embedding: respectively embedding the two coordinate vectors to obtain two new vectors, wherein the two new vectors respectively represent an embedded vector of a horizontal coordinate and an embedded vector of a vertical coordinate;
(14): feature embedding: feature embedding is carried out on the feature map through a convolution layer, so that a new feature vector is obtained;
(15): coordinate attention calculation: calculating an attention weight between the embedded vector and the feature vector of the horizontal coordinate using the dot product; calculating an attention weight between the embedded vector of the vertical coordinates and the feature vector using the dot product; obtaining two coordinate attention vectors;
(16): coordinate attention weighting: carrying out weighted addition on the coordinate attention vector and the feature vector to obtain a final weighted feature vector;
(17): outputting a characteristic diagram: and outputting the weighted feature vector.
It should be understood that the working process of the attention layer includes weighting information at different positions of the feature map, improving the model's detection accuracy for the target and its robustness. Using coordinate extraction and embedding, the coordinate position information in the feature map is fused with the feature vector, yielding more expressive feature vectors. The coordinate attention mechanism keeps the number of model parameters and the computational complexity unchanged, improving the model's performance and efficiency.
It should be understood that, to capture channel information and position information simultaneously and better acquire small-target traffic sign features under the influence of extreme weather, the position information of each traffic sign feature image block is embedded into the channel attention. In the information embedding stage, pooling kernels of sizes (H, 1) and (1, W) are first applied to the feature layers input into the coordinate attention module, pooling each channel along the two spatial directions. Position and channel information are thus captured well at little computational cost, helping the network locate the object of interest.
Further, the first Coordinate Attention coordinate attention module works according to the following principle:
the pooled outputs of the c-th channel with height H and the c-th channel with width W are as follows:
the output of the c-th channel of width w is shown in equation (2)
Wherein x is c (h, i) and x c (j, w) represent features in the two-dimensional direction of the feature map, respectively.
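Putting the pooling of equations (1)-(2) together with the encoding and re-weighting steps, an illustrative PyTorch sketch of a coordinate attention module follows; the reduction ratio and the Hardswish non-linearity are assumptions rather than the patent's settings.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention sketch: pool along H and along W (eqs. 1-2),
    concatenate, encode with a shared 1x1 conv + BN + non-linearity, then
    split back into two direction-wise attention maps that reweight the
    input feature map."""
    def __init__(self, c: int, r: int = 32):
        super().__init__()
        c_mid = max(8, c // r)
        self.conv1 = nn.Conv2d(c, c_mid, 1)
        self.bn = nn.BatchNorm2d(c_mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(c_mid, c, 1)
        self.conv_w = nn.Conv2d(c_mid, c, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        zh = x.mean(dim=3, keepdim=True)                  # (N, C, H, 1), eq. (1)
        zw = x.mean(dim=2, keepdim=True).transpose(2, 3)  # (N, C, W, 1), eq. (2)
        y = self.act(self.bn(self.conv1(torch.cat([zh, zw], dim=2))))
        yh, yw = y.split([h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                  # (N, C, H, 1)
        aw = torch.sigmoid(self.conv_w(yw.transpose(2, 3)))  # (N, C, 1, W)
        return x * ah * aw                                   # weight update
```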
Further, the prediction layer includes three parallel dual-decoupling detection head modules: a first Decoupled Head module, a second Decoupled Head module, and a third Decoupled Head module;
the input end of the first Decoupled Head module is connected to the output end of the first Coordinate Attention module;
the input end of the second Decoupled Head module is connected to the output end of the second Coordinate Attention module;
the input end of the third Decoupled Head module is connected to the output end of the third Coordinate Attention module.
Further, the internal structures of the first, second and third Decoupled Head double-decoupling detection Head modules are identical.
Further, as shown in fig. 7, the first Decoupled Head dual-decoupling detection head module includes: a CBS13 layer;
the input end of the CBS13 layer serves as the input end of the first Decoupled Head module;
the output end of the CBS13 layer is connected to the input end of the CBS14 layer and to the input end of the CBS15 layer;
the output end of the CBS14 layer is connected with the Cls class layer through the convolution layer J1 and the third activation function layer in sequence;
the output end of the CBS15 layer is respectively connected with the input end of the convolution layer J2 and the input end of the convolution layer J3;
the output end of the convolution layer J2 is connected with the Reg coordinate layer;
the output end of the convolution layer J3 is connected with a fourth activation function layer, and the fourth activation function layer is connected with an Obj target layer.
The Cls class layer outputs a vector of size C, where C is the number of target classes; it represents the probability that a prediction box belongs to each target class. For each prediction box, a Cls class vector is predicted, giving the probability that the box belongs to each class;
the Reg coordinate layer outputs the position and size information of a prediction box as a vector of size 4, representing the box's center coordinates, width, and height. For each prediction box, a Reg coordinate vector is predicted, giving the box's position and size;
the Obj target layer outputs a scalar confidence indicating whether the prediction box contains a target. For each prediction box, an Obj target value is predicted, indicating whether the box contains a target.
Further, the working process of the first Decoupled Head dual-decoupling detection head module includes:
(21): target frame regression: sending the feature map into a target frame regression branch, and obtaining a predicted value of the target frame through a convolution layer and a full connection layer; the output of the target box regression branch typically includes center coordinates, width, and height information of the target box;
(22): object classification: sending the feature map into a target classification branch, and obtaining a predicted value of a target class through a convolution layer and a full connection layer; the output of the target classification branch typically includes a probability value for the target belonging to each category;
(23): target frame decoding: converting the output of the target frame regression branch into actual target frame coordinates; usually anchor boxes are used for decoding, i.e., an offset is predicted for each anchor box, yielding the actual target frame coordinates (see the sketch after this list);
(24): target frame screening: and screening and filtering the target frame according to the confidence score (confidence score) and the class probability of the target frame to obtain a final target detection result.
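For illustration, the anchor-box decoding of step (23) might be sketched as follows; the sigmoid/exp offset parameterization is a common YOLO-style convention, assumed here rather than taken from the patent.

```python
import torch

def decode_boxes(raw: torch.Tensor, anchors: torch.Tensor,
                 stride: float) -> torch.Tensor:
    """Turn per-anchor offset predictions (tx, ty, tw, th) into actual box
    coordinates (cx, cy, w, h).

    `raw` is (N, 4); `anchors` is (N, 4) holding each anchor's grid cell
    (gx, gy) and prior size (aw, ah)."""
    gx, gy, aw, ah = anchors.unbind(dim=1)
    cx = (torch.sigmoid(raw[:, 0]) + gx) * stride   # center x in pixels
    cy = (torch.sigmoid(raw[:, 1]) + gy) * stride   # center y in pixels
    w = aw * torch.exp(raw[:, 2])                   # width from anchor prior
    h = ah * torch.exp(raw[:, 3])                   # height from anchor prior
    return torch.stack([cx, cy, w, h], dim=1)
```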
Further, the working process of the prediction layer comprises: dividing the feature map into several grids and predicting a set of bounding boxes for each grid. For each bounding box, the prediction layer predicts its position, its size, and a first confidence indicating whether an object is contained within the bounding box; for each target class, the prediction layer predicts a second confidence indicating whether a target of that class is contained within the bounding box. These predictions are computed by convolution and fully connected layers.
It should be appreciated that classification and regression tasks conflict in the prediction layer. To avoid this conflict, a dual decoupled head of the kind widely used for classification and localization in most one-stage and two-stage detectors is employed. As shown in fig. 7, the dual decoupled head separates the classification sub-network from the regression sub-network, designing two independent sub-networks: a detection sub-network used only to predict the position and size of the object, and a classification sub-network dedicated to predicting the object's class. This improves the network's convergence speed, makes the network more stable, and improves accuracy on tiny traffic sign targets in extreme weather.
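An illustrative PyTorch sketch of such a dual-decoupling detection head follows, matching the Cls/Reg/Obj branch layout of fig. 7; channel widths, class count, and the sigmoid placements are assumptions, and CBS is re-sketched inline as a helper.

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=1):
    # Conv + BatchNorm + SiLU, the CBS block described in this embodiment
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class DecoupledHead(nn.Module):
    """Decoupled head sketch: a shared stem (CBS13), a classification branch
    (CBS14 -> conv -> activation -> Cls) and a regression branch (CBS15 ->
    conv -> Reg, plus conv -> activation -> Obj)."""
    def __init__(self, c_in: int, n_classes: int, c_mid: int = 256):
        super().__init__()
        self.stem = cbs(c_in, c_mid, 1)
        self.cls_branch = nn.Sequential(cbs(c_mid, c_mid, 3),
                                        nn.Conv2d(c_mid, n_classes, 1))
        self.reg_stem = cbs(c_mid, c_mid, 3)
        self.reg = nn.Conv2d(c_mid, 4, 1)   # box: cx, cy, w, h
        self.obj = nn.Conv2d(c_mid, 1, 1)   # objectness confidence

    def forward(self, x):
        x = self.stem(x)
        cls = torch.sigmoid(self.cls_branch(x))  # per-class probabilities
        r = self.reg_stem(x)
        box = self.reg(r)                        # raw box offsets
        obj = torch.sigmoid(self.obj(r))         # confidence in [0, 1]
        return cls, box, obj
```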
Further, the step S103: and identifying the preprocessed image by adopting a trained traffic sign identification model to obtain an identified traffic sign, wherein the training process of the trained traffic sign identification model comprises the following steps of:
constructing a training set and a testing set; the training set and the testing set both comprise: a traffic sign image set of known traffic sign recognition results;
inputting the training set into a traffic sign recognition model, training the model, and stopping training when the total loss function value of the model is not reduced any more, so as to obtain a preliminary traffic sign recognition model;
and testing the preliminary traffic sign recognition model by adopting a test set, wherein when the test accuracy exceeds a set threshold, the current preliminary traffic sign recognition model is the trained traffic sign recognition model.
Further, the constructing the training set specifically includes:
collecting traffic sign images in real road condition scenes;
performing image enhancement and image padding on the collected images;
labeling the collected traffic sign images with a data annotation tool;
and clustering the traffic sign dataset with a clustering algorithm to obtain anchor boxes.
It should be understood that the K-means++ clustering algorithm is used to re-cluster the traffic sign dataset into anchor boxes, so that the anchors better fit traffic sign scenes in real environments; the number and sizes of the anchors in the YOLOv5 configuration file are adjusted accordingly.
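For illustration, the anchor re-clustering step might be sketched as follows; scikit-learn's k-means++ initialization stands in for the K-means++ algorithm named above, and the function name is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchors(wh: np.ndarray, n_anchors: int = 9) -> np.ndarray:
    """Re-cluster labelled box (width, height) pairs into anchor sizes.

    `wh` is an (N, 2) array of ground-truth box sizes from the traffic sign
    dataset. Returns the anchor boxes sorted by area, ready to be written
    into the detector's configuration."""
    km = KMeans(n_clusters=n_anchors, init="k-means++", n_init=10,
                random_state=0).fit(wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]
```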
Further, the total loss function of the model is:

$$L = W_{box} L_{box} + W_{cls} L_{cls}$$

where $W_{box}$ and $W_{cls}$ are the box and class loss weights respectively, $L_{box}$ is the regression loss function, and $L_{cls}$ is the Focal Loss, a loss function used to address class imbalance.

The regression loss function is:

$$L_{box} = 1 - IoU + \frac{\Delta + \Omega}{2}$$

where Δ is the distance cost, Ω is the shape cost, and IoU is the metric used in tasks such as object detection and semantic segmentation to evaluate the overlap between a predicted box and a ground-truth box: the intersection area of the predicted and ground-truth boxes divided by the area of their union.
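For illustration, the total loss above can be sketched as follows; the weight values are assumptions rather than the patent's settings, and `cls_loss` is taken to be a precomputed focal loss term.

```python
import torch

def total_loss(iou: torch.Tensor, delta: torch.Tensor, omega: torch.Tensor,
               cls_loss: torch.Tensor,
               w_box: float = 0.05, w_cls: float = 0.5) -> torch.Tensor:
    """Total loss L = W_box * L_box + W_cls * L_cls, with the SIoU-style box
    term L_box = 1 - IoU + (distance cost + shape cost) / 2.

    `iou`, `delta`, `omega` are per-box tensors; `cls_loss` is a scalar
    focal-loss term computed over the class predictions."""
    l_box = 1.0 - iou + (delta + omega) / 2.0
    return w_box * l_box.mean() + w_cls * cls_loss
```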
First, the unmanned sensing device triggers a target event: while the unmanned vehicle drives on a traffic road, it receives image data frames of the road ahead in real time, and the traffic sign feature information of the target feature images in those frames is then recognized and acquired by the traffic sign recognition model.
Second, the acquired images are preprocessed. Under extreme weather they are roughly divided into rain/snow and haze. Hazy images are preprocessed with the Feature Fusion Attention Network (FFA-Net), which defogs images in an end-to-end manner; rain and snow images are preprocessed with the Multi-Stage Progressive Image Restoration Network (MPR-Net), which balances spatial detail against high-level contextual information during restoration. The preprocessed image is then fed into the model adapted to tiny traffic sign recognition for detection.
Next, after receiving the preprocessed image data frames, the traffic sign recognition model continuously downsamples the received data and then performs pooling decomposition along the two spatial directions to obtain several decomposed feature image blocks; blocks with the same dimensional information are spliced and fused stage by stage, and the target feature image is acquired after two-dimensional convolution.
To solve the problem of traffic sign recognition in extreme weather, the method better adapts to the influence of extreme weather interference and improves the adaptability of the traffic sign recognition model. The traffic sign recognition model in this method is trained on a lightweight convolutional neural network structure; the overall flow is shown in fig. 2.
The specific implementation of the traffic sign recognition model is as follows. First, the acquired image data frames are continuously pooled. An image decomposition network structure is set in the backbone network; after receiving an image data frame, it cyclically extracts feature source images of different required sizes according to a preset downsampling rule and performs average-pooling decomposition of the feature maps along the two spatial directions, obtaining several tiny traffic sign target feature image blocks.
The method of the invention further comprises preprocessing the received image data frame before downsampling, adjusting the resolution of the input frame to a predetermined size first. Without changing the video memory size, batch size, or number of threads, the invention increases the input resolution from 640 × 640 to 1280 × 1280 to acquire tiny traffic sign target features more accurately, thereby mitigating the loss of traffic sign feature information during downsampling. The K-means++ clustering algorithm re-clusters the traffic sign dataset into anchor boxes, making the anchors better fit traffic sign scenes in real environments, and the number and sizes of the anchors in the YOLOv5 configuration file are adjusted accordingly. The C3 module is improved upon by adopting the lighter C2f module; as shown in fig. 4, the C2f module is designed from the ideas of the C3 module and the ELAN efficient layer aggregation network, obtaining richer gradient flow information while guaranteeing light weight, so the traffic sign recognition model is lighter and easier to deploy on an unmanned vehicle.
To better capture the feature information of tiny traffic sign targets, SPD-Conv is introduced: a space-to-depth (SPD) layer followed by a non-strided convolution layer. The SPD layer downsamples the feature map X but retains all information in the channel dimension, so no information is lost. The original image is scaled with image conversion techniques before being input into the neural network, and this generalizes to downsampling feature maps within and throughout the network. In addition, a non-strided convolution is added after each SPD layer, reducing the number of channels with learnable parameters; using SPD-Conv increases the network's sensitivity to tiny traffic sign targets, letting it capture traffic sign feature information more comprehensively.
Traffic sign recognition under extreme weather interference must weigh two important factors, speed and precision. To let the network acquire traffic sign features under extreme weather interference faster and more accurately when the traffic sign feature source images are decomposed by average pooling, the original path aggregation network in the YOLOv5 network is replaced with a weighted bidirectional feature pyramid network. After continuous downsampling of the feature points, a stack of feature layers with rich semantic content is obtained; upsampling then enlarges the feature layers' height and width again, so large feature maps are used to detect tiny traffic sign targets. Each layer is given its own weight during fusion, so the network focuses more on important feature layers and on the semantic information of tiny traffic sign targets, while node connections of redundant layers are removed; no extra weight is needed, and the model stays light.
In addition, to capture channel information and position information simultaneously and better acquire tiny traffic sign target features in extreme weather, the position information of each traffic sign feature image block is embedded into the channel attention. As shown in fig. 6, the information embedding stage first applies pooling kernels of sizes (H, 1) and (1, W) to the feature layers input into the coordinate attention module, pooling each channel along the two spatial directions. Position and channel information are captured well with only a small increase in computation, helping the network locate the object of interest.
After the two-dimensional convolution is completed, the resulting features are classified with an activation function. The CIoU loss used by YOLOv5 relies on the aggregation of bounding box regression metrics and ignores the direction of mismatch between the ground-truth box and the predicted box, which slows training and hurts prediction accuracy compared with SIoU; the SIoU loss predicts the localization of the bounding box regression more accurately. SIoU reduces the loss weight of large and medium targets, mitigating the imbalance in the size distribution of feature image blocks during recognition and enabling more accurate loss computation between predicted and ground-truth boxes in the traffic sign recognition task. In the invention, while the activation function classifies the target feature images, the loss function also judges the true information of the traffic sign feature images.
In the prediction layer, the conflict between classification and regression tasks is a well-known problem, so dual decoupled heads for classification and localization are widely used in most one-stage and two-stage detectors. Replacing the original coupled detection head with a dual-decoupling detection head improves the network's convergence speed.
Therefore, for the problem of traffic sign recognition in extreme weather, the method balances detection speed against recognition precision, improves the network's feature extraction ability, improves detection of targets and small targets under extreme weather interference, and detects road traffic signs in real time in real extreme weather scenes.
Compared with the prior art, the invention introduces SPD-Conv in place of strided convolutions and pooling layers, eliminating the loss of fine-grained information and the inefficient feature representations they cause, and better capturing the feature information of tiny traffic sign targets. Adopting a coordinate attention mechanism, as shown in fig. 6, embeds position information into the channel attention, so the lightweight convolutional neural network obtains tiny traffic sign target information over a larger area under the influence of extreme weather and acquires important traffic sign features more accurately. Cross-layer connections, i.e., extra paths, are added to the feature fusion network so that more features are fused at no extra cost, improving the network's extraction of small-target feature information and the detection of traffic sign targets under extreme weather interference. The invention adopts dual decoupled heads: since the features used for classification differ from those used for regression, the dual decoupled heads accelerate network convergence and improve detection precision for small-target traffic signs. The invention improves on the C3 module by adopting the lightweight C2f module; as shown in fig. 4, the C2f module draws on the ideas of the C3 module and the ELAN efficient layer aggregation network, obtaining richer gradient flow information while guaranteeing light weight, so the traffic sign recognition model is lighter and easier to deploy on an unmanned vehicle. The invention adopts the SIoU loss, which reduces the loss weight of small targets, mitigates the imbalance in the size distribution of feature samples during traffic sign recognition, enables more accurate loss computation between predictions and ground truth in the traffic sign recognition task, and improves the recognition precision of tiny traffic sign targets.
Comparative experiments were performed:
in order to verify the stability of the proposed algorithm and to optimize the network parameters, the existing CCTSDB traffic sign data set is selected as the sample set, supplemented with 3000 custom occlusion-interference images, a large amount of foggy-day data, and the like, which fully tests the generalization capability of the model; the data set is divided into a training set and a test set at a ratio of 8:2. The initial learning rate is set to 0.003 and the batch size to 16. The experimental hardware configuration is shown in Table 1.
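For reference, a minimal sketch of the 8:2 split described above; the directory names, file names, and random seed are purely illustrative assumptions.

```python
import random

def split_dataset(image_paths, train_ratio=0.8, seed=0):
    """Shuffle and split a list of sample paths into train/test at 8:2."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]

# example: combine CCTSDB with custom occlusion and foggy-day samples
# (these paths are purely illustrative)
all_samples = ["cctsdb/000001.jpg", "occlusion/000001.jpg", "fog/000001.jpg"]
train_set, test_set = split_dataset(all_samples)
```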
Table 1: experimental hardware fitting list
Table 2: Comparative experiments (expanded CCTSDB traffic sign dataset):
experimental results show that the improved YOLOv5 algorithm model reaches 89.1% on the expanded CCTSDB traffic sign data set, an improvement of 2.3% over YOLOv5; it significantly improves the detection precision of traffic signs in practical applications and can accurately acquire traffic sign information in extreme weather environments, so that target information is accurately judged.
Example two
The embodiment provides a system for identifying traffic signs by unmanned vehicles in extreme weather;
a system for identifying traffic signs for unmanned vehicles in extreme weather, comprising:
An acquisition module configured to: under extreme weather, acquiring a traffic sign map to be identified; the extreme weather refers to rain and snow weather or haze weather;
a preprocessing module configured to: aiming at rain and snow weather, carrying out rain and snow removal pretreatment on the traffic sign map to be identified; aiming at haze weather, carrying out defogging pretreatment on the traffic sign map to be identified;
an identification module configured to: and identifying the preprocessed image by adopting a trained traffic sign identification model to obtain an identified traffic sign.
It should be appreciated that the task of the acquisition module is to acquire image data from a camera or other sensor. A suitable sensor or camera should be chosen to ensure that high-quality images can be acquired; factors such as resolution, frame rate, and field of view need to be considered when selecting the camera.
It should be understood that the rain and snow removal pretreatment of the traffic sign map to be identified in rainy and snowy weather uses the MPRNet multi-stage progressive image restoration network, which balances spatial detail against high-level contextual information when restoring rain and snow images: features are extracted with an encoder-decoder, and information is then preserved by fusing local information. A pixel-adaptive design is used in each stage, and local attention is adjusted through an attention supervision mechanism. The advantage is that information can be exchanged between the different stages without loss; when information is not propagated strictly front-to-back in the original order, lateral connections between the feature processing modules avoid information loss. For haze weather, the defogging pretreatment of the traffic sign map to be identified uses the feature fusion attention network FFA-Net, a deep learning network that restores a foggy image end-to-end. In FFA-Net, the input foggy image first enters a shallow feature extractor and then passes through N group structures with multiple skip connections; the feature maps are then fused by the attention module, and the fused features are passed to the reconstruction part and the global residual learning structure to obtain the fog-free output.
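A hedged sketch of how the preprocessing module might dispatch between the two restoration networks; derain_model and dehaze_model stand in for pretrained MPRNet- and FFA-Net-style models and are assumptions of this example, not interfaces defined by the patent.

```python
import torch

def preprocess(image, weather, derain_model, dehaze_model):
    """Route the input through MPRNet-style deraining or FFA-Net-style
    dehazing depending on the detected weather condition."""
    with torch.no_grad():
        if weather in ("rain", "snow"):
            return derain_model(image)   # multi-stage progressive restoration
        if weather == "haze":
            return dehaze_model(image)   # feature-fusion attention dehazing
    return image  # clear weather: no restoration needed
```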
It should be understood that the preprocessed image is identified with the trained traffic sign recognition model to obtain the identified traffic sign, with the following specific flow: first, the image to be detected is input into the improved YOLOv5 model. The input image then passes through the Backbone network of convolutional and pooling layers, which extracts the features of the image. On the basis of the features extracted by the Backbone network, a Neck network is introduced to further extract and optimize the image features. On the output feature maps of the Neck network, a Head network is introduced to detect the target objects. The Head network includes a plurality of output layers; each output layer outputs a feature map of a specific size, and each feature map is used to detect objects of a different size.
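A toy sketch of this Backbone-Neck-Head flow, assuming the model exposes the three stages as submodules; the attribute names backbone, neck, and heads are assumptions of this sketch.

```python
def recognize(image, model):
    """Run one image through the improved YOLOv5-style pipeline and
    return per-scale (cls, reg, obj) predictions from the three heads."""
    feats = model.backbone(image)   # multi-scale backbone features
    fused = model.neck(feats)       # feature fusion (FPN/PAN-style)
    return [head(f) for head, f in zip(model.heads, fused)]
```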
It should be noted that the above-mentioned acquisition module, preprocessing module and identification module correspond to steps S101 to S103 in the first embodiment; the examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the modules described above may be implemented as part of a system in a computer system, for example as a set of computer-executable instructions.
The foregoing embodiments each emphasize different aspects; for details not described in one embodiment, reference may be made to the related description of another embodiment.
The proposed system may be implemented in other ways. The system embodiments described above are merely illustrative; for example, the division into the modules described above is merely a logical function division, and other divisions are possible in actual implementation: multiple modules may be combined or integrated into another system, or some features may be omitted or not performed.
Example III
The embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein the processor is coupled to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of the first embodiment.
Example IV
The present embodiment also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the method of embodiment one.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the invention; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A method for identifying traffic signs by unmanned vehicles in extreme weather, comprising:
under extreme weather, acquiring a traffic sign map to be identified; the extreme weather refers to rain and snow weather or haze weather;
aiming at rain and snow weather, carrying out rain and snow removal pretreatment on the traffic sign map to be identified; aiming at haze weather, carrying out defogging pretreatment on the traffic sign map to be identified;
and identifying the preprocessed image by adopting a trained traffic sign identification model to obtain an identified traffic sign.
2. The method for identifying traffic sign by unmanned vehicle in extreme weather according to claim 1, wherein the preprocessed image is identified by using a trained traffic sign identification model, so as to obtain the identified traffic sign, and the network structure of the trained traffic sign identification model comprises:
The backbone layer, the neck layer, the attention layer and the prediction layer are connected in sequence;
the backbone layer is used for extracting the characteristics of the preprocessed image; the neck layer is used for carrying out multi-scale feature fusion on the feature map; the attention layer is used for focusing on local information; and the prediction layer is used for carrying out final regression prediction.
3. The method for identifying traffic signs for unmanned vehicles in extreme weather according to claim 2, wherein said backbone layer comprises: a Focus convolutional neural network layer, a CBS1 layer, an SPD-Conv1 layer, a C2f1 layer, a CBS2 layer, an SPD-Conv2 layer, a C2f2 layer, a CBS3 layer, an SPD-Conv3 layer, a C2f3 layer, a CBS4 layer, an SPD-Conv4 layer, a C2f4 layer and an SPPF space pyramid pooling layer which are connected in sequence;
the working process of the SPD-Conv1 layer is as follows: a space-to-depth layer first rearranges the input feature map, and a non-strided convolution layer then performs a convolution operation on the rearranged output feature map;
the C2f1 layer includes:
the CBS5 layer, the Split network layer, the first Bottleneck residual block structure layer, the series splicing layer L1 and the CBS6 layer are sequentially connected;
the output end of the first Bottleneck residual block structure layer is connected with the input end of the second Bottleneck residual block structure layer;
The output end of the second Bottleneck residual block structure layer is connected with the input end of the third Bottleneck residual block structure layer;
the output end of the second Bottleneck residual block structure layer and the output end of the third Bottleneck residual block structure layer are connected with the input end of the series splicing layer L1; the output end of the Split network layer is connected with the input end of the series splicing layer L1;
the SPPF space pyramid pooling feature layer comprises:
the CBS7 layer, the first maximum pooling layer, the second maximum pooling layer, the third maximum pooling layer, the series splicing layer L2 and the CBS8 layer are sequentially connected; the CBS7 layer is used as an input end of the SPPF space pyramid pooling feature layer, and the CBS8 layer is used as an output end of the SPPF space pyramid pooling feature layer;
the output end of the CBS7 layer, the output end of the first maximum pooling layer and the output end of the second maximum pooling layer are all connected with the input end of the series splicing layer L2;
the working process of the SPPF space pyramid pooling feature layer comprises the following steps: the input feature images are divided into sub-images with different sizes, and the maximum pooling operation is carried out on each sub-image, so that a feature vector with a set size is obtained.
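For illustration, a compact PyTorch sketch of an SPPF-style layer consistent with the structure recited above (CBS, three chained maximum pooling layers, concatenation of all four branches, CBS); the kernel size 5 and channel widths are assumptions.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """CBS -> three chained max-pools -> concat of all four -> CBS."""
    def __init__(self, in_ch, out_ch, k=5):
        super().__init__()
        mid = in_ch // 2
        self.cbs_in = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU())
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)
        self.cbs_out = nn.Sequential(
            nn.Conv2d(mid * 4, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.SiLU())

    def forward(self, x):
        x = self.cbs_in(x)
        p1 = self.pool(x)    # receptive field of one 5x5 pool
        p2 = self.pool(p1)   # ~9x9
        p3 = self.pool(p2)   # ~13x13
        return self.cbs_out(torch.cat([x, p1, p2, p3], dim=1))
```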
4. The method for identifying traffic signs for unmanned vehicles in extreme weather according to claim 2, wherein said neck layer comprises: the system comprises a CBS9 layer, a first upsampling layer, a series splicing layer L3, a C2f5 layer, a CBS10 layer, a second upsampling layer and a series splicing layer L4 which are sequentially connected from bottom to top;
The neck layer further comprises: the C2f6 layer, the CBS11 layer, the series splicing layer L5, the C2f7 layer, the CBS12 layer, the series splicing layer L6 and the C2f8 layer which are sequentially connected from top to bottom;
the input end of the CBS9 layer is connected with the output end of the SPPF space pyramid pooling feature layer; the input end of the series splicing layer L3 is connected with the output end of the C2f3 layer; the input end of the series spliced layer L4 is connected with the output end of the C2f2 layer; the output end of the CBS9 layer is connected with the input end of the series splicing layer L6; the output end of the CBS10 layer is connected with the input end of the series splicing layer L5; the output end of the series spliced layer L4 is connected with the input end of the C2f6 layer.
5. The method for identifying traffic signs for unmanned vehicles in extreme weather according to claim 2, wherein said attention layer comprises: three parallel coordinate attention modules, namely a first Coordinate Attention coordinate attention module, a second Coordinate Attention coordinate attention module and a third Coordinate Attention coordinate attention module;
the input end of the first Coordinate Attention coordinate attention module is connected with the output end of the C2f6 layer;
the input end of the second Coordinate Attention coordinate attention module is connected with the output end of the C2f7 layer;
The input end of the third Coordinate Attention coordinate attention module is connected with the output end of the C2f8 layer;
the first Coordinate Attention coordinate attention module includes:
the input end of the residual error module is used as the input end of the first Coordinate Attention coordinate attention module; the output end of the residual error module is connected with the input end of the weight updating module;
the output end of the residual error module is also respectively connected with the input end of the horizontal direction average pooling layer and the input end of the vertical direction average pooling layer; the horizontal direction average pooling layer and the vertical direction average pooling layer are realized through the average pooling layer;
the output end of the horizontal direction average pooling layer and the output end of the vertical direction average pooling layer are connected with the input end of the series splicing layer L7; the output end of the series splicing layer L7 is connected with the input end of the two-dimensional convolution layer E1; the output end of the two-dimensional convolution layer E1 is connected with the input end of the normalization layer; the output end of the normalization layer is connected with the input end of the nonlinear activation function layer; the output end of the nonlinear activation function layer is respectively connected with the input end of the two-dimensional convolution layer E2 and the input end of the two-dimensional convolution layer E3; the output end of the two-dimensional convolution layer E2 is connected with the input end of the first activation function layer; the output end of the two-dimensional convolution layer E3 is connected with the input end of the second activation function layer; the output end of the first activation function layer and the output end of the second activation function layer are both connected with the input end of the weight updating module; the output end of the weight updating module serves as the output end of the first Coordinate Attention coordinate attention module.
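A minimal sketch of a coordinate attention block matching the flow recited above (directional average pooling, concatenation, shared two-dimensional convolution with normalization and non-linear activation, two sigmoid-gated branches, and reweighting of the residual input); the channel reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, ch, reduction=32):
        super().__init__()
        mid = max(8, ch // reduction)
        self.conv1 = nn.Conv2d(ch, mid, 1)    # two-dimensional conv layer E1
        self.bn = nn.BatchNorm2d(mid)         # normalization layer
        self.act = nn.Hardswish()             # nonlinear activation function layer
        self.conv_h = nn.Conv2d(mid, ch, 1)   # conv E2 (height branch)
        self.conv_w = nn.Conv2d(mid, ch, 1)   # conv E3 (width branch)

    def forward(self, x):
        identity = x                                # residual input
        n, c, h, w = x.shape
        xh = x.mean(dim=3, keepdim=True)            # horizontal avg pool: (n,c,h,1)
        xw = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (n,c,w,1)
        y = torch.cat([xh, xw], dim=2)              # series splicing layer L7
        y = self.act(self.bn(self.conv1(y)))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                      # (n,c,h,1)
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))  # (n,c,1,w)
        return identity * ah * aw                   # weight updating
```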
6. The method for identifying traffic signs for unmanned vehicles in extreme weather according to claim 2, wherein said prediction layer comprises: three parallel double-decoupling detection Head modules, namely a first Decoupled Head double-decoupling detection Head module, a second Decoupled Head double-decoupling detection Head module and a third Decoupled Head double-decoupling detection Head module;
the input end of the first Decoupled Head double-decoupling detection Head module is connected with the output end of the first Coordinate Attention coordinate attention module;
the input end of the second Decoupled Head double-decoupling detection Head module is connected with the output end of the second Coordinate Attention coordinate attention module;
the input end of the third Decoupled Head double-decoupling detection Head module is connected with the output end of the third Coordinate Attention coordinate attention module;
the first Decoupled Head double-decoupling detection Head module includes: a CBS13 layer;
the input end of the CBS13 layer is used as the input end of the first Decoupled Head double-decoupling detection Head module;
the output end of the CBS13 layer is respectively connected with the input end of the CBS14 layer and the input end of the CBS15 layer;
the output end of the CBS14 layer is connected with the Cls class layer through the convolution layer J1 and the third activation function layer in sequence;
the output end of the CBS15 layer is respectively connected with the input end of the convolution layer J2 and the input end of the convolution layer J3;
The output end of the convolution layer J2 is connected with the Reg coordinate layer;
the output end of the convolution layer J3 is connected with a fourth activation function layer, and the fourth activation function layer is connected with an Obj target layer;
the Cls class layer refers to a vector with a size of C, which represents the probability that a prediction frame belongs to each target class, wherein C is the number of target classes, and for each prediction frame, a Cls class vector is predicted, which represents the probability that the frame belongs to each target class;
the Reg coordinate layer refers to the position and size information of a prediction frame; it is a vector of size 4, representing respectively the center coordinates, width and height of the prediction frame; for each prediction frame, a Reg coordinate vector is predicted, representing the position and size information of the frame;
the Obj target layer refers to the confidence that a target is contained in a prediction frame; it is a scalar, and one Obj target value is predicted for each prediction frame, representing whether the frame contains a target;
the working process of the prediction layer comprises the following steps: dividing the feature map into a plurality of grids, and predicting a set of bounding boxes for each grid; for each bounding box, the prediction layer predicts its position, size, and a first confidence level, the first confidence level being indicative of whether the object is contained within the bounding box; for each target class, the prediction layer predicts a second confidence, the second confidence representing whether the target of the current class is contained within the bounding box.
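A hedged sketch of the decoding this working process implies: a grid cell's predicted box is kept only when its objectness times its class probability clears a threshold; the threshold value and function name are assumptions.

```python
import torch

def decode_predictions(boxes, obj, cls, conf_thresh=0.25):
    """boxes: (N,4); obj: (N,1) objectness logits; cls: (N,C) class logits.
    Returns kept boxes with their best class and combined confidence."""
    scores = torch.sigmoid(obj) * torch.sigmoid(cls)   # (N, C)
    best_conf, best_cls = scores.max(dim=1)
    keep = best_conf > conf_thresh
    return boxes[keep], best_cls[keep], best_conf[keep]
```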
7. The method for identifying traffic signs by unmanned vehicles in extreme weather according to claim 1, wherein the pre-processed image is identified by using a trained traffic sign identification model, so as to obtain the identified traffic sign, and the training process of the trained traffic sign identification model comprises the following steps:
constructing a training set and a testing set; the training set and the testing set both comprise: a traffic sign image set of known traffic sign recognition results;
inputting the training set into a traffic sign recognition model, training the model, and stopping training when the total loss function value of the model is not reduced any more, so as to obtain a preliminary traffic sign recognition model;
testing the preliminary traffic sign recognition model by adopting a test set, wherein when the test accuracy exceeds a set threshold, the current preliminary traffic sign recognition model is the trained traffic sign recognition model;
a total loss function of the model, comprising:
the total loss function L is:
L = W_box × L_box + W_cls × L_cls
wherein W_box and W_cls are respectively the box and class loss weights; L_box is the regression loss function; L_cls is the Focal Loss, a loss function for addressing the class imbalance problem;
The regression loss function is:
L_box = 1 − IoU + (Δ + Ω)/2
wherein Δ is the distance loss (which incorporates the angle cost), Ω is the shape loss, and IoU is an index for evaluating the degree of overlap between the predicted frame and the real frame in tasks such as target detection and semantic segmentation; IoU equals the intersection area of the predicted and real frames divided by their union area.
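As a sketch of how the total loss might be assembled in code, with Focal Loss supplied by torchvision and the regression term passed in (for example, the siou_loss sketch given earlier in the description); the weights shown are illustrative assumptions.

```python
import torch
from torchvision.ops import sigmoid_focal_loss

def total_loss(pred_boxes, true_boxes, cls_logits, cls_targets,
               box_loss_fn, w_box=1.0, w_cls=1.0):
    """L = W_box * L_box + W_cls * L_cls.

    box_loss_fn is the regression loss (e.g. an SIoU implementation);
    Focal Loss handles class imbalance in the classification term.
    cls_targets must be a float tensor of the same shape as cls_logits.
    """
    l_box = box_loss_fn(pred_boxes, true_boxes)
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="mean")
    return w_box * l_box + w_cls * l_cls
```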
8. A system for identifying traffic signs for unmanned vehicles in extreme weather, comprising:
an acquisition module configured to: under extreme weather, acquiring a traffic sign map to be identified; the extreme weather refers to rain and snow weather or haze weather;
a preprocessing module configured to: aiming at rain and snow weather, carrying out rain and snow removal pretreatment on the traffic sign map to be identified; aiming at haze weather, carrying out defogging pretreatment on the traffic sign map to be identified;
an identification module configured to: and identifying the preprocessed image by adopting a trained traffic sign identification model to obtain an identified traffic sign.
9. An electronic device, comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer-readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of any of the preceding claims 1-7.
10. A storage medium, characterized in that it non-transitorily stores computer-readable instructions, wherein the method of any one of claims 1-7 is performed when the computer-readable instructions are executed by a computer.
CN202310453945.8A 2023-04-20 2023-04-20 Method and system for identifying traffic sign by unmanned vehicle in extreme weather Pending CN116597411A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310453945.8A CN116597411A (en) 2023-04-20 2023-04-20 Method and system for identifying traffic sign by unmanned vehicle in extreme weather

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310453945.8A CN116597411A (en) 2023-04-20 2023-04-20 Method and system for identifying traffic sign by unmanned vehicle in extreme weather

Publications (1)

Publication Number Publication Date
CN116597411A true CN116597411A (en) 2023-08-15

Family

ID=87599906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310453945.8A Pending CN116597411A (en) 2023-04-20 2023-04-20 Method and system for identifying traffic sign by unmanned vehicle in extreme weather

Country Status (1)

Country Link
CN (1) CN116597411A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557922A (en) * 2023-10-19 2024-02-13 河北翔拓航空科技有限公司 Unmanned aerial vehicle aerial photographing target detection method for improving YOLOv8



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination