CN116977979A - Traffic sign recognition method, system, equipment and storage medium

Traffic sign recognition method, system, equipment and storage medium

Info

Publication number
CN116977979A
CN116977979A
Authority
CN
China
Prior art keywords
traffic sign
images
layer
frame image
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310927182.6A
Other languages
Chinese (zh)
Inventor
吴晓明
尹训嘉
刘祥志
裴加彬
邱文科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Shanke Intelligent Technology Co ltd
Qilu University of Technology
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Shanke Intelligent Technology Co ltd
Qilu University of Technology
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Shanke Intelligent Technology Co ltd, Qilu University of Technology, Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202310927182.6A
Publication of CN116977979A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of traffic signs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a traffic sign recognition method and system, wherein the method comprises the following steps: acquiring a video to be identified; inputting all frame images of the video to be identified into a trained traffic sign recognition model, and outputting a traffic sign recognition result. The trained traffic sign recognition model is used for: extracting features from all frame images of the video to be identified to obtain feature maps of all frame images; extracting traffic sign candidate regions from the feature maps of all frame images; performing feature fusion on the traffic sign candidate regions of each key frame image and the traffic sign candidate regions of its adjacent frame images to obtain fused traffic sign candidate regions; extracting feature maps from the fused traffic sign candidate regions to generate fused candidate-region feature maps; and classifying and regressing the fused candidate-region feature maps to obtain the traffic sign recognition result.

Description

Traffic sign recognition method, system, equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a traffic sign recognition method, system, device, and storage medium.
Background
The statements in this section merely relate to the background of the present disclosure and may not necessarily constitute prior art.
With the rapid development of artificial intelligence and computer technology, machine vision has become an important prerequisite for unmanned driving: it is what lets an automobile "see" the world in front of it, and traffic sign recognition is an important field within it. Traffic sign recognition for an unmanned vehicle acquires road scene images with a vehicle-mounted camera and recognizes the road signs and their semantics in the images; it is an important part of how an unmanned vehicle judges the current road indications, so automatic and accurate recognition of road signs by the vehicle has important research significance. In image processing, image data quality is one of the key influencing factors. However, during acquisition the vehicle-mounted camera is inevitably occluded to varying degrees by extreme weather or other unavoidable factors, so the quality of the acquired images often falls short of requirements, causing many difficulties for subsequent data analysis.
Video object detection has certain advantages over still-image object detection. Because the images in a video are continuous and there is an obvious contextual relationship between adjacent frames, when a target cannot be accurately detected from a single frame, detection in the current frame can be assisted by other frames that share a temporal and spatial context with it.
Chinese patent document CN116259032A discloses a road traffic sign detection and recognition algorithm based on an improved YOLOv5. It realizes bidirectional, top-down and bottom-up fusion of deep and shallow features and remarkably improves the detection performance of the network model, but it cannot cope with traffic sign recognition in complex scenes.
Chinese patent document CN116152777A discloses a YOLOv5-based method, system and storage medium for recognizing common traffic signs, with the advantages of high detection speed, high detection precision, small video-memory footprint and many recognizable target types. However, its recognition rate for traffic signs under partial occlusion of the unmanned vehicle-mounted camera is low, and it cannot fully cope with complex scenes.
A further prior disclosure describes a method for improving traffic sign recognition precision in extreme weather and environments. Based on the YOLOv5 object detection model, it integrates a focusing module, a cross-stage partial fusion module and a spatial pyramid pooling structure, so that for poorly lit traffic sign images it can better extract feature map information from local features, and the feature maps express the images more accurately. That patent, however, does not address recognizing traffic signs under occlusion.
In object detection, partial occlusion of the vehicle-mounted camera by extreme weather (raindrops, mud, frost, snow and the like) degrades the recognition accuracy of traffic signs, especially small-target traffic signs, so errors such as misjudgment by the unmanned vehicle often occur. Such partial occlusion of the unmanned vehicle-mounted camera interferes with the feature information extracted by the object detection model, making it difficult for the model to acquire correct and effective semantic information, and the recognition accuracy is therefore low.
Disclosure of Invention
To overcome the deficiencies of the prior art, the application provides a traffic sign recognition method, system, device and storage medium, which solve the technical problem that, in the prior art, traffic signs cannot be accurately recognized when the vehicle-mounted camera is partially occluded.
In one aspect, a traffic sign recognition method is provided;
a traffic sign recognition method, comprising:
acquiring a video to be identified;
dividing all frame images of the video to be identified into key frame images and adjacent frame images; the key frame image refers to a frame which is different from the scene of the previous frame image but the same as the scene of the next frame image; the adjacent frame images refer to an image between a current key frame image and a previous key frame image and an image between the current key frame image and a next key frame image;
inputting all frame images of the video to be identified into the trained traffic sign identification model, and outputting a traffic sign identification result; the trained traffic sign recognition model is used for: extracting features of all frame images of the video to be identified to obtain feature images of all frame images; extracting traffic sign candidate areas from the feature images of all the frame images; carrying out feature fusion on the traffic sign candidate region of the key frame image and the traffic sign candidate region of the adjacent frame image to obtain a fused traffic sign candidate region; extracting feature images of the fused traffic sign candidate areas to generate fused candidate area feature images; and classifying and regressing the fused candidate region feature images to obtain a traffic sign recognition result.
In another aspect, a traffic sign recognition system is provided;
a traffic sign recognition system, comprising:
an acquisition module configured to: acquiring a video to be identified;
a keyframe partitioning module configured to: dividing all frame images of the video to be identified into key frame images and adjacent frame images; the key frame image refers to a frame which is different from the scene of the previous frame image but the same as the scene of the next frame image; the adjacent frame images refer to an image between a current key frame image and a previous key frame image and an image between the current key frame image and a next key frame image;
a traffic sign recognition module configured to: inputting all frame images of the video to be identified into the trained traffic sign identification model, and outputting a traffic sign identification result; the trained traffic sign recognition model is used for: extracting features of all frame images of the video to be identified to obtain feature images of all frame images; extracting traffic sign candidate areas from the feature images of all the frame images; carrying out feature fusion on the traffic sign candidate region of the key frame image and the traffic sign candidate region of the adjacent frame image to obtain a fused traffic sign candidate region; extracting feature images of the fused traffic sign candidate areas to generate fused candidate area feature images; and classifying and regressing the fused candidate region feature images to obtain a traffic sign recognition result.
In still another aspect, there is provided an electronic device including:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer-readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of the first aspect described above.
In yet another aspect, a storage medium is provided that non-transitory stores computer readable instructions, wherein the instructions of the method of the first aspect are performed when the non-transitory computer readable instructions are executed by a computer.
In a further aspect, there is also provided a computer program product comprising a computer program for implementing the method of the first aspect described above when run on one or more processors.
One of the above technical solutions has the following advantages or beneficial effects:
Key frames are determined in the video stream of a pre-acquired video; then, over the time span between adjacent key frames, the relevance of the context information is exploited, and an RS loss function is introduced to replace the original classification loss function, which solves the problem of unbalanced traffic sign categories. To better compensate for the semantic information of occluded target images, a graph convolutional neural network operating on a correlation coefficient matrix is introduced, which improves the model recognition rate.
The application introduces a similarity-based attention mechanism into the network structure, constructs a context feature memory and supplements missing semantic information, thereby improving the recognition effect under occlusion. It introduces the RS Loss function in place of the original classification loss function, solving the problem of unbalanced traffic sign categories. It also introduces a graph convolutional neural network into the network structure which, based on a correlation coefficient matrix, compensates for the semantic information of occluded target images and improves the model recognition rate.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application.
FIG. 1 is an overall flow chart of a first embodiment of the present application;
FIG. 2 is a similarity-based attention module according to a first embodiment of the present application;
FIG. 3 is a flowchart illustrating an overall inspection process according to a first embodiment of the present application;
FIG. 4 (a) is a standard convolution layer of a first embodiment of the present disclosure;
FIG. 4 (b) is a deformable convolution layer of a first embodiment of the present disclosure;
FIG. 5 is a diagram illustrating an overall network architecture according to a first embodiment of the present application;
FIG. 6 is a Fused-MBConv convolution structure according to a first embodiment of the present application.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the present application. As used herein, unless the context clearly indicates otherwise, singular forms are intended to include the plural forms as well. Furthermore, the terms "comprises" and "comprising" and any variations thereof are intended to cover non-exclusive inclusion; for example, processes, methods, systems, products or devices that comprise a series of steps or units are not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such processes, methods, products or devices.
Example 1
The embodiment provides a traffic sign recognition method;
as shown in fig. 1 and 3, a traffic sign recognition method includes:
s101: acquiring a video to be identified;
s102: dividing all frame images of the video to be identified into key frame images and adjacent frame images; the key frame image refers to a frame which is different from the scene of the previous frame image but the same as the scene of the next frame image; the adjacent frame images refer to an image between a current key frame image and a previous key frame image and an image between the current key frame image and a next key frame image;
s103: inputting all frame images of the video to be identified into the trained traffic sign identification model, and outputting a traffic sign identification result; the trained traffic sign recognition model is used for:
extracting features of all frame images of the video to be identified to obtain feature images of all frame images;
extracting traffic sign candidate areas from the feature images of all the frame images;
carrying out feature fusion on the traffic sign candidate region of the key frame image and the traffic sign candidate region of the adjacent frame image to obtain a fused traffic sign candidate region;
extracting feature images of the fused traffic sign candidate areas to generate fused candidate area feature images;
and classifying and regressing the fused candidate region feature images to obtain a traffic sign recognition result.
Further, in step S101, the video to be identified is acquired by capturing video of traffic sign images with a vehicle-mounted camera.
Further, as shown in fig. 5, S103: inputting all frame images of the video to be identified into the trained traffic sign identification model, and outputting a traffic sign identification result; wherein, the traffic sign recognition model after training, the network structure includes:
the backbone network and the candidate area generating network are connected in sequence;
the output end of the candidate region generation network is connected with the input end of the similar attention module through a first branch, and a region-of-interest pooling layer (RoI Pooling) is arranged on the first branch; the input end of the RoI Pooling layer is connected with the output end of the candidate region generation network, and the output end of the RoI Pooling layer is connected with the input end of the similar attention module;
the output end of the candidate region generation network is connected with the input end of the similar attention module through a second branch;
the output end of the similar attention module is connected with the input end of the graph convolution neural network;
the output end of the graph convolution neural network is connected with the input end of the first full-connection layer;
and the output end of the first full-connection layer is respectively connected with the regressor and the classifier.
Further, as shown in fig. 6, the backbone network includes: the first 3×3 convolution layer, the SE layer, the first 1×1 convolution layer, the adder, the activation function layer and the pooling layer, which are sequentially connected;
the input of the first 3×3 convolution layer is also connected to the input of the adder.
The first 3×3 convolution layer is implemented using a deformable convolution.
The SE (Squeeze-and-Excitation) module is a module for enhancing the feature representation capability of a convolutional neural network (CNN). It compresses the feature map of each channel into a scalar through a global average pooling operation to obtain the global statistics of that channel; a channel weight vector is then generated by two fully connected layers (FC layers) and used to weight the feature map of each channel.
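For concreteness, the following is a minimal PyTorch sketch of such an SE block. The reduction ratio of 4 and the ReLU/sigmoid activations are common choices assumed here, not values fixed by this application.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pooling -> two FC layers -> channel weights."""
    def __init__(self, channels: int, reduction: int = 4):  # reduction=4 is an assumed value
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))           # squeeze: per-channel global statistics, shape (B, C)
        w = torch.relu(self.fc1(s))      # excitation, first FC layer
        w = torch.sigmoid(self.fc2(w))   # excitation, second FC layer -> channel weight vector
        return x * w.view(b, c, 1, 1)    # reweight each channel's feature map
```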
It should be appreciated that this embodiment employs EfficientNetV2 as the backbone feature extraction network. The core module of EfficientNetV2 uses the Fused-MBConv convolution shown in FIG. 6, which replaces the 3×3 depthwise convolution and 1×1 expansion convolution of MBConv with a single conventional 3×3 convolution, giving fewer parameters and less computation, as shown in FIG. 5. The convolution kernels of the convolution layers in EfficientNet have a fixed size and shape, and such fixed kernels extract features poorly from targets whose shape varies; therefore, in view of the multi-scale nature of traffic signs and the difficulty of extracting their features, deformable convolution is introduced into the feature extraction network to enhance its adaptability. The backbone feature extraction network is improved by replacing part of the standard convolutions with deformable convolutions, strengthening the network's ability to extract multi-scale traffic signs.
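The Fused-MBConv block of FIG. 6 can be sketched as follows. The expansion ratio, BatchNorm placement and SiLU activation are assumptions following common EfficientNetV2 practice rather than details fixed by this application.

```python
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Fused-MBConv sketch: one regular 3x3 convolution replaces the separate
    3x3 depthwise + 1x1 expansion convolutions of MBConv. Expansion ratio,
    BatchNorm and SiLU are assumed, common EfficientNetV2 choices."""
    def __init__(self, channels: int, expansion: int = 4, se_reduction: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.conv3x3 = nn.Sequential(                 # fused 3x3 convolution
            nn.Conv2d(channels, hidden, 3, padding=1, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU())
        self.se = nn.Sequential(                      # SE layer (squeeze-and-excitation)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(hidden, hidden // se_reduction, 1), nn.SiLU(),
            nn.Conv2d(hidden // se_reduction, hidden, 1), nn.Sigmoid())
        self.conv1x1 = nn.Sequential(                 # 1x1 projection back to input width
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv3x3(x)
        out = out * self.se(out)       # channel reweighting by the SE layer
        out = self.conv1x1(out)
        return self.act(out + x)       # residual add through the adder, then activation
```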
As shown in fig. 4(a) and fig. 4(b), the deformable convolution adds an extra convolution layer to the standard convolution, acting on the input feature map to learn an offset for each sampling point; the learned offsets are added to the original convolution kernel so that the standard convolution window becomes an offset window, and the convolution operation is then performed on the input feature map at the shifted positions. The deformable convolution enhances the network's ability to extract small-target traffic signs.
The deformable convolution that the application introduces into the backbone network thus takes a feature map as input, learns the offset of each sampling point through the added convolution layer, applies the learned offsets to the standard convolution so that the standard convolution window becomes an offset window, and then performs the deformable convolution operation on the input feature map. This effectively enhances the network's ability to extract small-target traffic signs.
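A sketch of such a deformable convolution layer, using the DeformConv2d operator from torchvision, is given below; the kernel size and the zero-initialised offset branch are assumptions, not details fixed by this application.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvLayer(nn.Module):
    """An extra convolution predicts a 2-D offset for every kernel sampling point;
    the deformable convolution then samples the input at the shifted positions."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # 2 * k * k offset channels: one (dx, dy) pair per kernel sampling point
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        nn.init.zeros_(self.offset_conv.weight)   # start out as a standard convolution
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)         # learned offset for each sampling point
        return self.deform_conv(x, offsets)   # convolution over the offset window
```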
Further, as shown in fig. 5, the candidate area generating network includes:
a second 3×3 convolution layer, a second 1×1 convolution layer, a first Reshape layer, an activation function layer, a second Reshape layer and a Proposal layer, which are connected in sequence;
the output of the second 3×3 convolution layer is also connected to the input of the second Reshape layer by a third 1×1 convolution layer.
It should be appreciated that the Reshape layer, a common layer in neural networks, has the primary function of reshaping the input tensor to a specified shape. The number of elements of its input and output tensors remains unchanged, but the shape may vary.
It should be understood that the Proposal layer has the main function of generating a series of candidate boxes (bounding boxes) from the input feature map for use in the subsequent target classification and location regression networks.
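The flow described above (3×3 convolution, 1×1 convolutions, Reshape, softmax activation, Reshape, then the Proposal layer) can be sketched as follows. The channel width and anchor count are assumptions, and the Proposal layer itself (anchor decoding and non-maximum suppression) is omitted.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal region-proposal head; 256 channels and 9 anchors per location
    are assumed values, not ones fixed by this application."""
    def __init__(self, in_ch: int = 256, num_anchors: int = 9):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1)   # second 3x3 conv layer
        self.cls = nn.Conv2d(in_ch, num_anchors * 2, kernel_size=1)     # objectness logits
        self.reg = nn.Conv2d(in_ch, num_anchors * 4, kernel_size=1)     # third 1x1 conv: box deltas

    def forward(self, feat: torch.Tensor):
        b, _, h, w = feat.shape
        x = torch.relu(self.conv(feat))
        logits = self.cls(x)
        # Reshape -> softmax over the two (object / background) classes -> reshape back
        scores = logits.view(b, 2, -1, w)       # first Reshape layer
        scores = torch.softmax(scores, dim=1)   # activation function layer
        scores = scores.view(b, -1, h, w)       # second Reshape layer
        deltas = self.reg(x)                    # parallel regression branch
        return scores, deltas                   # both are consumed by the Proposal layer
```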
Further, as shown in fig. 2, the similar attention module includes:
three sub-branches in parallel: a first sub-branch, a second sub-branch, and a third sub-branch;
the first sub-branch comprises: the pooling layer P1, the second full-connection layer and the normalization layer G1 are sequentially connected;
the second sub-branch comprises: a third full-connection layer and a normalization layer G2 which are connected in sequence;
the third sub-branch comprises: a fourth full connection layer;
the input end of the pooling layer P1 is connected with the output end of the pooling layer of the region of interest;
the input end of the third full-connection layer and the input end of the fourth full-connection layer are connected with the output end of the candidate region generation network;
the output end of the normalization layer G1 and the output end of the normalization layer G2 are connected with the input end of the activation function layer S1; the output end of the activation function layer S1 is connected with the input end of the fifth full-connection layer;
the output end of the fourth full-connection layer is connected with the input end of the fifth full-connection layer;
the output of the fifth fully-connected layer is the output of the similar attention module.
Further, as shown in fig. 5, S103: inputting all frame images of the video to be identified into the trained traffic sign identification model, and outputting a traffic sign identification result; the training process of the traffic sign recognition model after training comprises the following steps:
constructing a training set, wherein the training set is a traffic sign video of a known traffic sign recognition result;
and inputting the training set into the traffic sign recognition model, training the model, and stopping training when the loss function value of the model is not reduced any more or the iteration number exceeds the set number, so as to obtain the trained traffic sign recognition model.
Further, the feature extraction is performed on all the frame images of the video to be identified to obtain feature graphs of all the frame images, including:
and carrying out feature extraction on all frame images of the video to be identified based on the backbone network to obtain feature images of all frame images.
Further, the extracting the traffic sign candidate frame and the traffic sign candidate region from the feature map of all the frame images includes:
and generating a network based on the candidate region, and extracting traffic sign candidate frames and traffic sign candidate regions from the feature images of all the frame images.
Further, the feature fusion is performed on the traffic sign candidate region of the key frame image and the traffic sign candidate region of the adjacent frame image to obtain a fused traffic sign candidate region, which comprises the following steps:
and based on the similar attention module, carrying out feature fusion on the traffic sign candidate region of the key frame image and the traffic sign candidate region of the adjacent frame image to obtain a fused traffic sign candidate region.
Further, the feature map extracting for the fused traffic sign candidate region, generating a fused candidate region feature map, includes:
and extracting feature images of the fused traffic sign candidate areas based on the graph convolution neural network, and generating fused candidate area feature images.
Further, classifying and regressing the traffic sign candidate areas of the fused candidate area feature map and the key frame to obtain a traffic sign recognition result, including:
and classifying and regressing the traffic sign candidate areas of the fused candidate area feature images and the key frames based on the regressing device and the classifier to obtain a traffic sign recognition result.
The steps of this example are mainly as follows. First, the position range of the traffic sign target in the current video frame image is acquired through the vehicle-mounted camera, and a key frame x is determined in the video, where a key frame is a frame whose scene differs from that of the previous frame but matches that of the next frame. A context storage space is then constructed over the time span between two adjacent key frames.
Key frames are set first: the first key frame of the video sequence is selected as the initial key frame, and a number of adjacent frames are determined from the context memory space of the time span corresponding to the designated frame.
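As a sketch of this key-frame rule (a frame whose scene differs from the previous frame but matches the next), one could compare grey-level histograms of consecutive frames. The histogram measure and threshold below are illustrative assumptions; the application does not fix a particular scene-similarity measure.

```python
import numpy as np

def select_key_frames(frames, threshold: float = 0.3):
    """frames: sequence of greyscale uint8 arrays. Marks frame i as a key frame
    when its scene differs from frame i-1 but matches frame i+1.
    Histogram distance and threshold are assumptions."""
    def hist(f):
        h, _ = np.histogram(f, bins=64, range=(0, 255))
        return h / max(h.sum(), 1)

    def differs(a, b):
        return 0.5 * np.abs(hist(a) - hist(b)).sum() > threshold

    keys = []
    for i in range(1, len(frames) - 1):
        if differs(frames[i], frames[i - 1]) and not differs(frames[i], frames[i + 1]):
            keys.append(i)
    return keys
```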
Because traffic sign images captured while the lens of the vehicle-mounted camera is partially occluded lack some critical semantic information, a GCN module is added to the base network for feature map extraction in order to obtain a better detection effect. In the GCN module, the feature map is converted into a graph structure, and feature extraction and graph analysis are performed using a graph convolutional network.
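A single graph-convolution layer of the kind used in such a GCN module can be sketched as follows, with the correlation coefficient matrix serving as the graph adjacency. This is the standard GCN propagation rule (Kipf and Welling), not code from the application.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (N, in_dim) node features; adj: (N, N) correlation/adjacency matrix
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)   # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        a_norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(a_norm @ self.weight(h))                # propagate and transform
```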
In addition, when certain tasks become unbalanced during multi-task training, such as an imbalance of positive and negative samples in the classification task, additional hyperparameters are generated whose fine-tuning consumes computation time and can lead to suboptimal results. Aiming at this data imbalance, the embodiment optimizes the classification loss function by introducing an RS Loss function in place of the original classification loss function, which solves the problem of unbalanced traffic sign categories and improves the model recognition rate.
When the vehicle-mounted camera collects video streams under partial occlusion and targets are detected in those streams, the low recognition rate caused by occlusion of the camera is addressed by exploiting the fact that the targets are static while the camera moves continuously: methods that complete features from the context relationship and predict features with the graph convolutional neural network are fused into the detection model, improving the detection precision and enabling the unmanned vehicle to analyze and judge better.
The loss function consists of a classification loss and a bounding-box regression loss. As noted above, task imbalance during multi-task training, such as an imbalance of positive and negative samples in the classification task, introduces additional hyperparameters whose tuning consumes computation time and can lead to suboptimal results. For this data imbalance problem, Rank & Sort Loss (RS Loss) is introduced into the loss function: when computing the loss, RS Loss ranks positive samples above negative samples and sorts the positives among themselves according to their IoU values. Owing to this ranking property, RS Loss can handle unbalanced data and simplifies model training. Here the RS Loss function replaces the original Faster R-CNN classification loss function, solving the problem of unbalanced traffic sign categories and improving the model recognition rate.
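The following is a much-simplified illustration of the rank-and-sort idea only (positives ranked above negatives, positives sorted among themselves by IoU); it is not the full RS Loss formulation, whose error-driven weighting is considerably more involved.

```python
import torch

def simplified_rank_sort_loss(scores, labels, ious, delta: float = 0.5):
    """Illustration only, not the published RS Loss. scores/labels/ious are
    1-D tensors over detections; labels are 1 (positive) or 0 (negative)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    iou_pos = ious[labels == 1]
    # Rank term: penalise negatives scored above (or within a margin of) positives
    rank = torch.relu(neg.unsqueeze(0) - pos.unsqueeze(1) + delta).mean()
    # Sort term: penalise positive pairs whose score order contradicts their IoU order
    score_diff = pos.unsqueeze(1) - pos.unsqueeze(0)
    iou_diff = iou_pos.unsqueeze(1) - iou_pos.unsqueeze(0)
    sort = torch.relu(-score_diff * torch.sign(iou_diff)).mean()
    return rank + sort
```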
Key frames are determined in the video stream of a pre-acquired video; then, over the time span between adjacent key frames, the relevance of the context information is exploited and a context memory M is constructed; deformable convolution replaces part of the standard convolutions to enhance feature extraction; and RS Loss is introduced to replace the original classification loss function, solving the problem of unbalanced traffic sign categories and improving the model recognition rate. A multi-label classification model based on a graph convolutional network (GCN) is introduced to learn the interdependencies among labels, which improves the accuracy of the region proposals and thus the performance of the whole object detection model, addressing the low recognition rate of traffic signs under partial occlusion of the unmanned vehicle-mounted camera.
Example two
The embodiment provides a traffic sign recognition system;
a traffic sign recognition system, comprising:
an acquisition module configured to: acquiring a video to be identified;
a keyframe partitioning module configured to: dividing all frame images of the video to be identified into key frame images and adjacent frame images; the key frame image refers to a frame which is different from the scene of the previous frame image but the same as the scene of the next frame image; the adjacent frame images refer to an image between a current key frame image and a previous key frame image and an image between the current key frame image and a next key frame image;
a traffic sign recognition module configured to: inputting all frame images of the video to be identified into the trained traffic sign identification model, and outputting a traffic sign identification result; the trained traffic sign recognition model is used for: extracting features of all frame images of the video to be identified to obtain feature images of all frame images; extracting traffic sign candidate areas from the feature images of all the frame images; carrying out feature fusion on the traffic sign candidate region of the key frame image and the traffic sign candidate region of the adjacent frame image to obtain a fused traffic sign candidate region; extracting feature images of the fused traffic sign candidate areas to generate fused candidate area feature images; and classifying and regressing the fused candidate region feature images to obtain a traffic sign recognition result.
It should be noted that the acquisition module, key frame dividing module and traffic sign recognition module described above correspond to steps S101 to S103 of the first embodiment; the examples and application scenarios they implement are the same as those of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the modules described above may be implemented, as part of a system, in a computer system such as a set of computer-executable instructions.
The foregoing embodiments are directed to various embodiments, and details of one embodiment may be found in the related description of another embodiment.
The proposed system may be implemented in other ways. For example, the system embodiment described above is merely illustrative: the division into the modules described above is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple modules may be combined or integrated into another system, or some features may be omitted or not performed.
Example III
The embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein the processor is coupled to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software.
The method of the first embodiment may be embodied directly in a hardware processor for execution, or executed by a combination of hardware and software modules within the processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and performs the steps of the above method in combination with its hardware. To avoid repetition, a detailed description is not provided here.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Example IV
The present embodiment also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the method of embodiment one.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A traffic sign recognition method, comprising:
acquiring a video to be identified;
dividing all frame images of the video to be identified into key frame images and adjacent frame images; the key frame image refers to a frame which is different from the scene of the previous frame image but the same as the scene of the next frame image; the adjacent frame images refer to an image between a current key frame image and a previous key frame image and an image between the current key frame image and a next key frame image;
inputting all frame images of the video to be identified into the trained traffic sign identification model, and outputting a traffic sign identification result; the trained traffic sign recognition model is used for: extracting features of all frame images of the video to be identified to obtain feature images of all frame images; extracting traffic sign candidate areas from the feature images of all the frame images; carrying out feature fusion on the traffic sign candidate region of the key frame image and the traffic sign candidate region of the adjacent frame image to obtain a fused traffic sign candidate region; extracting feature images of the fused traffic sign candidate areas to generate fused candidate area feature images; and classifying and regressing the fused candidate region feature images to obtain a traffic sign recognition result.
2. The traffic sign recognition method according to claim 1, wherein all frame images of the video to be recognized are input into the trained traffic sign recognition model, and the traffic sign recognition result is output; wherein, the traffic sign recognition model after training, the network structure includes:
the backbone network and the candidate area generating network are connected in sequence;
the output end of the candidate region generation network is connected with the input end of the similar attention module through a first branch, and a region of interest pooling layer is arranged on the first branch; the input end of the region of interest Pooling layer RoI Pooling is connected with the output end of the candidate region generation network, and the output end of the region of interest Pooling layer RoI Pooling is connected with the input end of the similar attention module;
the output end of the candidate region generation network is connected with the input end of the similar attention module through a second branch;
the output end of the similar attention module is connected with the input end of the graph convolution neural network;
the output end of the graph convolution neural network is connected with the input end of the first full-connection layer;
and the output end of the first full-connection layer is respectively connected with the regressor and the classifier.
3. The traffic sign recognition method of claim 2, wherein the backbone network comprises: the first 3×3 convolution layer, the SE layer, the first 1×1 convolution layer, the adder, the activation function layer and the pooling layer, which are sequentially connected; the input end of the first 3×3 convolution layer is also connected with the input end of the adder; the first 3×3 convolution layer is implemented using a deformable convolution.
4. The traffic sign recognition method according to claim 2, wherein the candidate area generation network comprises: a second 3×3 convolution layer, a second 1×1 convolution layer, a first Reshape layer, an activation function layer, a second Reshape layer and a Proposal layer, which are connected in sequence; the output of the second 3×3 convolution layer is also connected to the input of the second Reshape layer by a third 1×1 convolution layer.
5. The traffic sign recognition method according to claim 2, wherein the similar attention module comprises: three sub-branches in parallel: a first sub-branch, a second sub-branch, and a third sub-branch;
the first sub-branch comprises: the pooling layer P1, the second full-connection layer and the normalization layer G1 are sequentially connected; the second sub-branch comprises: a third full-connection layer and a normalization layer G2 which are connected in sequence; the third sub-branch comprises: a fourth full connection layer;
the input end of the pooling layer P1 is connected with the output end of the pooling layer of the region of interest;
the input end of the third full-connection layer and the input end of the fourth full-connection layer are connected with the output end of the candidate region generation network;
the output end of the normalization layer G1 and the output end of the normalization layer G2 are connected with the input end of the activation function layer S1; the output end of the activation function layer S1 is connected with the input end of the fifth full-connection layer;
the output end of the fourth full-connection layer is connected with the input end of the fifth full-connection layer;
the output of the fifth fully-connected layer is the output of the similar attention module.
6. The traffic sign recognition method according to claim 1, wherein all frame images of the video to be recognized are input into the trained traffic sign recognition model, and the traffic sign recognition result is output; the training process of the traffic sign recognition model after training comprises the following steps:
constructing a training set, wherein the training set is a traffic sign video of a known traffic sign recognition result;
and inputting the training set into the traffic sign recognition model, training the model, and stopping training when the loss function value of the model is not reduced any more or the iteration number exceeds the set number, so as to obtain the trained traffic sign recognition model.
7. The traffic sign recognition method according to claim 2, wherein the feature extraction is performed on all frame images of the video to be recognized to obtain feature images of all frame images, and the method comprises:
based on a backbone network, extracting features of all frame images of the video to be identified to obtain feature images of all frame images;
and extracting traffic sign candidate frames and traffic sign candidate areas from the feature images of all the frame images, wherein the method comprises the following steps of:
generating a network based on the candidate region, and extracting traffic sign candidate frames and traffic sign candidate regions from feature images of all frame images;
the feature fusion is carried out on the traffic sign candidate region of the key frame image and the traffic sign candidate region of the adjacent frame image to obtain a fused traffic sign candidate region, which comprises the following steps:
based on the similar attention module, carrying out feature fusion on the traffic sign candidate region of the key frame image and the traffic sign candidate region of the adjacent frame image to obtain a fused traffic sign candidate region;
extracting the feature map of the fused traffic sign candidate region to generate a fused candidate region feature map, which comprises the following steps:
based on the graph convolution neural network, extracting feature graphs of the fused traffic sign candidate areas to generate fused candidate area feature graphs;
classifying and regressing the traffic sign candidate areas of the fused candidate area feature images and the key frames to obtain a traffic sign recognition result, wherein the method comprises the following steps:
and classifying and regressing the traffic sign candidate areas of the fused candidate area feature images and the key frames based on the regressing device and the classifier to obtain a traffic sign recognition result.
8. A traffic sign recognition system, comprising:
an acquisition module configured to: acquiring a video to be identified;
a keyframe partitioning module configured to: dividing all frame images of the video to be identified into key frame images and adjacent frame images; the key frame image refers to a frame which is different from the scene of the previous frame image but the same as the scene of the next frame image; the adjacent frame images refer to an image between a current key frame image and a previous key frame image and an image between the current key frame image and a next key frame image;
a traffic sign recognition module configured to: inputting all frame images of the video to be identified into the trained traffic sign identification model, and outputting a traffic sign identification result; the trained traffic sign recognition model is used for: extracting features of all frame images of the video to be identified to obtain feature images of all frame images; extracting traffic sign candidate areas from the feature images of all the frame images; carrying out feature fusion on the traffic sign candidate region of the key frame image and the traffic sign candidate region of the adjacent frame image to obtain a fused traffic sign candidate region; extracting feature images of the fused traffic sign candidate areas to generate fused candidate area feature images; and classifying and regressing the fused candidate region feature images to obtain a traffic sign recognition result.
9. An electronic device, comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer-readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of any of the preceding claims 1-7.
10. A storage medium, characterized by non-transitory storing computer-readable instructions, wherein the instructions of the method of any one of claims 1-7 are performed when the non-transitory computer-readable instructions are executed by a computer.
CN202310927182.6A 2023-07-26 2023-07-26 Traffic sign recognition method, system, equipment and storage medium Pending CN116977979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310927182.6A CN116977979A (en) 2023-07-26 2023-07-26 Traffic sign recognition method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310927182.6A CN116977979A (en) 2023-07-26 2023-07-26 Traffic sign recognition method, system, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116977979A (en) 2023-10-31

Family

ID=88470784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310927182.6A Pending CN116977979A (en) 2023-07-26 2023-07-26 Traffic sign recognition method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116977979A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117523535A (en) * 2024-01-08 2024-02-06 浙江零跑科技股份有限公司 Traffic sign recognition method, terminal equipment and storage medium
CN117523535B (en) * 2024-01-08 2024-04-12 浙江零跑科技股份有限公司 Traffic sign recognition method, terminal equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination