CN115273032A - Traffic sign recognition method, apparatus, device and medium

Info

Publication number: CN115273032A
Application number: CN202210919319.9A
Authority: CN
Inventors: 李德辉
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-07-29
Filing date: 2022-07-29
Publication date: 2022-11-01

Abstract

The application discloses a traffic sign recognition method, apparatus, device and medium. The method comprises: performing traffic sign body detection on a traffic image to be detected to obtain a traffic body area image corresponding to the traffic image to be detected; determining N anchor frames in the traffic body area image, and performing traffic sign detection on the area corresponding to each anchor frame to obtain a traffic sign area image corresponding to the traffic body area image; and performing sign type detection on the traffic sign area image to obtain type information of the traffic sign in the traffic image to be detected, and obtaining a traffic sign recognition result of the traffic image to be detected based on the type information. With this technical scheme, the image features of each traffic sign area in the traffic body area image can be extracted at a finer granularity, and by performing sign type detection on the traffic sign area image, the traffic sign recognition result of the traffic image to be detected is determined from finer and more comprehensive features, which improves the accuracy of the recognition result.

Description

Traffic sign recognition method, apparatus, device and medium

Technical Field

The invention relates to the technical field of intelligent traffic, in particular to a traffic sign identification method, a device, equipment and a medium.

Background

With the continuous development of scientific technology, intelligent transportation technologies such as automatic driving, assisted driving, intelligent navigation and the like are increasingly applied to the daily life of users. In the application of intelligent traffic technology, identification of traffic signs such as traffic lights is important in order to accurately perform intelligent navigation on users and ensure safety of automatic driving and auxiliary driving.

At present, in the related art, an image shot in a driving process can be input into a classification model, the classification model firstly detects the position of a traffic light in the image, cuts the image to obtain a detection frame area, and then classifies and identifies the detection frame area to obtain an identification result of the traffic light.

However, a lamp body usually carries a plurality of traffic lights, and when several of them are lit at the same time the semantics they express are often complicated. At present, the semantics of traffic lights cannot be exhaustively enumerated, so a model trained on a limited set of semantics performs poorly, cannot identify the multi-attribute semantics of a lamp body in practical applications, and recognizes traffic lights with low accuracy.

Disclosure of Invention

In view of the above defects or shortcomings in the prior art, it is desirable to provide a traffic sign recognition method, apparatus, device and medium that can extract the image features of each traffic sign area in a traffic body area image at a finer granularity, thereby improving the accuracy of the traffic sign recognition result of a traffic image to be detected. The technical scheme is as follows:

according to one aspect of the application, a traffic sign recognition method is provided, the method comprising:

detecting a traffic sign body of a traffic image to be detected to obtain a traffic body area image corresponding to the traffic image to be detected;

determining N anchor frames in the traffic body area image, and performing traffic sign detection on the area corresponding to each anchor frame to obtain a traffic sign area image corresponding to the traffic body area image; the size and the position corresponding to each of the N anchor frames are different, and N is an integer greater than or equal to 1;

and performing sign type detection on the traffic sign area image to obtain the type information of the traffic sign in the traffic image to be detected, and obtaining the traffic sign recognition result of the traffic image to be detected based on the type information.

According to another aspect of the present application, there is provided a traffic sign recognition apparatus, including:

the traffic sign detection module is used for detecting a traffic sign body of a traffic image to be detected to obtain a traffic body area image corresponding to the traffic image to be detected;

the sign area detection module is used for determining N anchor frames in the traffic body area image, and performing traffic sign detection on the area corresponding to each anchor frame to obtain the traffic sign area image corresponding to the traffic body area image; the size and the position corresponding to each of the N anchor frames are different, and N is an integer greater than or equal to 1;

and the identification module is used for carrying out mark type detection on the traffic mark area image to obtain the type information of the traffic mark in the traffic image to be detected and obtaining the traffic mark identification result of the traffic image to be detected based on the type information.

According to another aspect of the present application, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of identifying a traffic sign as described above when executing the program.

According to another aspect of the present application, there is provided a computer-readable storage medium having stored thereon a computer program for implementing the traffic sign recognition method as described above.

According to another aspect of the present application, there is provided a computer program product comprising instructions thereon which, when executed, implement a traffic sign recognition method as described above.

In the traffic sign recognition method, apparatus, device and medium provided in the embodiments of the application, traffic sign body detection is performed on the traffic image to be detected to obtain the corresponding traffic body area image; N anchor frames are then determined in the traffic body area image, and traffic sign detection is performed on the area corresponding to each anchor frame to obtain the traffic sign area image corresponding to the traffic body area image; sign type detection is performed on the traffic sign area image to obtain the type information of the traffic sign in the traffic image to be detected, and the traffic sign recognition result of the traffic image to be detected is obtained based on the type information. Compared with the prior art, on the one hand, after the guidance information of the traffic body area image is obtained, determining N anchor frames in the traffic body area image allows traffic sign detection to be performed more finely on the area corresponding to each anchor frame, so that the image features of each traffic sign area in the traffic body area image are extracted at a finer granularity, the traffic signs in the image can be recognized from more detailed features, and the recognition accuracy is effectively improved. On the other hand, by performing sign type detection on the traffic sign area image, the traffic sign recognition result of the traffic image to be detected is determined from finer and more comprehensive features, which improves the recognition accuracy compared with the prior art.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

fig. 1 is a schematic diagram illustrating a comparative structure between a sample obtained by enhancement and an actual sample in the related art provided in an embodiment of the present application;

fig. 2 is a system architecture diagram of an application system for traffic sign recognition provided in an embodiment of the present application;

fig. 3 is a schematic flowchart of a traffic sign identification method according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of traffic sign recognition provided in an embodiment of the present application;

fig. 5 is a schematic flowchart of a method for obtaining a traffic sign area according to an embodiment of the present application;

fig. 6 is a schematic structural diagram of obtaining an output parameter through a detection network according to an embodiment of the present application;

fig. 7 is a schematic internal structural diagram of a detection network according to another embodiment of the present application;

fig. 8 is a schematic structural diagram illustrating preprocessing of an image of a traffic sign body area according to an embodiment of the present disclosure;

fig. 9 is a schematic flowchart of a method for training a detection network according to an embodiment of the present application;

FIG. 10 is a schematic flowchart of a method for obtaining a traffic sign recognition result according to an embodiment of the present disclosure;

fig. 11 is a schematic structural diagram of preprocessing an image of a traffic sign area according to an embodiment of the present disclosure;

fig. 12 is a schematic diagram of an internal structure of a classification network according to an embodiment of the present application;

fig. 13 is a schematic flow chart of a traffic sign identification method according to an embodiment of the present application;

fig. 14 is a schematic structural diagram of a traffic sign identification method according to an embodiment of the present application;

fig. 15 is a schematic structural diagram of an area for acquiring a traffic sign according to an embodiment of the present application;

FIG. 16 is a schematic structural diagram of obtaining a traffic sign recognition result according to an embodiment of the present application;

FIG. 17 is a schematic structural diagram of a traffic sign recognition device according to an embodiment of the present application;

fig. 18 is a schematic structural diagram of a traffic sign recognition apparatus according to an embodiment of the present application;

fig. 19 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.

It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings. For convenience of understanding, some technical terms related to the embodiments of the present application are explained below:

(1) Artificial Intelligence (AI): a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

(2) Machine Learning (ML): a multi-disciplinary field that involves probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specifically studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in every field of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

(3) Convolutional Neural Network (CNN): a class of feed-forward neural network that involves convolution computation and has a deep structure, and one of the representative algorithms of deep learning. A convolutional neural network is capable of representation learning and can perform shift-invariant classification of input information according to its hierarchical structure.

(4) Traffic signs: facilities that convey guidance, restriction, warning or instruction information by means of text or symbols, also known as road signs or road traffic signs. Traffic signs are of multiple types and can be distinguished in various ways: primary and secondary signs; movable signs and fixed signs; illuminated, light-emitting and light-reflecting signs; and variable information signs that reflect changes in the driving environment, such as speed limit boards and traffic lights on roads.

(5) Traffic signal lights: generally composed of red, green and yellow lamps, where the red lamp means stop, the green lamp means go, and the yellow lamp means warning. They can be divided into: motor vehicle signal lamps, non-motor vehicle signal lamps, pedestrian crossing signal lamps, direction indicating signal lamps, lane signal lamps, flashing warning signal lamps, and road-railway level crossing signal lamps.

With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.

The scheme provided by the embodiment of the application relates to the technology of artificial intelligence neural networks and the like, and is specifically explained by the following embodiment.

At present, the related art may adopt a two-stage method: the position of a traffic light is detected, the detection frame area is cropped out, and the cropped area is classified to obtain the recognition result of the traffic light, with each traffic light body regarded as one classification category. Because each lamp body is treated as a single category, it corresponds to only one semantic meaning, so when several lamps on one lamp body are lit at the same time (multiple lit), the multi-attribute semantics of the lamp body cannot be recognized. In addition, if a combination of lit lamps is defined as a single category (for example, when a left-turn red lamp and a straight red lamp are lit simultaneously on one lamp body, the category of the lamp body is defined as "left-turn red lamp and straight red lamp"), the number of categories obtained by combining the different shapes, colors, numbers and types of traffic lamps is very large and the categories cannot be completely listed, so the model performs poorly and the accuracy of traffic light recognition is low. Meanwhile, because samples of rare lamp types are too few, they have to be obtained by enhancement processing of existing samples when a data set is built, but the enhanced samples do not always match the actual samples. Referring to fig. 1, for example, a vertical right-turn red lamp can be obtained by flipping a vertical left-turn red lamp, but for a horizontal right-turn red lamp, the samples obtained by enhancement (flipping or rotation) do not match the actual horizontal right-turn red lamp, so the number of horizontal right-turn samples remains insufficient and the recognition accuracy of the trained model is low.
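The growth in the number of combined categories can be illustrated with a small calculation. The following is a rough sketch only: the color list, the shape list and the number of lamps per lamp body are hypothetical values chosen for illustration and are not taken from the application.

```python
# Hypothetical attribute sets, chosen only to illustrate the combinatorics.
colors = ["red", "yellow", "green"]
shapes = ["round", "left", "straight", "right", "u-turn"]
lamps_per_body = 3  # assumed number of lamp positions on one lamp body

# Lit states of a single lamp: one color combined with one shape.
per_lamp_states = len(colors) * len(shapes)

# If the whole lamp body is one class, every lamp position can be
# "off" or in any lit state, and the positions combine multiplicatively.
joint_classes = (per_lamp_states + 1) ** lamps_per_body

print(per_lamp_states)  # classes needed when each lamp is classified on its own
print(joint_classes)    # classes needed when the lamp body is one category
```

Under these assumed numbers, classifying each lamp separately needs far fewer categories than treating every combination on a lamp body as its own category, which is the motivation for detecting individual sign areas inside the lamp body.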

Based on the above defects, the present application provides a method, an apparatus, a device, and a medium for identifying a traffic sign, which, compared with the prior art, can extract image features of each traffic sign region in a traffic body region image at a finer granularity by performing traffic sign detection after obtaining guidance information of the traffic body region image, and can determine a traffic sign identification result of a traffic image to be detected by combining finer and more comprehensive features by performing sign type detection on the traffic sign region image, thereby improving accuracy of the traffic sign identification result.
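The overall flow described above (body detection, then sign detection inside each body area, then type detection on each sign area) can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function names `crop` and `recognize`, the box convention `(x0, y0, x1, y1)`, and the three callables passed in are all assumptions introduced here, and the image is a toy nested list rather than a real picture.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # assumed convention: (x0, y0, x1, y1) in pixels
Image = List[List[int]]           # toy grayscale image as nested lists


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def recognize(
    image: Image,
    detect_bodies: Callable[[Image], List[Box]],  # hypothetical body detector
    detect_signs: Callable[[Image], List[Box]],   # hypothetical sign detector
    classify: Callable[[Image], str],             # hypothetical type classifier
) -> List[Tuple[Box, List[str]]]:
    """Three-stage sketch: body area -> sign areas -> sign types."""
    results = []
    for body_box in detect_bodies(image):
        body_img = crop(image, body_box)
        # Sign boxes are expressed relative to the cropped body image.
        types = [classify(crop(body_img, sign_box))
                 for sign_box in detect_signs(body_img)]
        results.append((body_box, types))
    return results
```

Because each sign area is classified on its own, a lamp body with several lit lamps yields a list of types instead of a single combined category.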

Fig. 2 is an implementation environment architecture diagram of a traffic sign identification method according to an embodiment of the present application. As shown in fig. 2, the implementation environment architecture includes: a terminal 10 and a server 20.

In the field of image recognition, the process of recognizing the traffic sign in the traffic image to be detected may be executed in the terminal 10 or the server 20. For example, the traffic image to be detected is acquired through the terminal 10, and image recognition can be performed locally on the terminal 10 to obtain a recognition result of the traffic sign to be recognized; or the traffic image to be detected may be sent to the server 20, so that the server 20 obtains the traffic image to be detected, performs image recognition on it to obtain a recognition result of the traffic sign to be recognized, and then sends the recognition result to the terminal 10, thereby recognizing the type of the traffic sign to be recognized in the traffic image to be detected.

The traffic sign identification scheme provided by the embodiment of the application can be applied to common automatic driving, vehicle navigation scenes, map data acquisition scenes, road data acquisition scenes, intelligent traffic scenes, auxiliary driving scenes and the like. In the application scenario, it is usually necessary to acquire a road live-action image, analyze the road live-action image to obtain information such as a recognition result of a traffic sign, and perform subsequent operations based on the information, such as map updating, travel route planning, and vehicle automatic driving control.

In addition, the terminal 10 may have an operating system running thereon, where the operating system may include, but is not limited to, an Android system, an iOS system, a Linux system, a Unix system, a Windows system, and the like, and may further include a User Interface (UI) layer, where the UI layer may display the traffic real-scene image and the recognition result of the traffic sign to be recognized, and may further send the traffic real-scene image required for image recognition to the server 20 based on an Application Programming Interface (API).

Alternatively, the terminal 10 may be a terminal device in various AI application scenarios. For example, the terminal 10 may be a notebook computer, a tablet computer, a desktop computer, a vehicle-mounted terminal, an intelligent voice interaction device, an intelligent household appliance, a mobile device, an aircraft, and the like, and the mobile device may be various types of terminals such as a smart phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device, and the like, which is not limited in this embodiment of the application.

The server 20 may be a single server, a server cluster or distributed system composed of a plurality of servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.

The terminal 10 and the server 20 establish a communication connection through a wired or wireless network. Optionally, the wireless network or wired network uses standard communication techniques and/or protocols. The network is typically the Internet, but may be any network including, but not limited to, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired or wireless network, a private network, a virtual private network, or any combination thereof.

For convenience of understanding and explanation, the traffic sign identification method, apparatus, device and medium provided by the embodiments of the present application are described in detail below with reference to fig. 3 to 19.

Fig. 3 is a flowchart illustrating a traffic sign recognition method according to an embodiment of the present application, where the method may be executed by a computer device, and the computer device may be the server 20 or the terminal 10 in the system shown in fig. 2, or the computer device may also be a combination of the terminal 10 and the server 20. As shown in fig. 3, the method includes:

S101, detecting a traffic sign body of the traffic image to be detected to obtain a traffic body area image corresponding to the traffic image to be detected.

The traffic image to be detected may be an image obtained by photographing a traffic scene that contains a traffic sign. It may include the traffic sign to be detected and may also include background information. The traffic sign is the main content of the image and identifies a warning condition present in the geographic environment. Traffic signs may include illuminated warning signs, light-emitting warning signs, reflective warning signs and the like, and may for example be traffic lights. The background information refers to the image information in the traffic image to be detected other than the traffic sign to be detected, for example vehicles, roads, poles, buildings, the sky, the ground, and trees.

The traffic image to be detected may be an image captured in various scenes, for example, an image captured in a scene containing a traffic sign on different roads, different weather, and different angles.

Illustratively, the traffic signs may include, for example, arrow type indication such as straight running, left turning, right turning, turning around, continuous downhill, etc., and may also include a bus pattern indicating a bus lane, a bicycle pattern indicating a non-motor lane, a car pattern indicating a motor lane, a step pattern indicating a street overpass, an underground passage, etc.

In the embodiment of the application, the traffic image to be detected may be obtained by calling an image acquisition device to capture a scene containing the traffic sign, may be obtained from the cloud, may be obtained from a database or a blockchain, or may be imported from an external device.

In a possible implementation manner, the image capturing device may be a video camera or a still camera, or may be a radar device such as a laser radar or a millimeter wave radar, and the image capturing device is an image capturing device located in a vehicle, such as a car recorder, a video camera mounted on a windshield of the vehicle and having a lens facing a driving direction of the vehicle, or the like.

The camera may be a monocular camera, a binocular camera, a depth camera, a three-dimensional camera, etc. Optionally, in the process of acquiring an image by using a camera, the camera may be controlled to start a camera shooting mode, so as to scan a traffic sign in the field of view of the camera in real time, shoot at a specified frame rate, obtain a road video, and process to generate a traffic image to be detected. In the process of image acquisition through radar equipment, a detection signal can be transmitted to a traffic sign in real time through the radar equipment, then an echo signal reflected by the traffic sign is received, the characteristic data of the traffic sign is determined based on the difference between the detection signal and the echo signal, and a traffic image to be detected is determined based on the characteristic data.

It should be noted that the traffic image to be detected may be in an image sequence format, a three-dimensional point cloud image format, or a video image format.

In one embodiment, after the traffic image to be detected is acquired, the computer device may perform traffic sign body detection on the traffic image to be detected through a preset feature extraction rule, so as to obtain a traffic body area image corresponding to the traffic image to be detected. The traffic body area image may include a traffic body target area, which is an image including only traffic bodies. The target region may be a rectangular region, a circular region, a triangular region, or the like.

Optionally, the feature extraction rule refers to a feature extraction strategy preset for an image to be recognized according to an actual application scenario, and the feature extraction strategy may be a traffic body region prediction model after training, or may be a general feature extraction algorithm, or the like. As an implementation mode, the feature extraction processing can be carried out on the image to be recognized through the traffic body region prediction model, and the target region of the traffic sign body in the traffic image to be detected is obtained. The traffic sign body area prediction model is a network structure model with the traffic sign body extraction capability, which is learned by training sample data. The traffic body area prediction model is a neural network model which is input as a traffic image to be detected and output as a traffic body target area of a traffic body in the traffic image to be detected, namely the traffic body area prediction model is output as a traffic body area image, has the capability of carrying out image recognition on the traffic image to be detected and can predict the traffic body area image. The traffic body area prediction model can comprise a multilayer network structure, the network structures of different layers carry out different processing on the data input into the traffic body area prediction model, and the output result is transmitted to the next network layer until the data is processed by the last network layer, so that the traffic body target area of the traffic body in the traffic image to be detected is obtained.

It should be noted that, each implementation manner of performing traffic sign body detection on the traffic image to be detected to obtain the traffic body area image corresponding to the traffic image to be detected is only an example, and the embodiment of the present application does not limit this.

In one embodiment, considering that the computing resources provided by the computer device are limited, in order to reduce the amount of computation in the image processing process, after the traffic image to be detected is acquired, the computer device may compress the traffic image to be detected, which is directly acquired, to obtain a processed traffic image to be detected.

Optionally, a lossy or lossless compression mode may be adopted to compress a traffic image to be detected that has a higher resolution; in the lossy case, the compression ratio is increased by discarding secondary information to which human vision is not sensitive, so as to obtain the processed traffic image to be detected.

As an implementation manner, a resolution threshold may be preset according to actual processing requirements of computing resources, after the traffic image to be detected is acquired, the resolution of the traffic image to be detected may be determined, and when the resolution of the traffic image to be detected is smaller than the resolution threshold, the traffic image to be detected does not need to be compressed, and a subsequent processing flow may be directly performed. And when the resolution of the traffic image to be detected is not less than the resolution threshold, compressing the traffic image to be detected to obtain a processed traffic image to be detected, and then performing a subsequent processing flow.
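The threshold decision described above can be sketched as follows. This is an illustrative sketch under stated assumptions: resolution is measured here as the total pixel count, the image is a toy nested list, and the helper `subsample` (which keeps every second pixel) merely stands in for whatever lossy or lossless compression is actually used; none of these names come from the application.

```python
from typing import Callable, List

Image = List[List[int]]  # toy grayscale image as nested lists


def pixel_count(image: Image) -> int:
    """Resolution measured as height * width (an assumption of this sketch)."""
    return len(image) * len(image[0]) if image else 0


def subsample(image: Image, step: int = 2) -> Image:
    """Stand-in for compression: keep every step-th row and column."""
    return [row[::step] for row in image[::step]]


def prepare(image: Image, threshold: int,
            compress: Callable[[Image], Image] = subsample) -> Image:
    """Compress only when the resolution is not less than the threshold."""
    if pixel_count(image) < threshold:
        return image          # small enough: go straight to later processing
    return compress(image)    # otherwise compress first
```

For example, with a 4 x 4 image and a threshold of 16 pixels, `prepare` applies the stand-in compression, whereas with a threshold of 17 pixels the image is returned unchanged.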

In the embodiment, the traffic image to be detected is compressed, so that the data volume represented by the traffic image to be detected can be reduced, image identification processing is performed by using fewer computing resources, and the computing resources are saved.

In the embodiment, the traffic sign body detection is carried out on the traffic image to be detected, so that the target area of the traffic body in the traffic image to be detected can be accurately obtained, and the traffic body area image is obtained, and therefore the image characteristics of the traffic sign area image with finer granularity can be obtained aiming at the correct traffic body area image, and the determined traffic sign identification result is more accurate.

S102, determining N anchor frames in the traffic body area image, and detecting a traffic sign in an area corresponding to each anchor frame to obtain a traffic sign area image corresponding to the traffic body area image; the corresponding size and position of each of the N anchor frames are different, and N is an integer greater than or equal to 1.

It should be noted that the anchor frame is used to select a preset region in the model input image, for example, a traffic body region may be selected, where the anchor frame refers to a plurality of prior frames defined according to a preset algorithm with an anchor point as a center, and the shape of the prior frames may be, for example, a rectangle, a triangle, a diamond, a circle, or the like.

The N anchor frames are obtained by training based on historical traffic body images and the real category results of the historical traffic body images, and the historical traffic body images contain marked traffic body areas. When the shape of the prior frame is a rectangle, a plurality of prior frames with different sizes and aspect ratios, generated by taking each pixel as a center, can be used, and the sizes and aspect ratios are obtained by training based on the historical traffic body images and their real category results. When the shape of the prior frame is a circle, a plurality of prior frames with different radii, generated by taking each pixel as a center, can be used, and the radii are obtained by training based on the historical traffic body images and their real category results.

As an alternative embodiment, assume that the width and height of the historical traffic body image are $w$ and $h$ respectively, and that anchor frames of different shapes are generated centered on each pixel of the image. Let the size of an anchor frame be $s \in (0, 1]$ and its aspect ratio be $r > 0$; once the center position is determined, the corresponding anchor frame can be determined. For example, a set of sizes $s_1, s_2, \ldots, s_n$ and a set of aspect ratios $r_1, r_2, \ldots, r_m$ are set separately. If all combinations of size and aspect ratio are used with each pixel as a center, $whnm$ anchor frames are obtained. Although these anchor frames may cover all real bounding boxes, the computational complexity easily becomes too high. Therefore, usually only the combinations containing $s_1$ or $r_1$ are of interest, namely: $(s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1)$. That is, the number of anchor frames centered on the same pixel is $n + m - 1$, and a total of $wh(n + m - 1)$ anchor frames are generated for the entire historical traffic body image.
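The reduced set of size and aspect-ratio combinations can be sketched as follows. The conversion from a (size, ratio) pair to a box width and height is not given in the text; the convention used here (width w·s·√r, height h·s/√r) is an assumption, as are the function names.

```python
import math


def anchor_shapes(sizes, ratios):
    """Keep only the (size, ratio) pairs that contain sizes[0] or ratios[0]."""
    pairs = [(sizes[0], r) for r in ratios]
    pairs += [(s, ratios[0]) for s in sizes[1:]]
    return pairs  # n + m - 1 pairs


def generate_anchors(w, h, sizes, ratios):
    """Generate (cx, cy, box_w, box_h) anchors centered on every pixel."""
    pairs = anchor_shapes(sizes, ratios)
    anchors = []
    for y in range(h):
        for x in range(w):
            cx, cy = x + 0.5, y + 0.5  # pixel center
            for s, r in pairs:
                # Assumed convention for turning (size, ratio) into a box.
                box_w = w * s * math.sqrt(r)
                box_h = h * s / math.sqrt(r)
                anchors.append((cx, cy, box_w, box_h))
    return anchors


if __name__ == "__main__":
    anchors = generate_anchors(4, 3, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
    print(len(anchors))  # w * h * (n + m - 1)
```

With n = 3 sizes and m = 3 ratios, each pixel carries n + m − 1 = 5 anchor frames instead of nm = 9.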

As another implementation manner, historical traffic body images can be obtained, the aspect ratios of the target frames in the historical traffic body images are counted, and k aspect ratio values are generated by clustering with the k-means method. The aspect ratios of the anchor frames correspond to the k values generated by clustering, so that anchor frames with k different shapes are obtained.
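A minimal sketch of clustering target-frame aspect ratios with k-means is given below. It is a one-dimensional, pure-Python illustration; the quantile-based initialization and the function name `kmeans_1d` are assumptions made here, not details from the application.

```python
def kmeans_1d(values, k, iters=50):
    """Cluster scalar aspect ratios into k centers with plain k-means."""
    data = sorted(values)
    n = len(data)
    # Assumed initialization: evenly spaced quantiles of the sorted data.
    centers = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous center.
        updated = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return centers


if __name__ == "__main__":
    ratios = [0.5, 0.52, 0.48, 1.0, 1.02, 0.98, 2.0, 2.1, 1.9]
    print(kmeans_1d(ratios, k=3))
```

Each returned center would then serve as the aspect ratio of one anchor frame shape.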

Specifically, after the traffic body region image corresponding to the traffic image to be detected is obtained, traffic sign detection may be performed on the traffic body region image, for example by using a feature detection strategy, so as to obtain the traffic sign region image corresponding to the traffic body region image. Optionally, the feature detection strategy refers to a strategy preset according to the actual application scenario and used for feature extraction and detection; it may be a detection network obtained after training is completed, a general feature extraction algorithm, or the like. As an implementation manner, the traffic sign detection may be performed on the traffic body region image through the detection network, so as to obtain the image features of the traffic sign region image corresponding to the traffic body region image. The detection network is a network structure model that learns feature extraction capability from training sample data. The input of the detection network is each of a plurality of sub-regions, and the output is the image feature of each sub-region; this neural network model is capable of performing image recognition on each sub-region and can predict the image feature of each sub-region. The detection network may comprise a multilayer network structure, in which the network structures of different layers perform different processing on the data input into the detection network and pass their output to the next network layer, until the last network layer produces the image features of each sub-region.

As another implementation manner, traffic sign detection is performed on the traffic body region image through a feature extraction algorithm to obtain the image features of the traffic sign region image corresponding to the traffic body region image, where the feature extraction algorithm may be, for example, the Scale-Invariant Feature Transform (SIFT) algorithm, the Speeded Up Robust Features (SURF) algorithm, or the Oriented FAST and Rotated BRIEF (ORB) feature detection algorithm.

It can be understood that feature extraction is respectively performed on the traffic sign body areas, and the obtained image features of the traffic sign area images are different.

As another implementation manner, a pre-established template image database may be queried, the image features of the traffic body area image are compared with the image features in the template image database, a part of the traffic body area image, which is in accordance with the template image features in the template image database, is determined, and then the part in accordance with the feature comparison is determined as a traffic sign target area of the traffic sign in the traffic body area image. The template image database can be flexibly configured according to the image characteristic information of the traffic signs in the actual application scene, and the traffic signs with different traffic body types, traffic body forms, structures and other characteristics are constructed after being collected and sorted.
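The template comparison can be illustrated with a brute-force sliding-window match. This sketch assumes grayscale images stored as lists of rows and uses raw pixel values with a sum of absolute differences as a stand-in for the image features and comparison rule, neither of which is specified in the text.

```python
def match_template(image, template):
    """Return (row, col, score) of the window that best matches the template.

    The score is the sum of absolute differences (lower is better); raw pixel
    values are used in place of the unspecified image features.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            sad = sum(
                abs(image[r + i][c + j] - template[i][j])
                for i in range(th)
                for j in range(tw)
            )
            if best is None or sad < best[2]:
                best = (r, c, sad)
    return best


if __name__ == "__main__":
    image = [[0] * 4 for _ in range(4)]
    image[1][2], image[1][3], image[2][2], image[2][3] = 9, 8, 7, 6
    print(match_template(image, [[9, 8], [7, 6]]))
```

The window with the lowest score would be taken as the traffic sign target area for that template.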

It should be noted that each of the above implementation manners of performing traffic sign detection on the traffic body region image to obtain the traffic sign region image corresponding to the traffic body region image is only an example, and the embodiment of the present application does not limit this.

In the embodiment, the traffic sign detection is performed on the traffic body area image, so that the image features of each traffic sign area can be extracted at a finer granularity, and the accuracy of the traffic sign identification result is further improved.

S103, carrying out sign type detection on the traffic sign area image to obtain the type information of the traffic sign in the traffic image to be detected, and obtaining the traffic sign identification result of the traffic image to be detected based on the type information.

The above-mentioned marker type detection is used to detect the type information of the traffic marker in the traffic marker area image, wherein the type information of the traffic marker includes shape information, color information, and state information of the traffic marker. Alternatively, the shape information may be, for example, straight, left turn, right turn, u-turn, etc., the color information may be, for example, red, yellow, and green, and the state information may be, for example, a still state and a motion state.

Specifically, after the traffic sign area image is obtained, it can be input into a trained classification network to determine the traffic sign recognition result of the traffic image to be detected. Alternatively, a classification algorithm may be adopted to determine the traffic sign recognition result.

The classification network is a neural network model capable of predicting the traffic sign recognition result, and the classification network learns the model structure with the traffic sign recognition capability through samples, inputs the traffic sign region image to the classification network, and outputs the traffic sign recognition result of the traffic image to be detected.

As an implementation manner, the classification network may include a fully connected layer and an activation function. After the traffic sign region image is obtained, it may be processed through the fully connected layer to obtain fully connected vector features, and the fully connected vector features are processed with the activation function to obtain the traffic sign recognition result, where the recognition result includes a plurality of traffic sign types and may also include a plurality of type attributes under the traffic sign types.
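A fully connected layer followed by an activation function can be sketched as below. The text does not name the activation; softmax is assumed here because it is a common choice for turning scores into class probabilities. The weights, labels and function names are illustrative only.

```python
import math


def fully_connected(features, weights, biases):
    """One fully connected layer: one weighted sum plus bias per output class."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Assumed activation: numerically stable softmax over the class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, biases, labels):
    """Return the most probable label and its probability."""
    probs = softmax(fully_connected(features, weights, biases))
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


if __name__ == "__main__":
    weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]  # illustrative values
    print(classify([1.0, 0.0], weights, [0.0, 0.0, 0.0], ["red", "yellow", "green"]))
```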

As another implementation mode, the feature information in the traffic sign region image can be extracted in a mode of prior knowledge in a corresponding field, the feature information is clustered through a clustering algorithm to obtain clustering results, and then the recognition result of each clustering result in the feature information is determined by utilizing the artificial prior sign feature knowledge to obtain the traffic sign recognition result. The clustering algorithm may be a clustering function, and the clustering function may be, for example, mean, pool, LSTM, or the like.

As another implementation manner, a pre-established sign feature database of known traffic sign types may be queried, the feature information of unknown types extracted from the traffic sign region image is compared with the sign features of the sign feature database of known sign types, and the sign types with the same sign features are determined as the traffic sign recognition results of the traffic image to be detected. The sign feature library can be constructed by summarizing, classifying and sorting sign data of different traffic sign types, sign forms, sign structures and other features. The traffic sign recognition result is used for identifying the traffic sign in the traffic image to be detected, so that the information, the characteristics and the like of the traffic sign can be quickly acquired through the traffic sign recognition result.

For example, the traffic sign recognition result of the traffic image to be detected may include the traffic sign type of the traffic image to be detected, or may include a plurality of traffic sign attributes of the traffic image to be detected under the traffic sign type. Illustratively, the traffic sign type may be a motor vehicle signal light, a non-motor vehicle signal light, a crosswalk signal light, or the like. The traffic sign attribute corresponding to the motor vehicle signal light may be, for example, a straight green light, a left-turn red light, a right-turn green light, and the like. The traffic signs corresponding to different traffic sign attributes have different functions. For example, a straight green light indicates that the vehicle may go straight; a left-turn red light indicates that the vehicle is forbidden to turn left; a right-turn green light indicates that the vehicle may turn right.

Referring to fig. 4, when the traffic image 3-1 to be detected is obtained, traffic sign body detection may be performed on the traffic image 3-1 to be detected to obtain a traffic body area image 3-2 corresponding to the traffic image to be detected. Traffic sign detection may then be performed on the traffic body area image 3-2 to obtain a traffic sign area image 3-3 corresponding to the traffic body area image 3-2. Sign type detection may be performed on the traffic sign area image 3-3 to obtain the type information 3-4 of the traffic sign in the traffic image to be detected, and the traffic sign recognition result 3-5 of the traffic image to be detected may be obtained based on the type information 3-4.

Compared with the prior art, on one hand, the traffic sign identification method provided in the embodiment of the application facilitates more fine traffic sign detection on the area corresponding to each anchor frame by determining the N anchor frames in the traffic body area image after the guidance information of the traffic body area image is obtained, so that the image features of each traffic sign area in the traffic body area image can be extracted more finely, the traffic signs in the image can be identified based on more detailed features, and the identification accuracy of the traffic signs can be effectively improved. On the other hand, by detecting the mark type of the traffic mark area image, the traffic mark identification result of the traffic image to be detected can be determined by combining more precise and comprehensive characteristics, and the accuracy of the traffic mark identification result in the traffic image to be detected is improved to a great extent. The method can also be applied to an automatic driving system, and can accurately predict the traffic sign in the traffic image to be detected, thereby greatly improving the quality and efficiency of traffic sign identification and providing powerful support for accurate navigation and road condition analysis and processing of automatic driving.

In another embodiment of the application, traffic sign detection may be performed on the traffic body region image to obtain a traffic sign region image corresponding to the traffic body region image. Fig. 5 provides a specific implementation of obtaining a traffic sign region image corresponding to a traffic body region image. Referring to fig. 5, the implementation specifically includes:

S201, inputting the traffic body area image into a detection network, and performing feature extraction on the area corresponding to each anchor frame to obtain the position offset and the size offset corresponding to each anchor frame in the traffic body area image.

The detection network is a neural network model that inputs the traffic region image, outputs the recognition results of the position offset and the size offset corresponding to the anchor frame, has the capability of detecting the traffic region of the traffic region image, and can predict the position offset and the size offset corresponding to each anchor frame. The detection network is used for establishing the relation between the position offset and the size offset corresponding to the image of the traffic body area and the anchor frame, and the model parameters of the detection network are in the optimal state.

The detection network may include, but is not limited to, a convolutional layer, a normalization layer, and an activation function, each of which may comprise one layer or multiple layers. The convolutional layer is used for extracting features such as edges and textures in the traffic body area image. The normalization layer is used for normalizing the image features obtained by the convolutional layer; for example, the mean may be subtracted from the image features and the result divided by the standard deviation, giving a distribution with a mean of zero and a variance of one, which helps prevent gradient explosion and gradient vanishing. The activation function may be a Sigmoid function, a Tanh function, or a ReLU function; with the Sigmoid function, for example, the normalized result is mapped to between 0 and 1 by passing the feature map through the activation function.
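The normalization and activation steps can be sketched on a flat list of feature values. The sketch subtracts the mean and divides by the standard deviation (the square root of the variance), which is what yields unit variance; the small `eps` term and the function names are assumptions added here for numerical safety and illustration.

```python
import math


def normalize(values, eps=1e-5):
    """Subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def sigmoid(x):
    """Maps any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Maps negative inputs to 0 and leaves non-negative inputs unchanged."""
    return max(0.0, x)


if __name__ == "__main__":
    feats = normalize([1.0, 2.0, 3.0, 4.0])
    print([round(sigmoid(f), 3) for f in feats])
    print([round(math.tanh(f), 3) for f in feats])
    print([round(relu(f), 3) for f in feats])
```

Of the three activations named in the text, only Sigmoid has the output range (0, 1); Tanh outputs values in (−1, 1) and ReLU outputs values in [0, ∞).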

In a possible implementation manner, please refer to fig. 6, the detection network may include a basic convolution module and a separable convolution module, the traffic region image 6-1 may be input into the detection network 6-2, the anchor frame is subjected to feature extraction by the basic convolution module in the detection network 6-2 to obtain a feature map, convolution processing is performed by the separable convolution module to obtain output parameters 6-3, the output parameters include a position offset and a size offset corresponding to each extracted type of anchor frame, and a confidence and a category result corresponding to the preset frame are determined by an activation function. The confidence coefficient is used for representing the probability of the traffic sign existing in the preset frame, and the classification result is used for judging whether the traffic sign exists in the preset frame or not.

It is to be understood that the position offset may include an abscissa offset and an ordinate offset of the preset frame with respect to the center point of the anchor frame, and the size offset may include a width offset and a height offset of the preset frame with respect to the anchor frame.

S202, determining a preset frame corresponding to the anchor frame based on the position of the anchor frame, the size of the anchor frame, and the position offset and the size offset corresponding to the anchor frame.

Specifically, after the position offset and the size offset corresponding to the anchor frame are obtained, the anchor frame may be corrected according to the position of the anchor frame, the size of the anchor frame, and the position offset and the size offset corresponding to the anchor frame, so as to obtain the preset frame corresponding to the anchor frame. Wherein, a preset correction algorithm can be adopted for processing. For example, when the position of the preset frame corresponding to the anchor frame is determined, the position of the anchor frame and the position offset corresponding to the anchor frame may be added, and when the size of the preset frame corresponding to the anchor frame is determined, the size of the anchor frame and the size offset corresponding to the anchor frame may be added, so as to obtain the position and the size of the preset frame corresponding to the anchor frame.

Illustratively, assume that the position offset of the anchor frame (center point abscissa offset and ordinate offset) is determined to be $(\Delta x, \Delta y)$, the size offsets (width offset and height offset) are $\Delta w$ and $\Delta h$ respectively, the position of the anchor frame (center point abscissa and ordinate) is $(x_1, y_1)$, and the size of the anchor frame (width and height) is $w_1$ and $h_1$. Then the position coordinate of the preset frame corresponding to the anchor frame can be determined as $(x_1 + \Delta x, y_1 + \Delta y)$, the width of the preset frame as $w_1 + \Delta w$, and the height of the preset frame as $h_1 + \Delta h$, which determines the preset frame corresponding to the anchor frame.
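The additive correction in this example can be written as a short function. The tuple layouts and the name `decode_anchor` are illustrative choices; the arithmetic follows the example directly.

```python
def decode_anchor(anchor, offsets):
    """Apply additive position and size offsets to an anchor frame.

    anchor  = (x1, y1, w1, h1): center coordinates, width, height
    offsets = (dx, dy, dw, dh): predicted position and size offsets
    """
    x1, y1, w1, h1 = anchor
    dx, dy, dw, dh = offsets
    return (x1 + dx, y1 + dy, w1 + dw, h1 + dh)


if __name__ == "__main__":
    print(decode_anchor((10, 20, 4, 6), (1, -2, 2, 0)))
```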

Further, for each feature point on the traffic body area image, the correction algorithm may be applied to each type of anchor frame to obtain a preset frame corresponding to each type of anchor frame, so that N preset frames corresponding to the N anchor frames may be obtained. The N preset frames may include a frame with a traffic sign or a frame without a traffic sign.

S203, filtering invalid preset frames in the preset frames corresponding to the N types of anchor frames, and obtaining a traffic sign area image based on the areas corresponding to the residual anchor frames.

The area corresponding to a remaining anchor frame is the area selected by that anchor frame. An invalid preset frame is a preset frame that deviates too much from the real preset frame, and the real preset frame is a preset frame that contains only a traffic sign area. The invalid preset frame may be, for example, a preset frame that does not completely include the traffic sign area, or a preset frame that includes not only the traffic sign area but also the background area. The background refers to the remaining parts of the traffic body area image other than the traffic sign, and may include, for example, trees, sky, roads, poles, vehicles, and the like in the traffic body area image.

It can be understood that a large number of preset frames may be generated at the same target position during traffic sign detection, and these preset frames may overlap with each other. In this case, a Non-Maximum Suppression (NMS) algorithm needs to be used to eliminate the redundant invalid preset frames, so as to determine the optimal target preset frame, that is, to determine the traffic sign region image based on the regions selected by the remaining anchor frames. The non-maximum suppression algorithm may be, for example, a DIoU-NMS algorithm.

As an optional implementation manner, the N preset frames may be collected into a preset frame list with a corresponding confidence score list, and the flow of the non-maximum suppression algorithm may be as follows. The preset frame list and its corresponding confidence score list are acquired, and a threshold is set, where the threshold is used for deleting preset frames with a large overlap. The preset frames are first sorted by confidence score; the preset frame with the highest confidence is added to the final output list and deleted from the preset frame list. The areas of all preset frames are then calculated, and the overlap degree IoU (the ratio of the intersection area to the union area) between the preset frame with the highest confidence and each of the other preset frames is calculated based on these areas. The preset frames whose IoU is greater than the threshold are deleted, and the process is repeated until the preset frame list is empty.

For example, suppose the obtained preset frame list comprises six rectangular frames whose confidences in the confidence score list are, in ascending order, A, B, C, D, E and F. The non-maximum suppression algorithm starts from the rectangular frame with the maximum confidence F and judges, for each of the rectangular frames with confidences A to E, whether its overlap degree IoU with the rectangular frame with confidence F is greater than the set threshold. Assuming that the overlap degrees of the rectangular frames with confidences B and D with the rectangular frame with confidence F exceed the threshold, the rectangular frames with confidences B and D are removed, and the rectangular frame with confidence F is marked and retained as the first rectangular frame. Then, from the remaining rectangular frames with confidences A, C and E, the rectangular frame with the maximum confidence E is selected, and its overlap degree with the rectangular frames with confidences A and C is judged; when the overlap degree is greater than the set threshold, the rectangular frames with confidences A and C are removed, and the rectangular frame with confidence E is marked and retained as the second rectangular frame. The above process is repeated to determine the retained rectangular frames, thereby obtaining the corresponding traffic sign region image.
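The non-maximum suppression flow can be sketched in a few lines. This is a plain IoU-based version, not the DIoU-NMS variant; boxes are assumed to be given as (x_min, y_min, x_max, y_max) corner coordinates, and the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Return indices of the boxes kept by non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest remaining confidence
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    print(nms(boxes, scores=[0.9, 0.8, 0.7], iou_threshold=0.5))
```

In this usage, the second box overlaps the first with an IoU of 81/119 and is suppressed, while the third box does not overlap and is kept.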

According to the embodiment of the application, the position offset and the size offset corresponding to each type of anchor frame can be accurately determined by inputting the traffic body area image into the detection network, so that the preset frame corresponding to the anchor frame is accurately determined based on the position of the anchor frame, the size of the anchor frame, and the position offset and size offset corresponding to the anchor frame, and the accuracy of determining the traffic sign area image is further improved.

In another embodiment of the application, feature extraction can be performed on the traffic body area image to obtain a position offset and a size offset corresponding to each type of anchor frame. Referring to fig. 7, a first feature map 7-3 can be obtained by performing a basic convolution process on the traffic body region image 7-1 through a basic convolution module 7-2; the basic convolution processing sequentially comprises convolution, normalization and activation function processing; and the separable convolution processing is carried out on the first feature map 7-3 through the separable convolution module 7-4 to obtain a second feature map 7-5, and then the basic convolution processing is carried out on the second feature map 7-5 through the basic convolution module 7-6 to obtain an output parameter 7-7, wherein the output parameter can comprise a position offset and a size offset.

Specifically, after the traffic body area image is obtained, a target area of the traffic body area image can be determined based on an anchor frame, the target area being the area selected by the anchor frame in the traffic body area image. The target area is then subjected to convolution, normalization and activation function processing in sequence through a basic convolution module to obtain a first feature map. The basic convolution module can be a ConvBnRelu module, which comprises a Conv convolution layer, a Bn layer and a Relu layer connected in sequence. The Conv convolution layer extracts image features such as edges and textures within the anchor frame of the traffic body area image. The Bn layer takes the output features of the Conv convolution layer as input and normalizes them, filtering noise features out of the image features to obtain filtered image features. The filtered image features are then passed through the activation function, which applies a nonlinear mapping that enhances the generalization capability of the feature extraction model, to obtain the first feature map. The first feature map comprises a plurality of feature points.

After the first feature map is obtained, it may be processed by a separable convolution module that includes one or more DW-PW convolution layers. A DW-PW convolution layer applies DW convolution and then PW convolution in sequence, where DW convolution is used for filtering and PW convolution is used for converting channels. The DW-PW convolution layer may include a depthwise layer, a BN layer, and a Relu layer, which perform convolution, normalization, and activation function processing in sequence, so that the normalized features are weighted onto the features of each channel to obtain a second feature map. Basic convolution processing is then performed on the second feature map: a Conv convolution layer extracts the features in the second feature map, a BN layer takes the output features of the Conv convolution layer as input and normalizes them, and the normalized features are processed through an activation function to obtain output parameters. The output parameters may include the position offset and size offset of the prediction frame relative to the anchor frame, and the confidence and category result corresponding to the prediction frame.

Optionally, during the separable convolution processing, the number of DW-PW convolution layers may be one or more. The more DW-PW convolution layers there are, the more granularities of information can be extracted, so the information extracted into the feature map is more comprehensive and the obtained position offset and size offset are more accurate.

It should be noted that DW Convolution (Depthwise Convolution) differs from the conventional convolution operation: each convolution kernel of Depthwise Convolution is responsible for one channel, and each channel is convolved by only one convolution kernel. In conventional convolution, by contrast, each convolution kernel operates on every channel of the input picture simultaneously.

For example, for a three-channel color input picture of 5 × 5 pixels (shape 5 × 5 × 3), Depthwise Convolution first performs a convolution operation which, unlike conventional convolution, is carried out entirely within a two-dimensional plane. The number of convolution kernels is the same as the number of channels in the previous layer (channels and convolution kernels correspond one to one), so a three-channel image generates 3 Feature maps after this operation.

The number of Feature maps after the Depthwise Convolution is the same as the number of channels of the input layer, so the number of Feature maps cannot be expanded. Moreover, this operation convolves each channel of the input layer independently, and the feature information of different channels at the same spatial position is not effectively utilized. Therefore, PW Convolution (Pointwise Convolution) is required to combine these Feature maps and generate new Feature maps.

The operation of Pointwise Convolution is very similar to the conventional convolution operation; the size of its convolution kernel is 1 × 1 × M, where M is the number of channels in the previous layer. The convolution operation here therefore performs a weighted combination of the Feature maps from the previous step in the depth direction to generate new Feature maps, and the number of output Feature maps equals the number of convolution kernels. That is, the Depthwise layer only changes the size of the feature map and does not change the number of channels; conversely, the Pointwise layer only changes the number of channels and does not change the size.
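The shape behaviour of the two steps can be checked with a small pure-Python sketch on the 5 × 5 × 3 example. The 3 × 3 kernel size, the absence of padding, the stride of 1 and the choice of four pointwise kernels are assumptions made for illustration.

```python
def depthwise_conv(channels, kernels):
    """Convolve each channel with its own 2D kernel (no padding, stride 1)."""
    out = []
    for fmap, k in zip(channels, kernels):
        kh, kw = len(k), len(k[0])
        oh, ow = len(fmap) - kh + 1, len(fmap[0]) - kw + 1
        out.append([
            [sum(fmap[r + i][c + j] * k[i][j] for i in range(kh) for j in range(kw))
             for c in range(ow)]
            for r in range(oh)
        ])
    return out


def pointwise_conv(channels, kernels):
    """Apply 1 x 1 x M kernels: each output map is a weighted sum over channels."""
    h, w = len(channels[0]), len(channels[0][0])
    return [
        [[sum(wt * ch[r][c] for wt, ch in zip(kernel, channels)) for c in range(w)]
         for r in range(h)]
        for kernel in kernels
    ]


if __name__ == "__main__":
    picture = [[[1.0] * 5 for _ in range(5)] for _ in range(3)]   # 3 channels, 5 x 5
    dw = depthwise_conv(picture, [[[1.0] * 3 for _ in range(3)]] * 3)
    pw = pointwise_conv(dw, [[1.0, 1.0, 1.0]] * 4)                # 4 kernels of 1 x 1 x 3
    print(len(dw), len(dw[0]), len(dw[0][0]))   # channels unchanged, size reduced
    print(len(pw), len(pw[0]), len(pw[0][0]))   # channels changed, size unchanged
```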

In the embodiment, the target area of the traffic body area image is determined based on the anchor frame, and the basic convolution processing and the separable convolution processing are performed on the target area, so that the information with a finer granularity in the image can be extracted, and the parameter quantity can be reduced and the detection speed can be increased through the DW convolution processing and the PW convolution processing in the separable convolution.
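The reduction in parameter quantity can be illustrated by counting weights. The sketch compares a conventional convolution with a DW convolution followed by a PW convolution, ignoring biases; the layer sizes in the usage example are assumed values, not figures from the application.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    """Every kernel spans all input channels: k * k * C_in weights per output channel."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_params(kernel_size, in_channels, out_channels):
    """DW: one k * k kernel per input channel; PW: one 1 x 1 x C_in kernel per output."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


if __name__ == "__main__":
    print(standard_conv_params(3, 3, 4), separable_conv_params(3, 3, 4))
```

For a 3 × 3 kernel with 3 input channels and 4 output channels this gives 108 weights for the conventional convolution against 39 for the separable form, and the gap widens as the channel counts grow.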

In another embodiment of the application, before the traffic sign detection is performed on the traffic body area image, the size of the traffic body area image may be adjusted according to a preset size parameter.

Specifically, the size and shape of the obtained traffic area image may be different, and the traffic area image needs to be subjected to size adjustment processing so that the size and shape of the traffic area image are adjusted to an image meeting the input requirements of the detection network.

The position information and the size information of the traffic body area image can be acquired, then the traffic body area image is subjected to image scaling and expansion processing according to preset size parameters based on the position information and the size information of the traffic body area image, and the coordinate position of the traffic body area is subjected to mapping processing.

It should be noted that the preset size parameter may be set by a user according to the actual input requirements of the detection network, and may include preset size information or position information. The position information and the size information of the traffic body area image can be obtained, and when they do not conform to the preset parameters, image scaling (resize) and expansion (padding) need to be performed on the traffic body area image.

Optionally, the image scaling may use a resize function to change the size of the image, and may be implemented by, for example, the following interpolation: nearest neighbor interpolation, cubic spline interpolation, linear interpolation, regional interpolation.
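As a minimal sketch of the first of these interpolation modes, nearest neighbor scaling can be written without any image library. The function name `resize_nearest` and the index mapping (floor of the scaled coordinate) are assumptions for this example; image libraries may use a slightly different pixel-center convention.

```python
def resize_nearest(image, new_h, new_w):
    """Resize a 2-D image (list of rows) by nearest neighbor interpolation."""
    old_h, old_w = len(image), len(image[0])
    resized = []
    for r in range(new_h):
        # Map each output row/column back to the closest source index.
        src_r = min(old_h - 1, r * old_h // new_h)
        resized.append([image[src_r][min(old_w - 1, c * old_w // new_w)]
                        for c in range(new_w)])
    return resized


small = [[1, 2],
         [3, 4]]
big = resize_nearest(small, 4, 4)
# big == [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```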

As an implementation manner, when the image of the traffic body area is expanded, preset software may be used for processing, and a user may set relevant parameters in a customized manner according to actual requirements, where the parameters may be, for example, function options "expansion" and "size" selected in a customized manner by the user, and then the preset software is run to expand the size of the traffic sign area of the image of the traffic body area according to the relevant parameters, so as to obtain a processed image, where the processed image meets the input requirements of the detection network. The preset software may be image processing software.

When the size expansion processing is performed on the traffic sign area, the expansion may be performed according to a preset ratio with the traffic sign area as the center. The preset ratio can be customized according to actual requirements; for example, the ratio between the size of the traffic sign area and the size of the processed image can be 1:2, 1:3, 1:4, 2:3, etc.

For example, when the size of the traffic sign region in the traffic body region image is determined to be W × H and the preset ratio is assumed to be 1:2, size expansion processing is performed on the target area to obtain a processed image. The size of the processed image is 2W × 2H, and the processed image contains the traffic sign area.
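The 1:2 example can be sketched as follows. The function name `expand_region`, the (center x, center y, width, height) input convention and the corner-coordinate output are assumptions for illustration; clipping to the image boundary is deliberately left out.

```python
def expand_region(cx, cy, w, h, ratio=(1, 2)):
    """Expand a W x H region about its center by a preset ratio.

    ratio = (original, expanded); (1, 2) turns W x H into 2W x 2H.
    Returns the expanded region as (x0, y0, x1, y1).
    """
    original, expanded = ratio
    scale = expanded / original
    new_w, new_h = w * scale, h * scale
    return (cx - new_w / 2, cy - new_h / 2, cx + new_w / 2, cy + new_h / 2)


# A 20 x 10 traffic sign region centered at (50, 40), expanded at 1:2.
box = expand_region(cx=50, cy=40, w=20, h=10)
# box == (30.0, 30.0, 70.0, 50.0), i.e. a 40 x 20 region with the same center
```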

For example, after the traffic body area image is acquired, whether it needs to be adjusted can be judged according to a preset model input rule. When no adjustment is needed, feature extraction can be performed directly on the traffic body area image to obtain the position offset and the size offset corresponding to each type of anchor frame. Referring to fig. 8, when adjustment is required, a preset size parameter may be selected according to actual requirements, for example a preset side length s. The image is then scaled (resize) according to a preset proportion so that its long side equals s. The short side is then expanded (padding) according to a preset proportion, for example symmetrically on both sides with the traffic sign region as the center, so that the size of the processed image is s × s. Finally, according to the mapping relationship between the original image and the processed image, the coordinates of the traffic sign region are mapped onto the processed image of size s × s, so that feature extraction can be performed on the processed image to obtain the position offset and the size offset corresponding to each type of anchor frame.
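The scaling, symmetric padding and coordinate mapping described here can be sketched as pure arithmetic. The function names `letterbox_params` and `map_box`, the rounding of the scaled size and the (x0, y0, x1, y1) box convention are assumptions made for this example.

```python
def letterbox_params(w, h, s):
    """Scale so the long side equals s, then pad the short side symmetrically.

    Returns (scale, pad_x, pad_y), where pad_x / pad_y are the offsets added
    on the left / top of the scaled image inside the s x s canvas.
    """
    scale = s / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    pad_x = (s - new_w) // 2
    pad_y = (s - new_h) // 2
    return scale, pad_x, pad_y


def map_box(box, scale, pad_x, pad_y):
    """Map (x0, y0, x1, y1) from the original image onto the s x s image."""
    x0, y0, x1, y1 = box
    return (x0 * scale + pad_x, y0 * scale + pad_y,
            x1 * scale + pad_x, y1 * scale + pad_y)


# A 40 x 100 image brought to 200 x 200: scaled to 80 x 200, padded 60 px per side.
scale, pad_x, pad_y = letterbox_params(w=40, h=100, s=200)
mapped = map_box((10, 20, 30, 60), scale, pad_x, pad_y)
# mapped == (80.0, 40.0, 120.0, 120.0)
```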

In another embodiment of the present application, a specific implementation of a training process for training a detection network is also provided. Please refer to fig. 9, which specifically includes:

S301, obtaining M anchor frames in the historical traffic body image; the corresponding size and position of each anchor frame are different, M is an integer larger than or equal to 1, and the historical traffic body image comprises a marked traffic body area.

Specifically, there may be one or more historical traffic body images, where each historical traffic body image may include at least one traffic sign; for example, the historical traffic body image may include an automobile traffic light sign, a non-automobile traffic light sign, or a sidewalk traffic light sign.

After the historical traffic body images are obtained, they differ in size and shape and do not meet the input requirements for training the first initial network, so they need to be processed. The historical traffic body image can be resized according to preset parameters. For example, a fixed side length s can be selected, and the traffic body is scaled according to a preset proportion so that the long side of the scaled image is s. The short side can then be expanded symmetrically on both sides with the traffic body area as the center, so that the size of the processed image is s × s. The coordinates of the traffic sign area are mapped onto the processed image of size s × s according to the mapping relation between the processed image and the historical traffic body image, so that a historical traffic body image meeting the input requirements of the first initial network is obtained.

The anchor frames of the historical traffic body image are then determined. The width and the height of the historical traffic body image can be used as detection features, and clustering is carried out on these features with the k-means algorithm to obtain M clusters. The size of each anchor frame is determined from the coordinates of the corresponding cluster center, that is, the class centers of the M clusters are taken as the widths and heights of the corresponding anchor frames. Anchor frames of different types differ in size and position.
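A minimal sketch of this clustering step is given below. It assumes Euclidean distance on (width, height) pairs and a deterministic initialization from the first M samples; neither choice is specified in the application, and the function name `kmeans_anchors` and the sample sizes are hypothetical.

```python
def kmeans_anchors(sizes, m, iterations=20):
    """Cluster (width, height) pairs into m anchor sizes with plain k-means."""
    # Assumed initialization: the first m samples serve as starting centers.
    centers = [(float(w), float(h)) for w, h in sizes[:m]]
    for _ in range(iterations):
        clusters = [[] for _ in range(m)]
        for w, h in sizes:
            nearest = min(range(m),
                          key=lambda i: (w - centers[i][0]) ** 2
                                        + (h - centers[i][1]) ** 2)
            clusters[nearest].append((w, h))
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append((sum(w for w, _ in cluster) / len(cluster),
                                    sum(h for _, h in cluster) / len(cluster)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


sizes = [(10, 12), (50, 100), (12, 10), (52, 98), (11, 11), (48, 102)]
anchors = kmeans_anchors(sizes, m=2)
# anchors == [(11.0, 11.0), (50.0, 100.0)]
```

Each returned pair is the class center of one cluster and is used as the width and height of one anchor frame.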

S302, inputting the historical traffic body image into a first initial network, and performing feature extraction on the historical traffic body image to obtain the position offset and the size offset corresponding to each type of anchor frame in the M anchor frames.

The first initial network is a neural network model which takes the historical traffic body image as input and outputs the position offset and the size offset corresponding to each type of anchor frame; it has the capability of extracting the features of the historical traffic body image and can predict the position offset and the size offset corresponding to each type of anchor frame in the historical traffic body image. The first initial network may be an initial model during iterative training, that is, the model parameters of the first initial network are in an initial state, or may be a model adjusted in a previous round of iterative training, that is, the model parameters of the first initial network are in an intermediate state. The historical traffic body image may be input into the first initial network to obtain an output result, where the output result may include the position offset and the size offset of the prediction frame relative to each type of anchor frame, a confidence corresponding to the prediction frame, and a category result corresponding to the prediction frame. The category result is used for representing whether the prediction frame contains the traffic sign, and the confidence is used for representing the probability that the image in the prediction frame is the traffic sign.

After the historical traffic body image is input into the first initial network, it can be sequentially subjected to convolution, normalization and activation function processing through the basic convolution module to obtain a corresponding first sample feature map. The first sample feature map is then subjected to convolution processing through the separable convolution module to obtain a second sample feature map, and the second sample feature map is sequentially subjected to convolution, normalization and activation function processing through the basic convolution module to obtain the output result of the first initial network.

The basic convolution module can comprise a Conv convolution layer, a Bn layer and a Relu layer which are connected in sequence. Image features such as the edges and textures of each anchor frame are extracted from the historical traffic body image through the Conv convolution layer. The Bn layer then normalizes the extracted image features according to a normal distribution to filter out noise features, giving filtered image features. The Relu activation layer then applies a nonlinear mapping to the filtered image features to strengthen the generalization capability of the feature extraction model, and the first sample feature map is obtained. The first sample feature map comprises a plurality of feature points. The separable convolution module may include a DW convolution layer and a PW convolution layer.

S303, determining a prediction frame corresponding to each of the M anchor frames based on the position of the anchor frame, the size of the anchor frame, and the position offset and the size offset corresponding to the anchor frame.

S304, filtering invalid prediction frames in the prediction frames corresponding to the M anchor frames to obtain a predicted traffic body area of the first initial network.

Specifically, after the position offset and the size offset corresponding to each of the M anchor frames are obtained, the anchor frame may be corrected according to its position, its size, and its corresponding position offset and size offset, so as to obtain the prediction frame corresponding to the anchor frame, thereby obtaining M prediction frames. Invalid prediction frames among the prediction frames corresponding to the M anchor frames are then filtered out by a non-maximum suppression algorithm, thereby obtaining the predicted traffic body area of the first initial network.
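Steps S303 and S304 can be sketched as follows. The decoding rule used here (center shifted in proportion to the anchor size, width and height scaled exponentially) is a common convention assumed for illustration; the application does not state the exact correction formula. The function names, the 0.5 IoU threshold and the sample values are likewise hypothetical.

```python
import math


def decode(anchor, offsets):
    """Correct an anchor (cx, cy, w, h) with offsets (dx, dy, dw, dh).

    Assumed rule: center moves by dx*w, dy*h; size scales by exp(dw), exp(dh).
    Returns the prediction frame as (x0, y0, x1, y1).
    """
    cx, cy, w, h = anchor
    dx, dy, dw, dh = offsets
    pcx, pcy = cx + dx * w, cy + dy * h
    pw, ph = w * math.exp(dw), h * math.exp(dh)
    return (pcx - pw / 2, pcy - ph / 2, pcx + pw / 2, pcy + ph / 2)


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) frames."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Non-maximum suppression; returns indices of the frames that are kept."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a frame only if it does not overlap a higher-scoring kept frame.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


anchors = [(50, 50, 20, 40), (52, 50, 20, 40), (150, 50, 20, 40)]
offsets = [(0.0, 0.0, 0.0, 0.0)] * 3
boxes = [decode(a, o) for a, o in zip(anchors, offsets)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])
# kept == [0, 2]: the second frame overlaps the first heavily and is filtered out
```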

S305, based on a loss function between the predicted traffic body area and the marked traffic body area, performing iterative training on the first initial network by adopting an iterative algorithm according to the minimization of the loss function to obtain a detection network.

After the output result is obtained, a loss function can be constructed based on the output result and the labeling result of the traffic body area. The first initial network is optimized by minimizing the loss function, that is, the parameters in the first initial network are updated according to the difference between the output result and the labeling result, so that the first initial network is trained and the detection network is obtained. The labeling result can be an identification result obtained by manually labeling the historical traffic body image.

Optionally, updating the parameters in the first initial network may mean updating matrix parameters such as the weight matrices and bias matrices in the first initial network. The weight matrices and bias matrices include, but are not limited to, the matrix parameters in the convolutional layers, feed-forward network layers and fully connected layers of the first initial network.

When the parameters of the first initial network are updated through the loss function, if it is determined according to the loss function that the first initial network has not converged, the parameters in the model may be adjusted until the first initial network converges, so as to obtain the detection network. Convergence of the first initial network may mean that the difference between its output result and the labeling result is smaller than a preset threshold, or that the change rate of this difference approaches a certain low value. When the calculated loss function is small, or the difference between the calculated loss function and the loss function output in the previous iteration is close to 0, the first initial network is considered to have converged, and the detection network is obtained.
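The two convergence conditions can be sketched as a simple check on the recorded loss values. The function name `has_converged` and both threshold values are hypothetical; in practice they would be chosen according to the training setup.

```python
def has_converged(losses, loss_threshold=1e-3, delta_threshold=1e-5):
    """Return True when training can be considered converged.

    losses: loss values recorded after each iteration, oldest first.
    Converged if the latest loss is small enough, or if it barely changed
    compared with the previous iteration.
    """
    if not losses:
        return False
    if losses[-1] < loss_threshold:
        return True
    return len(losses) >= 2 and abs(losses[-1] - losses[-2]) < delta_threshold


still_training = has_converged([0.5, 0.4])        # False
small_loss = has_converged([0.5, 0.0005])         # True: below loss_threshold
flat_loss = has_converged([0.2, 0.2000001])       # True: change is close to 0
```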

In another embodiment of the present application, a specific implementation of the loss function of the first initial model is also provided. In one possible implementation, the loss function of the first initial model includes a first component, a second component, a third component, and a fourth component. The first component is used for representing the position offset loss between the prediction frame and the anchor frame; the second component is used for representing the size offset loss between the prediction frame and the anchor frame; the third component is used for representing the loss between the prediction confidence corresponding to the prediction frame and the real confidence; the fourth component is used for representing the loss between the prediction category result corresponding to the prediction frame and the real category result.

The constructed loss function may be the sum of the first component, the second component, the third component and the fourth component. By setting the first component and the second component in the loss function, the position offset error and the size offset error between the prediction frame and the anchor frame, for the historical traffic body image input into the first initial model, can be reduced, so that the position offset and the size offset of the prediction frame relative to the anchor frame can be comprehensively determined. By setting the third component and the fourth component, guidance information can be provided for determining the final predicted traffic body region, so that a better detection network is obtained.

In the embodiment of the application, when the loss function is constructed to obtain the detection network, the difference between the real category result and the prediction category result of the first initial model is integrated, the position offset difference between the prediction frame and the anchor frame corresponding to the historical traffic body image input into the first initial model, and the size offset difference between the prediction frame and the anchor frame corresponding to the historical traffic body image input into the first initial model are integrated, the first initial model is trained based on the loss function, model parameters in the first initial model can be iteratively trained more accurately and comprehensively, the obtained detection network is better, and the accuracy of the determined traffic sign area image is higher.

In another embodiment of the present application, in the training process of the model, reasonable weight coefficients may further be allocated to the first component, the second component, the third component and the fourth component in the loss function, so that the prediction difference of the model closely matches the actual service demand and the performance of the model is improved. In one possible implementation manner, when determining the loss function, the weight coefficients of the first component, the second component, the third component and the fourth component may be determined first, and the loss function is then determined according to the four components and their weight coefficients. The weight coefficient of the first component is related to the importance of the position offset between the prediction frame and the anchor frame; the weight coefficient of the second component is related to the importance of the size offset between the prediction frame and the anchor frame; the weight coefficient of the third component is related to the importance of the confidence corresponding to the prediction frame; the weight coefficient of the fourth component is related to the importance of the prediction category result corresponding to the prediction frame.

The weight coefficient of the first component is positively correlated with the importance degree of the position offset amount between the prediction frame and the anchor frame, that is, the higher the weight coefficient of the first component is, the higher the importance degree of the position offset amount between the prediction frame and the anchor frame is. Similarly, the weight coefficient of the second component has a positive correlation with the importance degree of the size offset between the prediction frame and the anchor frame, that is, the larger the weight coefficient of the second component is, the higher the importance degree of the size offset between the prediction frame and the anchor frame is. The weight coefficient of the third component is positively correlated with the importance degree of the confidence corresponding to the prediction frame, that is, the higher the weight coefficient of the third component is, the higher the importance degree of the confidence corresponding to the prediction frame is. The weight coefficient of the fourth component is positively correlated with the importance degree of the prediction type result corresponding to the prediction frame, that is, the higher the weight coefficient of the fourth component is, the higher the importance degree of the prediction type result corresponding to the prediction frame is.

The first component, the second component, the third component, the fourth component and the loss function satisfy the following formula:

$$Y = a_1 y_1 + a_2 y_2 + a_3 y_3 + a_4 y_4$$

wherein $Y$ is the loss function in the training process of the first initial model, $y_1$ is the first component, $a_1$ is the weight coefficient of the first component, $y_2$ is the second component, $a_2$ is the weight coefficient of the second component, $y_3$ is the third component, $a_3$ is the weight coefficient of the third component, $y_4$ is the fourth component, and $a_4$ is the weight coefficient of the fourth component.
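The weighted sum above can be sketched directly. The function name `total_loss` and the numeric component and weight values are hypothetical and serve only to show the arithmetic.

```python
def total_loss(components, weights):
    """Compute Y = a1*y1 + a2*y2 + a3*y3 + a4*y4 for matching lists."""
    if len(components) != len(weights):
        raise ValueError("each component needs exactly one weight coefficient")
    return sum(a * y for a, y in zip(weights, components))


# Hypothetical values: position, size, confidence and category components.
loss = total_loss(components=[0.25, 0.5, 1.0, 2.0],
                  weights=[2.0, 2.0, 1.0, 0.5])
# loss == 0.5 + 1.0 + 1.0 + 1.0 == 3.5
```

Raising a weight coefficient makes the corresponding component contribute more to the total, which is how the relative importance of each term is expressed.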

In the embodiment of the application, the weight coefficients of the first component, the second component, the third component and the fourth component can be reasonably determined according to the service scenario in which the traffic sign identification method is applied. In a possible implementation manner, the importance proportions of the position offset and the size offset between the prediction frame and the anchor frame, the prediction confidence corresponding to the prediction frame, and the prediction category result are determined according to the service requirement, and the weight coefficients of the first component, the second component, the third component and the fourth component are determined according to these importance proportions.

Illustratively, the above-mentioned loss function may be represented by the following formula:

[formula image not reproduced in the source text]

wherein the first line of the formula represents the position offset loss of the prediction frame relative to the anchor frame, the second line represents the size offset loss of the prediction frame relative to the anchor frame, the third line represents the confidence loss, and the fourth line represents the category loss. $S$ represents the width and height of the output feature map, and $B$ represents the number of anchor frames at each position of the output feature map. An indicator term represents whether a traffic sign exists at position $ij$ of the output feature map; its value is 1 if a traffic sign exists and 0 otherwise. $\alpha$, $\beta$ and $\gamma$ denote weight coefficients of the components. $x_{ij}$ represents the abscissa of the real center point corresponding to the anchor frame at position $ij$ of the feature map, $y_{ij}$ represents the ordinate of the corresponding real center point, $w_{ij}$ represents the width of the anchor frame corresponding to position $ij$, $h_{ij}$ represents the height of the anchor frame corresponding to position $ij$, $C_{ij}$ represents the real confidence of the anchor frame corresponding to position $ij$, and $p(k)$ represents the real category result. The remaining symbols of the formula are the predicted counterparts of these quantities: the abscissa offset of the prediction frame relative to the center point of the anchor frame, the ordinate offset of the prediction frame relative to the center point of the anchor frame, the width offset of the prediction frame relative to the anchor frame, the height offset of the prediction frame relative to the anchor frame, the confidence of the prediction frame, and the prediction category result of the prediction frame.

In the embodiment, the loss function is constructed by setting the first component, the second component, the third component and the fourth component, and the model parameters in the first initial network can be accurately adjusted by optimizing the four-part loss function, so that the detection network is more optimal, and the traffic sign can be accurately identified.

In another embodiment of the present application, an implementation manner of performing sign type detection on the traffic sign region image to obtain a traffic sign recognition result of the traffic sign to be detected is further provided. Please refer to fig. 10, which specifically includes:

S401, cutting the traffic sign area image to obtain the traffic sign image.

It should be noted that the traffic sign region image may include a traffic sign image, and may further include a background region other than the traffic sign region, and in order to improve the accuracy of identifying the traffic sign image, after the traffic sign region image is obtained, the traffic sign region image may be cut to remove a redundant background region, so as to obtain the traffic sign image. The traffic sign image contains only the traffic sign region.

Specifically, in the process of cutting the traffic sign area image, the background area on the traffic sign area image can be determined through image recognition, then the image cutting processing is performed on the background area through the image processing software, and the remaining image is the traffic sign image only containing the traffic sign area. As another implementation manner, the traffic sign area on the image of the traffic sign area may be determined through image recognition, and then the image processing software performs cropping processing on the traffic sign area, so as to obtain the traffic sign image only including the traffic sign area. And adjusting the traffic sign image only containing the traffic sign area according to different sizes and scaling ratios to obtain the traffic sign image capable of being input into the classification network.

For example, referring to fig. 11, generally, since the traffic sign region is close to a square, a side length s1 of a target size may be defined according to actual requirements, and then the size of the traffic sign region image is adjusted according to a preset scaling so that the size of the traffic sign image is s1 × s1.

S402, inputting the traffic sign image into a classification network for sign type detection to obtain pixel information and shape information of the traffic sign image.

S403, performing semantic prediction based on the pixel information and the shape information to obtain the prediction semantics of the traffic sign image, and determining the traffic sign recognition result of the traffic sign image to be detected according to the prediction semantics.

The classification network is obtained by training based on a historical traffic sign region image and a historical classification result of the historical traffic sign region image, and the traffic sign region is labeled on the historical traffic sign region image. The classification network is a neural network model which inputs the traffic sign images and outputs the traffic sign identification results of the traffic sign images, has the capacity of identifying the traffic sign types of the traffic sign images and can predict the traffic sign identification results.

The classification network is used for establishing the relation between the features of the traffic sign images and the types of the target traffic signs, and the model parameters of the classification network are in the optimal state. The classification network may include, but is not limited to, a convolutional layer, a fully connected layer and an activation function, and the convolutional layer and the fully connected layer may each include one layer or multiple layers. The convolutional layer is used for extracting the features of the traffic sign image, and the fully connected layer is mainly used for classifying the features processed by the convolutional layer. The traffic sign image may be processed by the convolutional layer to obtain convolutional features, the convolutional features are processed by the fully connected layer to obtain a fully connected vector, and the fully connected vector is processed by the activation function to obtain the output result of the classification network, where the output result includes the type of the traffic sign and may also include a plurality of traffic sign attributes of the traffic sign under that traffic sign type.

The activation function may be a Sigmoid function, a Tanh function or a ReLU function. For example, when the fully connected vector is processed by the Sigmoid function, the result is mapped to between 0 and 1.
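The output ranges of the three named functions differ: the Sigmoid output lies strictly between 0 and 1, the Tanh output lies between -1 and 1, and the ReLU output is non-negative and unbounded above. A small sketch using only the standard library (the sample inputs are arbitrary):

```python
import math


def sigmoid(x):
    """Logistic function; output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent; output lies strictly between -1 and 1."""
    return math.tanh(x)


def relu(x):
    """Rectified linear unit; negative inputs become 0."""
    return max(0.0, x)


values = [-2.0, 0.0, 2.0]
sigmoid_out = [sigmoid(v) for v in values]
tanh_out = [tanh(v) for v in values]
relu_out = [relu(v) for v in values]   # [0.0, 0.0, 2.0]
```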

In one embodiment, as shown in fig. 12, the classification network may include a feature extraction network 12-2 and a target classification network 12-3, and the traffic sign image 12-1 is processed sequentially through the feature extraction network 12-2 and the target classification network 12-3 to obtain a traffic sign recognition result 12-4. The feature extraction network is used for detecting the mark types of the traffic mark images and extracting pixel information and shape information of the traffic mark images, the target classification network is used for performing semantic prediction based on the pixel information and the shape information to obtain prediction semantics of the traffic mark images, and the traffic mark recognition results of the traffic mark images to be detected are determined according to the prediction semantics. The pixel information may include pixel information of different colors, such as red, green, and yellow. The shape information may include different shapes, and may be, for example, straight, left-turn, right-turn, or the like.

In a possible implementation manner, the feature extraction network may include a basic convolution module, a separable convolution module and a pooling module. The basic convolution module may be a ConvBnRelu module, which includes a Conv convolution layer, a Bn layer and a Relu layer connected in sequence and extracts feature information of the traffic sign image through these layers. The separable convolution module may include one or more DW-PW convolution layers and performs finer-grained feature extraction based on the feature information. The pooling module then performs feature dimension reduction to obtain the pixel information and the shape information.

It should be noted that the pooling module may include a pooling layer; pooling is also referred to as subsampling or downsampling. Its functions are to reduce the dimension of the features, compress the amount of data and the number of parameters, reduce overfitting and improve the fault tolerance of the model. In other words, it performs feature selection, which reduces the number of features and thus the number of parameters.

Optionally, the pooling layer may be a max pooling layer or an average pooling layer. Max pooling generally takes the maximum value of all neurons in a region, and average pooling generally takes the average value of all neurons in the region.
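Both pooling modes can be sketched on a small single-channel feature map. The function name `pool2d`, the non-overlapping 2 × 2 window and the sample values are assumptions for this example.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Pool a 2-D feature map with non-overlapping size x size windows."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                out_row.append(max(window))                 # max pooling
            else:
                out_row.append(sum(window) / len(window))   # average pooling
        pooled.append(out_row)
    return pooled


fmap = [[1, 3, 2, 0],
        [5, 7, 4, 6],
        [9, 1, 8, 2],
        [3, 5, 0, 6]]
max_pooled = pool2d(fmap, mode="max")   # [[7, 6], [9, 8]]
avg_pooled = pool2d(fmap, mode="avg")   # [[4.0, 3.0], [4.5, 4.0]]
```

In both cases the 4 × 4 map is reduced to 2 × 2, which is the dimension reduction attributed to the pooling layer.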

In this embodiment, the maximum value of each local region can be extracted through the max pooling layer, which not only responds well to small local features but also eliminates non-maximum values and reduces the computational complexity of the upper layers. Convolution and max pooling operations are therefore performed alternately as the network depth increases, so that the image features can be further refined.

The processing of the target classification network may specifically include: calculating on the pixel information and the shape information through a multi-classification function, and outputting the type of the traffic sign. The pixel information and the shape information can also be processed by a multivariate binary classification function, which outputs the fused feature attributes. Optionally, the multi-classification function may be a softmax function, and the multivariate binary classification function may be a plurality of sigmoid functions, where one sigmoid function implements one binary classification prediction. The multi-classification function adds a non-linear factor, because a linear model has insufficient expressive ability, and transforms continuous real-valued inputs into outputs between 0 and 1.

Illustratively, the pixel information and the shape information are input into the target classification network, and the prediction result of the target classification network may include any one of traffic sign types such as "motor vehicle signal light", "non-motor vehicle signal light" and "crosswalk signal light". The prediction result may further include traffic sign attributes; for example, the sign attributes corresponding to the traffic sign type "motor vehicle signal light" may be several of "green light straight", "red light left turn", "red light right turn" and "green light right turn".

Taking three classes as an example, the output of the multi-classification function is described as follows. For example, the traffic sign types that can be predicted by the multi-classification function are "motor vehicle signal light", "non-motor vehicle signal light" and "crosswalk signal light", and the output result of the target classification network may be represented by a vector, for example a 3 × 1-dimensional vector, where each element corresponds to a traffic sign type and each element value represents the probability that the traffic sign belongs to the corresponding type. Assuming that the output vector of the multi-classification function is [0.61, 0.31, 0.08], the probability that the traffic sign is "motor vehicle signal light" is 0.61, the probability that it is "non-motor vehicle signal light" is 0.31, and the probability that it is "crosswalk signal light" is 0.08. The type with the highest probability can be selected as the prediction result of the traffic sign, that is, "motor vehicle signal light" is used as the recognition result of the traffic sign.
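The three-class example can be sketched as follows. The softmax implementation is the standard one; the function names, the label list constant and the raw scores passed to `softmax` are hypothetical, while the probability vector [0.61, 0.31, 0.08] is the one given in the example above.

```python
import math

LABELS = ["motor vehicle signal light",
          "non-motor vehicle signal light",
          "crosswalk signal light"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_type(probabilities, labels=LABELS):
    """Pick the traffic sign type with the highest probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best]


probs = softmax([2.0, 1.3, 0.0])             # hypothetical raw scores
label = predict_type([0.61, 0.31, 0.08])     # "motor vehicle signal light"
```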

Taking four binary classifications as an example, the output of the multiple binary classification functions is described as follows. For example, the sign attributes that the multiple binary classification functions can predict are "green light straight", "red light left turn", "green light right turn", and "red light right turn", and the output result may be represented by a vector, for example a 4 x 1-dimensional vector, in which each element corresponds to one sign attribute and represents the probability that the traffic sign has that attribute. Assuming that the output vector is [0.51, 0.15, 0.22, 0.62], the probability of "green light straight" is 0.51, the probability of "red light left turn" is 0.15, the probability of "green light right turn" is 0.22, and the probability of "red light right turn" is 0.62. Assuming that the preset threshold is 0.5, the attributes whose probability is greater than the preset threshold are used as the prediction result of the traffic sign, that is, "green light straight" and "red light right turn" are used as the recognition result of the multiple binary classification functions.
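The thresholding rule for the sign attributes can be sketched as follows. The attribute order and the helper names `sigmoid` and `select_attributes` are assumptions made for illustration; the probabilities and the threshold of 0.5 come from the example above.

```python
import math

# Attribute order is an assumption chosen to match the example vector.
ATTRIBUTES = [
    "green light straight",
    "red light left turn",
    "green light right turn",
    "red light right turn",
]

def sigmoid(x):
    """One independent binary prediction: map a raw score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def select_attributes(probs, threshold=0.5, labels=ATTRIBUTES):
    """Keep every attribute whose probability exceeds the preset threshold."""
    return [label for label, p in zip(labels, probs) if p > threshold]

selected = select_attributes([0.51, 0.15, 0.22, 0.62])
print(selected)
```

Unlike the softmax case, each sigmoid output is judged independently, so zero, one, or several attributes may be selected for the same traffic sign.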

In the embodiment of the application, the traffic sign image is input into the classification network for sign type detection, so that the pixel information and the shape information of the traffic sign image can be accurately determined. This largely improves the accuracy of the recognition result and thereby achieves traffic sign recognition with higher accuracy.

In another embodiment of the present application, a specific implementation of a training process for training a classification network is also provided. The training process comprises: acquiring historical traffic sign area images, determining the historical traffic sign images based on the historical traffic sign area images, and marking the historical traffic sign images with corresponding historical classification results; and then training a classification network based on the historical traffic sign images and the historical classification results.

Specifically, after the historical traffic sign area image is acquired, it may be cropped to obtain a cropped historical traffic sign area image, and size adjustment and data enhancement processing may then be performed on the cropped image to obtain the historical traffic sign image. For example, the cropped historical traffic sign area image may be adjusted to different sizes and scales and then subjected to data enhancement processing to obtain the historical traffic sign image.

It should be noted that data enhancement is a technique for artificially expanding a training data set by letting limited data produce more equivalent data. The data enhancement algorithms in the field of computer vision can be roughly divided into two types: the first is data enhancement based on image processing techniques and the second is data enhancement algorithms based on deep learning.

The data enhancement based on image processing techniques described above may include geometric transformations, color transformations, rotation/reflection transformations, noise injection, kernel filters, image mixing, random erasing, scaling transformations, shifting, flipping transformations, cropping, and the like. The data enhancement algorithms based on deep learning described above may include feature space enhancement, adversarial generation, GAN-based data enhancement (GAN-based Data Augmentation), neural style transfer, and the like.
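Three of the image-processing operations listed above (flipping, noise injection and random erasing) can be sketched with NumPy as follows. The function names, the noise level `sigma` and the erase fraction `max_frac` are assumptions for illustration and are not taken from the application.

```python
import numpy as np

def flip_horizontal(img):
    """Mirror an H x W x C image left to right.
    Note: for arrow-shaped lights this changes the direction, so the label
    would have to be flipped as well (an assumption about label handling)."""
    return img[:, ::-1].copy()

def add_gaussian_noise(img, rng, sigma=8.0):
    """Noise injection: add zero-mean Gaussian noise and clip to [0, 255]."""
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def random_erase(img, rng, max_frac=0.3):
    """Random erasing: zero out one randomly placed rectangle."""
    h, w = img.shape[:2]
    eh = int(rng.integers(1, max(2, int(h * max_frac) + 1)))
    ew = int(rng.integers(1, max(2, int(w * max_frac) + 1)))
    top = int(rng.integers(0, h - eh + 1))
    left = int(rng.integers(0, w - ew + 1))
    out = img.copy()
    out[top:top + eh, left:left + ew] = 0
    return out

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in image
augmented = [flip_horizontal(image), add_gaussian_noise(image, rng), random_erase(image, rng)]
```

Each function returns a new array, so one source image can yield several training samples.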

Compared with the prior art, the cut historical traffic area image has no shape limitation of a traffic body background area, and for the traffic signs of rare categories, the data enhancement can be conveniently carried out by using other traffic signs of the same type and color, so that the historical traffic sign images with sufficient quantity and balanced categories are obtained, and the trained classification network has higher accuracy.

In the process of training the classification network based on the historical traffic sign image and the historical classification result, the historical traffic sign image can be input into a second initial network for sign type detection to obtain the pixel information and shape information of the historical traffic sign image. Semantic prediction is performed based on this pixel information and shape information to obtain the prediction semantics of the historical traffic sign image, and the prediction result of the historical traffic sign image is determined according to the prediction semantics. A loss function is then calculated from the prediction result of the historical traffic sign image and the historical classification result, and the parameters of the second initial network are iteratively adjusted with an iterative algorithm so as to minimize the loss function, thereby obtaining the classification network.

Specifically, after obtaining the historical traffic sign images, the computer device may randomly divide them into a training set and a verification set according to a certain proportion, where the training set is used to train the second initial network to obtain a trained classification network, and the verification set is used to verify the performance of the trained classification network. The historical traffic sign images of the training set are then input into the second initial network for sign type detection and are processed by an initial feature extraction network and an initial target classification network. In the initial feature extraction network, a Conv convolution layer, a Bn layer and a Relu layer connected in sequence perform convolution, normalization and activation function processing on the historical traffic sign images to obtain output features, and a DW-PW convolution module and a pooling module then perform feature extraction and feature dimension reduction on the output features to extract the pixel information and shape information of the historical traffic sign images.
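The random division into a training set and a verification set can be sketched as follows. The 80/20 ratio, the fixed seed and the name `split_dataset` are assumptions; the application only states that the split follows "a certain proportion".

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle the samples and cut them into (training set, verification set)."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # local generator, global state untouched
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

# Stand-in identifiers for historical traffic sign images.
train_set, verification_set = split_dataset(range(10))
print(len(train_set), len(verification_set))
```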

The initial target classification network then processes the pixel information and the shape information of the historical traffic sign image through a fully connected layer to obtain a sample fully connected vector, and the sample fully connected vector is processed by the activation function to obtain the corresponding output result. The feature extraction network and the target classification network to be constructed are trained with the training set to obtain the feature extraction network and the target classification network to be verified.

The second initial model is a neural network model that takes the historical traffic sign image as input and outputs the prediction result of the historical traffic sign image; it has the capability of extracting information from, and classifying, the historical traffic sign image. The second initial model may be the initial model at the start of iterative training, that is, its model parameters are in an initial state, or it may be the model as adjusted in the previous round of iterative training, that is, its model parameters are in an intermediate state. The historical traffic sign image may be input into the second initial model to obtain an output. The output includes the traffic sign type of the historical traffic sign image, and may also include a plurality of traffic sign attributes of the historical traffic sign image under that traffic sign type.

In the process of training the classification network, the computer device applies the feature extraction network and the target classification network to be verified to the verification set, and optimizes them by minimizing the loss function to obtain the feature extraction network and the target classification network. Specifically, the parameters in the feature extraction network and the target classification network to be constructed are updated according to the difference between the output of the classification network to be verified on the verification set and the labeling result of the verification set, so as to train the feature extraction network and the target classification network. The labeling result may be a traffic sign identification result obtained by manually labeling a historical image.

Optionally, the parameters updated in the feature extraction network and the target classification network to be verified may be matrix parameters such as weight matrices and bias matrices. The weight matrices and bias matrices include, but are not limited to, the matrix parameters of the convolutional layers, feed-forward network layers and fully connected layers in the feature extraction network and the target classification network to be verified.

In the embodiment of the application, the loss function can be used to calculate the loss value between the result obtained by inputting the verification set into the classification network to be verified and the label result, so that the parameters in the feature extraction network and the target classification network to be verified are updated. Optionally, the loss function may be a cross-entropy loss function, a normalized cross-entropy loss function, or a focal loss.
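For a single sample, the cross-entropy loss and the focal loss mentioned above can be sketched as follows. The function names and the default values `gamma=2.0` and `alpha=1.0` are assumptions for illustration; the application does not specify hyperparameters.

```python
import math

def cross_entropy(probs, target_index, eps=1e-12):
    """Negative log-probability assigned to the labelled class."""
    return -math.log(max(probs[target_index], eps))

def focal_loss(probs, target_index, gamma=2.0, alpha=1.0, eps=1e-12):
    """Cross-entropy scaled by (1 - p_t)^gamma, which down-weights easy samples."""
    p_t = max(probs[target_index], eps)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

probs = [0.61, 0.31, 0.08]  # example output vector from the text
ce = cross_entropy(probs, 0)
fl = focal_loss(probs, 0)
print(ce, fl)
```

With `gamma=0` and `alpha=1` the focal loss reduces to the cross-entropy loss, which makes the two easy to compare on class-imbalanced sign data.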

When the parameters of the feature extraction network and the target classification network to be verified are updated through the loss function, if it is determined from the loss function that the networks have not converged, the parameters in the model are adjusted until the networks converge, so that the feature extraction network and the target classification network are obtained. Convergence of the feature extraction network and the target classification network to be verified may mean that the difference between their output result on the verification set and the labeling result is smaller than a preset threshold, or that the rate of change of this difference approaches a certain low value. When the calculated loss function is small, or the difference between it and the loss function output in the previous iteration is close to 0, the feature extraction network and the target classification network to be verified are considered to have converged, and the classification network is obtained.
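The two convergence criteria described above (a small loss, or a negligible change from the previous iteration) can be sketched as a simple check. The threshold values and the name `has_converged` are assumptions chosen for illustration.

```python
def has_converged(loss_history, loss_threshold=1e-3, delta_threshold=1e-5):
    """Return True when the latest loss is small enough, or when it has
    stopped changing relative to the previous iteration."""
    if not loss_history:
        return False
    if loss_history[-1] < loss_threshold:
        return True
    if len(loss_history) >= 2:
        return abs(loss_history[-1] - loss_history[-2]) < delta_threshold
    return False

print(has_converged([0.9, 0.5]), has_converged([0.3, 0.3000001]))
```

A training loop would append the loss of each iteration to `loss_history` and stop adjusting parameters once the check returns True.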

It should be noted that, in the embodiment of the present application, the iterative training of the feature extraction network and the iterative training of the target classification network in the classification network are two independent processes: only the feature extraction network, or only the target classification network, may be iteratively trained. Of course, both networks may also be iteratively trained; the order in which they are trained is not limited, and within one round of iterative training they may be trained serially or in parallel.

For a better understanding of the embodiments of the present application, the complete flow of the traffic sign identification method proposed in the present application is further described below. As shown in fig. 13, the method may include the following steps:

S501, carrying out traffic sign body detection on the traffic image to be detected to obtain a traffic body area image corresponding to the traffic image to be detected.

Specifically, please refer to fig. 14. Taking the case where the traffic sign is a traffic light as an example, image acquisition may be performed by an image acquisition device or a vehicle-mounted photographing device; specifically, the road ahead of the vehicle is photographed to obtain the traffic image to be detected, which includes not only the traffic sign (traffic light) to be identified but also background information.

After the traffic image to be detected is acquired, the computer device may compress the directly acquired traffic image to be detected by adopting a lossy compression mode or a lossless compression mode to obtain a processed traffic image to be detected, in consideration of limited computing resources.

The number of anchor frames at each position of the output layer of the lamp body detection network is then determined. The width and the height of the processed traffic image to be detected can be used as features, and the processed traffic image to be detected is clustered with a k-means algorithm to obtain a plurality of clusters; the number of clusters may be used as the number of anchor frames, and the size of each anchor frame is determined according to the length and the width corresponding to the position coordinates of the center point of each cluster. The traffic image to be detected is input into the basic convolution module and the separable convolution module in the lamp body detection network. It first passes through the convolution layer, the normalization layer and the activation layer in the basic convolution module to obtain a feature map; the feature map is convolved by the separable convolution module to obtain a convolved feature map; and the convolved feature map then passes through the convolution layer, the normalization layer and the activation layer in the basic convolution module to obtain the position offset and the size offset of the prediction frame relative to the anchor frame. A preset frame corresponding to the anchor frame is determined based on the position of the anchor frame, the size of the anchor frame, and the position offset and the size offset corresponding to the anchor frame, and invalid preset frames are filtered out with a non-maximum suppression algorithm to obtain the lamp body area image.

S502, carrying out size adjustment processing on the traffic body area image according to the preset size parameters.

Taking the case where the traffic sign is a traffic signal lamp as an example, after the lamp body area image is obtained, it can be cropped to obtain a cropped lamp body area image, which is then preprocessed. The position information and the size information of the cropped lamp body area image are obtained, and based on them the lamp body area image is scaled according to a preset size parameter, for example a preset side length s, so that after image scaling (resize) its long side is s. Then, for the short side of the lamp body area image, expansion processing may be performed according to a preset ratio with the traffic lamp area as the center, for example by image expansion (padding) applied symmetrically on both sides, so that the size of the processed lamp body area image is s × s. Finally, the coordinates of the traffic lamp area are mapped onto the processed lamp body area image of size s × s according to the mapping relationship between the lamp body area image and the processed lamp body area image.
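The geometry of this resize-and-pad step, including the coordinate mapping, can be sketched as follows. Only the arithmetic is shown; the pixel resampling itself is omitted. The function names `letterbox_params` and `map_box` and the sample numbers are assumptions introduced for illustration.

```python
def letterbox_params(width, height, s):
    """Scale so the long side equals s, then pad the short side symmetrically.
    Returns (scale, pad_x, pad_y), where pad_* is the padding on each side."""
    scale = s / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    return scale, (s - new_w) / 2.0, (s - new_h) / 2.0

def map_box(box, scale, pad_x, pad_y):
    """Map (x1, y1, x2, y2) from the original image onto the s x s image."""
    x1, y1, x2, y2 = box
    return (x1 * scale + pad_x, y1 * scale + pad_y,
            x2 * scale + pad_x, y2 * scale + pad_y)

# A 200 x 100 lamp body region resized to s = 100 (illustrative numbers).
scale, pad_x, pad_y = letterbox_params(200, 100, 100)
mapped = map_box((20, 40, 60, 80), scale, pad_x, pad_y)
print(scale, pad_x, pad_y, mapped)
```

Keeping `scale` and the padding values also allows detections on the s × s image to be mapped back to the original lamp body area image by inverting the same transform.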

S503, determining N anchor frames in the traffic body area image, inputting the traffic body area image into a detection network, and performing feature extraction on the area corresponding to each anchor frame to obtain the position offset and the size offset corresponding to each anchor frame.

Specifically, the size of the lamp body area image, including its width and height, can be obtained. For each feature point in the lamp body area image, the width and the height are used as detection features, and a k-means algorithm is used to cluster on these detection features to obtain N clusters. The size of each anchor frame is determined based on the length and the width corresponding to the position coordinates of the center point of each cluster, that is, the cluster centers of the N clusters are taken as the widths and heights of the corresponding anchor frames. Anchor frames of different types differ in size and position.
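A minimal k-means sketch for deriving N anchor sizes from (width, height) pairs is given below. It assumes that the clustered features are the widths and heights of labelled regions, uses Euclidean distance, and initializes the centers from evenly spaced samples sorted by area; these choices and the name `kmeans_anchors` are assumptions, not details given in the application.

```python
import numpy as np

def kmeans_anchors(sizes, n_anchors, n_iter=50):
    """Cluster (width, height) pairs; the cluster centers become anchor sizes."""
    sizes = np.asarray(sizes, dtype=np.float64)
    # Deterministic initialization (assumption): evenly spaced samples by area.
    order = np.argsort(sizes.prod(axis=1))
    picks = np.linspace(0, len(sizes) - 1, n_anchors).round().astype(int)
    centers = sizes[order[picks]].copy()
    for _ in range(n_iter):
        # Distance from every sample to every center: (num_samples, n_anchors).
        dists = np.linalg.norm(sizes[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(n_anchors):
            members = sizes[assign == k]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers.prod(axis=1))]  # sorted by area

sizes = [(10, 10), (12, 11), (11, 12), (40, 80), (42, 78), (41, 82)]
anchors = kmeans_anchors(sizes, n_anchors=2)
print(anchors)
```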

Referring to fig. 15, when the acquired lamp body area image has 3 input channels and an image size of 144 × 256, the lamp body area image is input to the LED detection network and processed by the ConvBnRelu module in the LED detection network to obtain a first feature map with 16 output channels and an image size of 72 × 128. The first feature map is then subjected to feature extraction by the separable convolution module in the LED detection network to obtain a second feature map with 256 output channels and an image size of 18 × 32. The second feature map is further processed by the ConvBnRelu module in the LED detection network to obtain an output result with 6N output channels and an image size of 18 × 32. The output result comprises, for each type of anchor frame, the position offset (center abscissa offset and center ordinate offset), the size offset (width offset and height offset), the confidence corresponding to the prediction frame, and the predicted category result.

It should be noted that, in order to reduce the amount of calculation, the above separable convolution adopts a depthwise-pointwise design. The number of output channels of the LED detection network is N × (4 + 1 + c), where N is the determined number of anchor frames and 4 is the number of position offset and size offset parameters for each anchor frame: the position offset may include the offsets of the center abscissa and the center ordinate, and the size offset may include the width offset and the height offset. 1 is the number of confidence values of the traffic sign in the preset frame, and c is the number of target categories of the traffic sign in the preset frame; if the category of the traffic sign in the preset frame is determined, c is 1. In that case, the number of output channels of the LED detection network is N × (4 + 1 + 1), i.e., 6 × N.
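The channel count N × (4 + 1 + c) can be illustrated by splitting an output tensor into its per-anchor fields. The channel ordering (offsets, then confidence, then categories, grouped by anchor) and the name `split_detection_output` are assumptions; the application states only the total number of channels.

```python
import numpy as np

def split_detection_output(output, num_anchors, num_classes=1):
    """Split a (N * (4 + 1 + c), H, W) tensor into named per-anchor fields."""
    channels, h, w = output.shape
    per_anchor = 4 + 1 + num_classes
    if channels != num_anchors * per_anchor:
        raise ValueError("channel count does not match N * (4 + 1 + c)")
    grid = output.reshape(num_anchors, per_anchor, h, w)
    return {
        "offsets": grid[:, :4],     # center x, center y, width, height offsets
        "confidence": grid[:, 4],   # one confidence value per anchor
        "classes": grid[:, 5:],     # c category scores per anchor
    }

N = 3  # illustrative number of anchor frames
fields = split_detection_output(np.zeros((6 * N, 18, 32)), num_anchors=N)
print({name: value.shape for name, value in fields.items()})
```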

The ConvBnRelu module comprises a Conv convolution layer, a Bn layer and a Relu layer connected in sequence. In the process of obtaining the first feature map through the ConvBnRelu module in the LED detection network, image features such as edges and textures in each anchor frame in the lamp body area can be extracted by the Conv convolution layer to obtain the image features of the lamp body area image; noise features in the image features are filtered by the Bn layer to obtain filtered image features; and the filtered image features are then subjected to nonlinear mapping by the Relu layer (activation function), which enhances the generalization capability of the feature extraction model and yields the first feature map with a size of 72 × 128. The first feature map comprises a plurality of feature points.

The separable convolution module can comprise a plurality of DW-PW layers connected in sequence. After the first feature map with an image size of 72 × 128 is obtained, it can be input into the first DW-PW layer of the separable convolution module for convolution extraction to obtain a feature map with 64 output channels and an image size of 36 × 64. This feature map is then processed by the second DW-PW layer to obtain a feature map with 128 output channels and an image size of 18 × 32, and similarly that feature map is processed sequentially by the third DW-PW layer and the fourth DW-PW layer to obtain the second feature map with 256 output channels and an image size of 18 × 32. The second feature map comprises a plurality of feature points.
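The feature map sizes quoted for the detection network can be reproduced with the standard convolution output-size formula. The kernel size of 3, padding of 1, the per-layer strides, and the 256 channels after the third DW-PW layer are assumptions chosen so that the trace matches the sizes in the text; the application does not state them.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_shapes(height, width, layers):
    """Return (name, channels, height, width) after each layer."""
    shapes = []
    for name, out_channels, stride in layers:
        height = conv_out(height, stride=stride)
        width = conv_out(width, stride=stride)
        shapes.append((name, out_channels, height, width))
    return shapes

N = 3  # illustrative number of anchor frames
DETECTION_LAYERS = [          # (name, output channels, assumed stride)
    ("ConvBnRelu", 16, 2),
    ("DW-PW 1", 64, 2),
    ("DW-PW 2", 128, 2),
    ("DW-PW 3", 256, 1),      # channel count of this layer is an assumption
    ("DW-PW 4", 256, 1),
    ("ConvBnRelu head", 6 * N, 1),
]

shapes = trace_shapes(144, 256, DETECTION_LAYERS)
for row in shapes:
    print(row)
```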

Optionally, the DW-PW layer may further include multiple BN layers and a Relu layer, where the BN layer is configured to perform normalization processing to obtain normalized features, and the Relu layer applies an activation function that sets negative values to zero and leaves non-negative values unchanged.

S504, determining a preset frame corresponding to the anchor frame based on the position of the anchor frame, the size of the anchor frame, and the position offset and the size offset corresponding to the anchor frame.

S505, filtering invalid preset frames in the preset frames corresponding to the N types of anchor frames, and obtaining a traffic sign area image based on the areas corresponding to the residual anchor frames.

Specifically, after the position offset and the size offset corresponding to the anchor frame are obtained, a preset correction algorithm may be applied to each type of anchor frame, and the anchor frame is corrected based on the position of the anchor frame, the size of the anchor frame, and the position offset and the size offset corresponding to the anchor frame, so as to obtain N preset frames corresponding to N types of anchor frames, where the N preset frames may include frames with traffic lights or frames without traffic lights. And then filtering invalid preset frames in the N preset frames by adopting a non-maximum suppression algorithm, thereby obtaining the preset frames only containing the traffic light area and further obtaining the corresponding traffic light area images.
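Steps S504 and S505 can be sketched as follows. The application refers only to "a preset correction algorithm", so the offset parametrization used in `decode_box` (center shifts scaled by the anchor size, exponential width and height scaling, as commonly used in anchor-based detectors) is an assumption, as are the function names and the thresholds of 0.5.

```python
import math

def decode_box(anchor, offsets):
    """anchor = (cx, cy, w, h), offsets = (dx, dy, dw, dh).
    Returns the corrected frame as (x1, y1, x2, y2). Assumed parametrization."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = offsets
    pcx, pcy = cx + dx * w, cy + dy * h
    pw, ph = w * math.exp(dw), h * math.exp(dh)
    return (pcx - pw / 2, pcy - ph / 2, pcx + pw / 2, pcy + ph / 2)

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) frames."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, score_thr=0.5, iou_thr=0.5):
    """Non-maximum suppression: drop low-confidence frames, then keep the
    highest-scoring frame among heavily overlapping ones."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (50, 50, 60, 60)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7, 0.2])
print(kept)
```

In this example the second frame overlaps the first too strongly and the fourth has too low a confidence, so only the first and third frames remain as traffic light regions.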

S506, the traffic sign area image is cut to obtain a traffic sign image.

After the traffic light area image is obtained, it can be cropped with image processing software to remove the redundant background area, yielding a traffic light image that contains only the traffic light area. The traffic light image is then adjusted in size: according to the side length s1 of the actually defined target size, it is scaled at a preset scaling ratio so that its size is s1 × s1.

S507, inputting the traffic sign image into a classification network to detect the sign type, so as to obtain the pixel information and the shape information of the traffic sign image.

S508, semantic prediction is carried out on the basis of the pixel information and the shape information to obtain the prediction semantics of the traffic sign image, and the traffic sign recognition result of the traffic image to be detected is determined according to the prediction semantics.

Specifically, as shown in fig. 16, after the traffic light image with an image size of 64 × 64 is obtained, the traffic light image with 3 input channels and an image size of 64 × 64 is input into the LED classification network and processed by the ConvBnRelu module in the LED classification network to obtain a third feature map with 16 output channels and an image size of 32 × 32. The third feature map is then subjected to feature extraction by the separable convolution module in the LED classification network to obtain a fourth feature map with 256 output channels and an image size of 4 × 4. The fourth feature map is max-pooled by the pooling module MaxPool in the LED classification network to obtain a feature map with 256 output channels and an image size of 1 × 1, which is processed by the fully connected layer FC in the LED classification network to obtain a feature map with C1 output channels and an image size of 1 × 1. This feature map is processed by the activation function to obtain an output vector of length C1, in which the value at each position represents the probability of the corresponding category. Each probability value is then compared with a preset threshold, and the categories corresponding to the positions whose probability values are greater than the preset threshold can be used as the prediction result of the traffic light. Suppose that the traffic light recognition results of the traffic image to be detected are "turn-around green light", "left turn red light" and "right turn green light".

After the third feature map with an image size of 32 × 32 is obtained, it may be input into the first DW-PW layer of the separable convolution module for convolution extraction to obtain a feature map with 64 output channels and an image size of 16 × 16. This feature map is then processed by the second DW-PW layer in the separable convolution module to obtain a feature map with 128 output channels and an image size of 8 × 8, and similarly that feature map is processed by the third DW-PW layer in the separable convolution module to obtain the fourth feature map with 256 output channels and an image size of 4 × 4. The fourth feature map comprises a plurality of feature points.
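The final stages of the classification network (max pooling of the 256 × 4 × 4 feature map, the fully connected layer, the activation function and the threshold comparison) can be sketched with NumPy as follows. The sigmoid activation, the toy weights, C1 = 3 and the name `classification_head` are assumptions for illustration only.

```python
import numpy as np

def classification_head(feature_map, weight, bias, threshold=0.5):
    """feature_map: (256, 4, 4); weight: (C1, 256); bias: (C1,).
    Returns the C1 probabilities and the indices above the threshold."""
    pooled = feature_map.max(axis=(1, 2))      # max pooling to 256 x 1 x 1
    logits = weight @ pooled + bias            # fully connected layer, length C1
    probs = 1.0 / (1.0 + np.exp(-logits))      # sigmoid activation (assumed)
    return probs, np.flatnonzero(probs > threshold)

# Toy inputs: a single active feature that drives category 1.
feature_map = np.zeros((256, 4, 4))
feature_map[0, 2, 3] = 1.0
weight = np.zeros((3, 256))
weight[1, 0] = 5.0
bias = np.full(3, -2.0)

probs, selected = classification_head(feature_map, weight, bias)
print(probs.round(3), selected)
```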

Further, when automatic driving or assisted driving is required, the computer device can perform semantic integration processing on the traffic light recognition results "turn-around green light", "left turn red light" and "right turn green light", and thereby derive the semantic prompt information that turning around and turning right are permitted and turning left is forbidden, so that the vehicle can execute the corresponding operation according to the semantic prompt information.
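The semantic integration step can be sketched as a simple mapping from recognition results to permitted and forbidden manoeuvres. Representing each result as a (manoeuvre, colour) pair, the dictionary layout, and the handling of colours other than red and green are all assumptions made for illustration.

```python
def integrate_semantics(results):
    """results: iterable of (manoeuvre, colour) pairs, e.g. ("left turn", "red").
    Green permits the manoeuvre, red forbids it, anything else is left undecided."""
    prompt = {"allowed": [], "forbidden": [], "unknown": []}
    for manoeuvre, colour in results:
        if colour == "green":
            prompt["allowed"].append(manoeuvre)
        elif colour == "red":
            prompt["forbidden"].append(manoeuvre)
        else:
            prompt["unknown"].append(manoeuvre)
    return prompt

# Recognition results from the example above.
prompt = integrate_semantics([
    ("turn around", "green"),
    ("left turn", "red"),
    ("right turn", "green"),
])
print(prompt)
```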

In this embodiment, traffic sign body detection is performed on the traffic image to be detected to obtain a coarse-grained traffic body area image. By determining N anchor frames in the traffic body area image, traffic sign detection can be performed more finely on the area corresponding to each anchor frame, so that the image features of each traffic sign area in the traffic body area image are extracted at a finer granularity. Sign type detection is then performed on the traffic sign area image, and the finer and more comprehensive features are combined to obtain the traffic sign identification result of the traffic image to be detected, which improves the accuracy of the traffic sign identification result.

It should be noted that while the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may change order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

In another aspect, fig. 17 is a schematic structural diagram of a traffic sign recognition apparatus according to an embodiment of the present application. The apparatus may be an apparatus in a terminal or a server. As shown in fig. 17, the apparatus 700 includes:

the traffic body detection module 710 is configured to perform traffic sign body detection on the traffic image to be detected to obtain a traffic body area image corresponding to the traffic image to be detected;

the sign area detection module 720 is configured to determine N anchor frames in the traffic body area image, and perform traffic sign detection on the area corresponding to each anchor frame to obtain a traffic sign area image corresponding to the traffic body area image; the size and the position corresponding to each anchor frame in the N anchor frames are different, and N is an integer greater than or equal to 1;

the identification module 730 is configured to perform sign type detection on the traffic sign area image to obtain type information of the traffic sign in the traffic image to be detected, and obtain a traffic sign identification result of the traffic image to be detected based on the type information.

In some embodiments, please refer to fig. 18, the sign area detection module 720 includes:

an acquisition unit 721 configured to determine N anchor frames in the traffic body area image; the corresponding size and position of each anchor frame are different, and N is an integer greater than or equal to 1;

the feature extraction unit 722 is configured to input the traffic body area image into the detection network, perform feature extraction on the area corresponding to each anchor frame, and obtain a position offset and a size offset corresponding to each anchor frame in the traffic body area image;

a determining unit 723, configured to determine a preset frame corresponding to the anchor frame based on the position of the anchor frame, the size of the anchor frame, and the position offset and the size offset corresponding to the anchor frame;

and a filtering unit 724, configured to filter invalid preset frames in the preset frames corresponding to the N types of anchor frames, and obtain a traffic sign area image based on the area corresponding to the remaining anchor frames.

In some embodiments, the feature extraction unit 722 is specifically configured to:

determining a target area of the traffic body area image based on the anchor frame;

performing basic convolution processing on the target area to obtain a first characteristic diagram; the basic convolution processing sequentially comprises initial convolution processing, normalization processing and activation function processing;

carrying out separable convolution processing on the first characteristic diagram to obtain a second characteristic diagram;

and performing basic convolution processing on the second characteristic diagram to obtain a position offset and a size offset.

In some embodiments, the apparatus is further configured to:

and carrying out size adjustment processing on the traffic body area image according to the preset size parameters.

In some embodiments, the apparatus is further configured to:

acquiring position information and size information of the traffic body area image;

based on the position information and the size information of the traffic body area image, image scaling and expansion processing are carried out on the traffic body area image according to preset size parameters, and mapping processing is carried out on the coordinate position of the traffic body area.

In some embodiments, the training process of the detection network comprises:

obtaining M anchor frames in the historical traffic body image; the corresponding size and position of each anchor frame are different, M is an integer greater than or equal to 1, and the historical traffic body image comprises a marked traffic body area;

inputting the historical traffic body image into a first initial network, and performing feature extraction on the historical traffic body image to obtain position offset and size offset corresponding to each of M anchor frames;

determining a prediction frame corresponding to each of the M anchor frames based on the position of the anchor frame, the size of the anchor frame, and the position offset and the size offset corresponding to the anchor frame;

filtering invalid prediction frames in the prediction frames corresponding to the M anchor frames to obtain a prediction traffic body area of the first initial network;

and iteratively training the first initial network, using an iterative algorithm that minimizes a loss function between the predicted traffic body area and the marked traffic body area, to obtain the detection network.

In some embodiments, the loss function includes a first component, a second component, a third component, and a fourth component;

the first component is used for representing the position offset loss between the prediction frame and the anchor frame;

the second component is used for representing the loss of the size offset between the prediction frame and the anchor frame;

the third component is used for representing the loss between the prediction confidence corresponding to the prediction frame and the real confidence;

the fourth component is used for representing the loss between the prediction category result corresponding to the prediction frame and the real category result.
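
A four-component loss of this kind can be sketched as below. The concrete loss forms (smooth L1 for the two offset terms, binary cross-entropy for confidence, cross-entropy for the category) and the equal weights are assumptions made for illustration; the text only names the four components.

```python
import math

def smooth_l1(pred, target):
    """Smooth L1 loss summed over paired values."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def binary_cross_entropy(p, y, eps=1e-7):
    """Loss between a predicted confidence p in [0, 1] and a 0/1 target y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def cross_entropy(probs, true_idx, eps=1e-7):
    """Loss between predicted class probabilities and the true class index."""
    return -math.log(max(probs[true_idx], eps))

def detection_loss(pred, target, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of position, size, confidence and category losses."""
    parts = (
        smooth_l1(pred["pos"], target["pos"]),              # first component
        smooth_l1(pred["size"], target["size"]),            # second component
        binary_cross_entropy(pred["conf"], target["conf"]), # third component
        cross_entropy(pred["cls"], target["cls"]),          # fourth component
    )
    return sum(w * p for w, p in zip(weights, parts)), parts

# Hypothetical prediction with exact offsets but uncertain confidence and class.
pred = {"pos": (0.1, -0.2), "size": (0.0, 0.3), "conf": 0.5, "cls": [0.5, 0.5]}
target = {"pos": (0.1, -0.2), "size": (0.0, 0.3), "conf": 1, "cls": 0}
total, parts = detection_loss(pred, target)
```

Here the two offset terms are zero and the remaining two terms each contribute ln 2.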

In some embodiments, the identifying module 730 is specifically configured to:

cropping the traffic sign area image to obtain a traffic sign image;

inputting the traffic sign image into a classification network for sign type detection to obtain pixel information and shape information of the traffic sign image;

and performing semantic prediction based on the pixel information and the shape information to obtain the prediction semantics of the traffic sign image, and determining the traffic sign recognition result of the traffic image to be detected according to the prediction semantics.
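
The cropping and semantic prediction steps can be illustrated with a small sketch. It assumes, purely for illustration, that the pixel information is a set of color scores and the shape information a set of shape scores produced by the classification network; the label sets and function names are hypothetical.

```python
COLORS = ("red", "yellow", "green")             # hypothetical pixel (color) classes
SHAPES = ("disc", "left_arrow", "right_arrow")  # hypothetical shape classes

def crop(image, frame):
    """Cut a frame (x1, y1, x2, y2) out of an image stored as a list of rows.

    The right and bottom edges are exclusive; the frame is clipped to the image.
    """
    x1, y1, x2, y2 = frame
    height, width = len(image), len(image[0])
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(width, x2), min(height, y2)
    return [row[x1:x2] for row in image[y1:y2]]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def predict_semantics(color_scores, shape_scores):
    """Combine the best color and the best shape into one semantic label."""
    return f"{COLORS[argmax(color_scores)]} {SHAPES[argmax(shape_scores)]}"

# A 4 x 4 dummy image whose pixel value encodes its position (10 * row + column).
image = [[10 * r + c for c in range(4)] for r in range(4)]
sign = crop(image, (1, 1, 3, 3))
label = predict_semantics([0.1, 0.2, 0.7], [0.1, 0.8, 0.1])
```

In a real system the two score vectors would come from the classification network applied to the cropped traffic sign image.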

In some embodiments, the training process of the classification network comprises:

acquiring historical traffic sign area images, and determining the historical traffic sign images based on each historical traffic sign area image; each historical traffic sign image is marked with a corresponding historical classification result;

training the classification network based on the historical traffic sign images and the historical classification results.

In some embodiments, the apparatus is further configured to:

inputting the historical traffic sign image into a second initial network for sign type detection to obtain pixel information and shape information of the historical traffic sign image;

performing semantic prediction based on the pixel information and the shape information of the historical traffic sign image to obtain the prediction semantics of the historical traffic sign image, and determining the prediction result of the historical traffic sign image according to the prediction semantics;

and calculating a loss function according to the prediction result and the historical classification result of the historical traffic sign image, and iteratively adjusting the parameters of the second initial network with an iterative algorithm that minimizes the loss function, to obtain the classification network.
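
The loss-minimizing parameter update can be sketched with a toy model. Here a linear softmax classifier stands in for the second initial network, cross-entropy is the assumed loss function, and batch gradient descent is the assumed iterative algorithm; none of these choices is stated in the text, and the two-feature samples are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, x):
    """One logit per class: dot product of each weight row with the sample."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def train(samples, labels, n_classes, lr=0.5, epochs=200):
    """Fit weights by batch gradient descent on the mean cross-entropy loss."""
    n_feat, n = len(samples[0]), len(samples)
    W = [[0.0] * n_feat for _ in range(n_classes)]
    losses = []
    for _ in range(epochs):
        grad = [[0.0] * n_feat for _ in range(n_classes)]
        loss = 0.0
        for x, y in zip(samples, labels):
            p = softmax(logits(W, x))
            loss -= math.log(p[y])
            for c in range(n_classes):
                err = p[c] - (1.0 if c == y else 0.0)
                for j in range(n_feat):
                    grad[c][j] += err * x[j]
        for c in range(n_classes):
            for j in range(n_feat):
                W[c][j] -= lr * grad[c][j] / n   # step against the gradient
        losses.append(loss / n)
    return W, losses

def predict(W, x):
    """Class index with the highest logit."""
    z = logits(W, x)
    return max(range(len(z)), key=lambda i: z[i])

# Two hypothetical samples, one per class.
samples, labels = [[1.0, 0.0], [0.0, 1.0]], [0, 1]
W, losses = train(samples, labels, n_classes=2)
```

The recorded `losses` list makes it easy to check that the loss decreases over the iterations.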

In some embodiments, the apparatus is further configured to:

and carrying out size adjustment and data enhancement processing on the historical traffic sign area image to obtain a historical traffic sign image.
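
A minimal sketch of size adjustment followed by data enhancement is shown below, assuming nearest-neighbour resizing and a random brightness change as the enhancement. Both choices and the helper names are hypothetical; the text does not list specific operations.

```python
import random

def resize_nearest(image, out_w, out_h):
    """Nearest-neighbour resize of a grayscale image stored as a list of rows."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]

def jitter_brightness(image, factor):
    """Scale pixel values by a factor and clamp them to the 0..255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]

def augment(image, out_w, out_h, rng):
    """Resize to the preset size, then apply a random brightness change."""
    resized = resize_nearest(image, out_w, out_h)
    return jitter_brightness(resized, rng.uniform(0.8, 1.2))

rng = random.Random(0)   # fixed seed so the sketch is repeatable
sample = augment([[10, 200], [90, 40]], 4, 4, rng)
```

Brightness changes are used here instead of horizontal flips because flipping could turn a left arrow into a right arrow and so change the label.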

It can be understood that the functions of the functional modules of the traffic sign recognition apparatus in this embodiment may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description of the foregoing method embodiment, which is not described herein again.

In summary, with the traffic sign recognition apparatus provided in the embodiment of the present application, on the one hand, after the guidance information of the traffic body area image is obtained, N anchor frames are determined in the traffic body area image so that traffic sign detection can be performed at a finer granularity on the area corresponding to each anchor frame. The image features of each traffic sign area in the traffic body area image are thus extracted at a finer granularity, the traffic signs in the image are identified based on these more detailed features, and the recognition accuracy for traffic signs is improved. On the other hand, by performing sign type detection on the traffic sign area image, finer and more comprehensive features can be combined to determine the traffic sign recognition result of the traffic image to be detected, which improves recognition accuracy compared with the prior art.

In another aspect, an apparatus provided in this embodiment includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the traffic sign identification method as described above.

Referring to fig. 19, fig. 19 is a schematic structural diagram of a computer system of a terminal device according to an embodiment of the present application.

As shown in fig. 19, the computer system 300 includes a Central Processing Unit (CPU) 301 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage section 308 into a Random Access Memory (RAM) 303. The RAM 303 also stores various programs and data necessary for the operation of the system 300. The CPU 301, ROM 302, and RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.

The following components are connected to the I/O interface 305: an input portion 306 including a keyboard, a mouse, and the like; an output portion 307 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 308 including a hard disk and the like; and a communication section 309 including a network interface card such as a LAN card, a modem, or the like. The communication section 309 performs communication processing via a network such as the internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 310 as necessary, so that a computer program read out therefrom is installed into the storage section 308 as necessary.

In particular, according to embodiments of the present application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 309, and/or installed from the removable medium 311. The above-described functions defined in the system of the present application are executed when the computer program is executed by the Central Processing Unit (CPU) 301.

It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units or modules described in the embodiments of the present application may be implemented by software or hardware. The described units or modules may also be provided in a processor, which may be described as: a processor comprising a traffic body detection module, a mark area detection module, and an identification module. In some cases the names of these units or modules do not limit the units or modules themselves; for example, the traffic body detection module may also be described as "a module for performing traffic sign body detection on a traffic image to be detected to obtain a traffic body area image corresponding to the traffic image to be detected".

As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments; or may be separate and not incorporated into the electronic device. The computer-readable storage medium stores one or more programs that, when executed by one or more processors, perform the traffic sign recognition method described herein:

carrying out traffic sign body detection on a traffic image to be detected to obtain a traffic body area image corresponding to the traffic image to be detected;

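determining N anchor frames in the traffic body area image, and detecting a traffic sign in an area corresponding to each anchor frame to obtain a traffic sign area image corresponding to the traffic body area image; the corresponding size and position of each of the N anchor frames are different, wherein N is an integer greater than or equal to 1;

and carrying out sign type detection on the traffic sign area image to obtain type information of the traffic sign in the traffic image to be detected, and obtaining a traffic sign recognition result of the traffic image to be detected based on the type information.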

To sum up, the traffic sign recognition method, apparatus, device, and medium provided in the embodiments of the present application perform traffic sign body detection on a traffic image to be detected to obtain a traffic body area image corresponding to the traffic image to be detected, then determine N anchor frames in the traffic body area image, perform traffic sign detection on the area corresponding to each anchor frame to obtain a traffic sign area image corresponding to the traffic body area image, perform sign type detection on the traffic sign area image to obtain type information of the traffic sign in the traffic image to be detected, and obtain a traffic sign recognition result of the traffic image to be detected based on the type information. Compared with the prior art, on the one hand, after the guidance information of the traffic body area image is obtained, determining the N anchor frames in the traffic body area image allows traffic sign detection to be performed at a finer granularity on the area corresponding to each anchor frame, so that the image features of each traffic sign area in the traffic body area image are extracted at a finer granularity, the traffic signs in the image are identified based on these more detailed features, and the recognition accuracy for traffic signs is improved. On the other hand, by performing sign type detection on the traffic sign area image, finer and more comprehensive features can be combined to determine the traffic sign recognition result of the traffic image to be detected, which improves recognition accuracy compared with the prior art.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims

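1. A traffic sign recognition method, comprising:

carrying out traffic sign body detection on a traffic image to be detected to obtain a traffic body area image corresponding to the traffic image to be detected;

determining N anchor frames in the traffic body area image, and detecting a traffic sign in an area corresponding to each anchor frame to obtain a traffic sign area image corresponding to the traffic body area image; the corresponding size and position of each of the N anchor frames are different, wherein N is an integer greater than or equal to 1;

and carrying out sign type detection on the traffic sign area image to obtain type information of the traffic sign in the traffic image to be detected, and obtaining a traffic sign recognition result of the traffic image to be detected based on the type information.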

2. The method of claim 1, wherein detecting a traffic sign in the region corresponding to each of the anchor frames to obtain a traffic sign region image corresponding to the traffic body area image comprises:

inputting the traffic body area image into a detection network, and performing feature extraction on the area corresponding to each anchor frame to obtain the position offset and the size offset corresponding to each anchor frame in the traffic body area image;

determining a preset frame corresponding to the anchor frame based on the position of the anchor frame, the size of the anchor frame, and the position offset and the size offset corresponding to the anchor frame;

and filtering invalid preset frames in the preset frames corresponding to the N anchor frames, and obtaining the traffic sign area image based on the areas corresponding to the remaining anchor frames.

3. The method according to claim 2, wherein performing feature extraction on the region corresponding to each anchor frame to obtain a position offset and a size offset corresponding to each anchor frame comprises:

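determining a target area of the traffic body area image based on the anchor frame;

performing basic convolution processing on the target area to obtain a first feature map; the basic convolution processing sequentially comprises initial convolution processing, normalization processing and activation function processing;

performing separable convolution processing on the first feature map to obtain a second feature map;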

and performing the basic convolution processing on the second feature map to obtain the position offset and the size offset.

4. The method of claim 1, wherein prior to performing traffic sign detection on the traffic body area image, the method further comprises:

and carrying out size adjustment processing on the traffic body area image according to preset size parameters.

5. The method according to claim 4, wherein the resizing of the traffic body area image according to a preset size parameter comprises:

acquiring position information and size information of the traffic body area image;

based on the position information and the size information of the traffic body area image, carrying out image scaling and expansion processing on the traffic body area image according to preset size parameters, and carrying out mapping processing on the coordinate position of the traffic body area.

6. The method according to claim 2 or 3, wherein the training process of the detection network comprises:

obtaining M anchor frames in the historical traffic body image; the corresponding size and position of each anchor frame are different, M is an integer greater than or equal to 1, and the historical traffic body image comprises a marked traffic body area;

inputting the historical traffic body image into a first initial network, and performing feature extraction on the historical traffic body image to obtain a position offset and a size offset corresponding to each of the M anchor frames;

for each of the M anchor frames, determining a prediction frame corresponding to the anchor frame based on the position of the anchor frame, the size of the anchor frame, and the position offset and the size offset corresponding to the anchor frame;

filtering invalid prediction frames in the prediction frames corresponding to the M anchor frames to obtain a predicted traffic body area of the first initial network;

and iteratively training the first initial network, using an iterative algorithm that minimizes a loss function between the predicted traffic body area and the marked traffic body area, to obtain the detection network.

7. The method of claim 6, wherein the loss function comprises a first component, a second component, a third component, and a fourth component;

the first component is used to characterize the position offset loss of the prediction frame relative to the anchor frame;

the second component is used to characterize the size offset loss of the prediction frame relative to the anchor frame;

the third component is used for representing the loss between the prediction confidence corresponding to the prediction frame and the true confidence;

and the fourth component is used for representing the loss between the prediction category result corresponding to the prediction frame and the real category result.

8. The method according to claim 1, wherein performing the sign type detection on the traffic sign region image to obtain the type information of the traffic sign in the traffic image to be detected, and obtaining the traffic sign identification result of the traffic image to be detected based on the type information comprises:

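cropping the traffic sign area image to obtain a traffic sign image;

inputting the traffic sign image into a classification network for sign type detection to obtain pixel information and shape information of the traffic sign image;

and performing semantic prediction based on the pixel information and the shape information to obtain the prediction semantics of the traffic sign image, and determining the traffic sign recognition result of the traffic image to be detected according to the prediction semantics.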

9. The method of claim 8, wherein the training process of the classification network comprises:

acquiring a historical traffic sign area image, and determining the historical traffic sign image based on the historical traffic sign area image; the historical traffic sign image is marked with a corresponding historical classification result;

training the classification network based on the historical traffic sign images and the historical classification results.

10. The method of claim 9, wherein training the classification network based on the historical traffic sign images and the historical classification results comprises:

inputting the historical traffic sign image into a second initial network for sign type detection to obtain pixel information and shape information of the historical traffic sign image;

performing semantic prediction based on the pixel information and the shape information of the historical traffic sign image to obtain the prediction semantics of the historical traffic sign image, and determining the prediction result of the historical traffic sign image according to the prediction semantics;

and calculating a loss function according to the prediction result of the historical traffic sign image and the historical classification result, and iteratively adjusting the parameters of the second initial network with an iterative algorithm that minimizes the loss function, to obtain the classification network.

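11. The method of claim 9, wherein determining a historical traffic sign image based on the historical traffic sign region image comprises:

and carrying out size adjustment and data enhancement processing on the historical traffic sign area image to obtain the historical traffic sign image.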

12. A traffic sign recognition apparatus, said apparatus comprising:

the traffic body detection module is used for detecting a traffic sign body of a traffic image to be detected to obtain a traffic body area image corresponding to the traffic image to be detected;

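the mark area detection module is used for determining N anchor frames in the traffic body area image, and detecting a traffic sign in an area corresponding to each anchor frame to obtain a traffic sign area image corresponding to the traffic body area image; the corresponding size and position of each of the N anchor frames are different, wherein N is an integer greater than or equal to 1;

and the identification module is used for carrying out sign type detection on the traffic sign area image to obtain the type information of the traffic sign in the traffic image to be detected and obtaining the traffic sign identification result of the traffic image to be detected based on the type information.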

13. A computer device, characterized in that the computer device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, the processor being adapted to implement the traffic sign recognition method according to any one of claims 1-11 when executing the program.

14. A computer-readable storage medium, characterized in that a computer program is stored thereon for implementing a traffic sign recognition method according to any one of claims 1-11.

15. A computer program product comprising instructions which, when executed, implement a method of identifying a traffic sign according to any one of claims 1 to 11.