CN116259040A - Method and device for identifying traffic sign and electronic equipment - Google Patents

Method and device for identifying traffic sign and electronic equipment

Info

Publication number
CN116259040A
Authority
CN
China
Prior art keywords
image
traffic sign
feature map
identified
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310271198.6A
Other languages
Chinese (zh)
Inventor
李宁
万如
贾双成
郭杏荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhidao Network Technology Beijing Co Ltd
Original Assignee
Zhidao Network Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhidao Network Technology Beijing Co Ltd filed Critical Zhidao Network Technology Beijing Co Ltd
Priority to CN202310271198.6A
Publication of CN116259040A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of traffic signs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The application relates to a method, an apparatus, and an electronic device for identifying traffic signs. The method comprises: obtaining an image to be identified; and processing the image to be identified with a trained traffic sign recognition model to obtain a traffic sign. The traffic sign recognition model includes a feature map extraction module, which extracts a feature map from the image to be identified and processes the image to be identified and/or at least part of the feature map to obtain an adjusted feature map in which part of the information is missing, and a fusion classification module, which fuses the feature map with the adjusted feature map and determines the traffic sign based on the fused result. The method and apparatus can improve recognition of traffic signs whose images are partially missing or partially of poor quality.

Description

Method and device for identifying traffic sign and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, and an electronic device for identifying traffic signs.
Background
With the rapid development of computer technology and artificial intelligence, artificial intelligence is being applied in more and more scenarios, such as intelligent transportation and image recognition.
Accurate identification of traffic signs plays an important role in realizing automatic driving. For example, an autonomous vehicle may follow a lane, turn, or observe a speed limit in accordance with traffic signs so as to comply with traffic regulations. In the related art, traffic signs may be obtained through image recognition.
The applicant found that the recognition performance of the related art on traffic signs in certain special scenes still needs improvement. For example, the traffic sign in a captured image may be partially occluded, or at least a partial area of the traffic sign may be unclear or deformed due to road conditions and the like, so that the accuracy of the traffic sign recognition result cannot meet user requirements.
Disclosure of Invention
In order to solve, or partially solve, the above problems in the related art, the present application provides a method, an apparatus, and an electronic device for identifying traffic signs, which can effectively improve recognition in scenes where the traffic sign image is partially missing or partially of poor quality.
A first aspect of the present application provides a method of identifying traffic signs, comprising: obtaining an image to be identified; and processing the image to be identified with a trained traffic sign recognition model to obtain a traffic sign. The traffic sign recognition model includes: a feature map extraction module, configured to extract a feature map from the image to be identified and to process the image to be identified and/or at least part of the feature map to obtain an adjusted feature map in which part of the information is missing; and a fusion classification module, configured to fuse the feature map with the adjusted feature map and to determine the traffic sign based on the fused result.
According to some embodiments of the present application, the feature map extraction module includes: a convolutional neural network, comprising an input layer and at least two convolutional layers connected in series, configured to perform convolution operations on the image to be identified to obtain feature maps; and an image processing unit, comprising a plurality of processing subunits respectively connected to the input layer or to a convolutional layer, configured to process the image to be identified and/or the feature maps output by at least some of the convolutional layers to obtain an image to be identified and/or feature maps in which part of the information is missing.
According to some embodiments of the present application, the fusion classification module includes: a first feature fusion unit, configured to fuse the image to be identified and/or the feature maps with missing partial information to obtain the adjusted feature map; and a second feature fusion unit, configured to concatenate the feature map with the adjusted feature map so that the traffic sign is determined based on the concatenated result.
According to some embodiments of the present application, the plurality of processing subunits each correspond to a convolution kernel having at least one element value of zero.
According to some embodiments of the present application, the non-zero elements of the convolution kernel conform to a Gaussian distribution.
According to certain embodiments of the present application, the image processing unit further comprises: an interpolation subunit, configured to perform bilinear interpolation on the feature maps with missing partial information to obtain an image to be identified and/or feature maps of the same size; the first feature fusion unit is then specifically configured to fuse the image to be identified and/or feature maps of the same size to obtain the adjusted feature map.
According to certain embodiments of the present application, the above method further comprises: associating traffic sign image data with annotation data to generate sample data; randomly grouping the sample data to obtain training data and test data; after training the traffic sign recognition model with the training data, processing the test data with the trained model to obtain a test result; and comparing the test result with the annotation data of the test data to determine the accuracy of the recognition result output by the traffic sign recognition model.
A second aspect of the present application provides an apparatus for identifying traffic signs, comprising an image acquisition module and an image recognition module. The image acquisition module is configured to obtain an image to be identified; the image recognition module is configured to process the image to be identified with a trained traffic sign recognition model to obtain a traffic sign. The traffic sign recognition model includes: a feature map extraction module, configured to extract a feature map from the image to be identified and to process the image to be identified and/or at least part of the feature map to obtain an adjusted feature map in which part of the information is missing; and a fusion classification module, configured to fuse the feature map with the adjusted feature map and to determine the traffic sign based on the fused result.
A third aspect of the present application provides an electronic device, comprising: a processor; and a memory having executable code stored thereon that, when executed by the processor, causes the processor to perform the method.
A fourth aspect of the present application also provides a computer readable storage medium having executable code stored thereon, which when executed by a processor of an electronic device, causes the processor to perform the above-described method.
A fifth aspect of the present application also provides a computer program product comprising executable code which when executed by a processor implements the above method.
According to the method, apparatus, and electronic device for identifying traffic signs provided by the present application, a traffic sign is identified by extracting features of a target object in the image to be identified. When extracting these features, problems such as incomplete traffic sign images or partially poor image quality caused by occlusion, road conditions, and the like are taken into account. When extracting image features, the traffic sign recognition model actively discards part of the information of the image to be identified and/or part of the information of the feature map, thereby simulating scenes in which the traffic sign image is incomplete or of poor quality. In this way, the adjusted feature map extracted by the trained traffic sign recognition model can better cope with scenes in which the traffic sign is incomplete or partially of poor image quality. The present application can therefore effectively improve the recognition of traffic signs whose images are partially missing or partially of poor quality.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
FIG. 1 illustrates an exemplary system architecture that may be applied to a method, apparatus, and electronic device for identifying traffic signs according to embodiments of the present application;
FIG. 2 schematically illustrates an application scenario diagram for identifying traffic signs according to an embodiment of the present application;
FIG. 3 schematically illustrates a flow chart of a method of identifying traffic signs according to an embodiment of the present application;
FIG. 4 schematically illustrates a topology of a traffic sign recognition model according to an embodiment of the present application;
FIG. 5 schematically illustrates a structural diagram of a feature map extraction module according to an embodiment of the present application;
FIG. 6 schematically illustrates a structural schematic of a traffic sign recognition model according to an embodiment of the present application;
FIG. 7 schematically illustrates a schematic diagram of an adjusted feature map according to an embodiment of the present application;
FIG. 8 schematically illustrates a schematic diagram of a bilinear interpolation calculation process according to an embodiment of the present application;
FIG. 9 schematically illustrates a block diagram of an apparatus for identifying traffic signs according to an embodiment of the present application;
FIG. 10 schematically shows a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present application, "a plurality" means two or more, unless explicitly defined otherwise.
With the development of deep learning, deep learning algorithms trained on image data, such as fully convolutional networks (FCN), face recognition networks (e.g., FaceNet), and lightweight convolutional networks (e.g., MobileNet), have been developed continuously. These technologies recognize complete, clear traffic sign images in captured photos well. However, through real-road driving tests the applicant found that road conditions vary greatly and traffic signs appear in many different states, and that the related art has difficulty accurately recognizing traffic signs that are partially occluded, incomplete, or of poor quality. The recognition of incomplete or poor-quality traffic sign images in captured photos therefore needs to be strengthened. For example, the vehicle itself or other obstacles may occlude part of the traffic sign image. An uneven road surface may cause part of a road-surface traffic marking to be missing or deformed in the captured image. Rut coverage or aging damage of the sign may likewise make part of the traffic sign image missing or blurred. In these scenes the related art easily outputs wrong traffic sign information, creating safety hazards. Therefore, recognizing traffic signs as accurately as possible, for example not outputting erroneous results for signs with partially missing images or poor image quality, is important for improving driving safety.
For example, in the scenario of recognizing ground markings for high-precision maps, very complex cases may exist, such as vehicles covering the markings, aged road surfaces, and different road-surface materials. High-precision data cannot completely cover these complex scenes; the various situations can only be simulated by means of the traffic sign recognition model.
In order to improve recognition in the above cases, the related art may employ a convolutional neural network to extract, layer by layer in an increasingly abstract manner, the image features of objects of different sizes, and then fuse features of different dimensions through, for example, skip connections. This helps recognize objects of different sizes in the image, but still has limitations. For example, according to experiments, although the quality of the reconstructed image is obviously improved, only relatively complete traffic signs in the image can be recognized well; the recognition of traffic signs that are incomplete or partially of poor image quality still needs further improvement.
In addition, the related art may also use traditional machine learning algorithms such as support vector machines (SVM), iterative algorithms (e.g., AdaBoost), and decision trees for recognition, but these suffer from poor recognition performance, low efficiency, and an inability to process image data in parallel, and therefore cannot complete the high-precision map image recognition task.
Furthermore, deep learning algorithms trained on image data, such as U-Net and semantic segmentation networks (e.g., SegNet, DeepLab), have been developed successively. Testing these algorithms on the experimental data of the present application, the applicant found that the recognition effect was poor: road markings that are incomplete or unclear are difficult to identify accurately. Such recognition cannot meet the accuracy requirements of practical application scenarios, so a new recognition scheme suitable for these scenes is required.
The present application provides a high-precision map ground marking recognition scheme based on confidence convolution. Image information is extracted using convolution and Gaussian-like operators, and the extracted multi-layer information is fused before being output. The scheme effectively improves recognition accuracy and performs well in engineering applications.
Specifically, the present application processes the image to be identified with a trained traffic sign recognition model. When extracting image features, the model uses a Gaussian-like operator to actively discard part of the information of the image to be identified and/or part of the information of the feature map, thereby simulating scenes in which the traffic sign image is incomplete or of poor quality. The adjusted feature map extracted by the trained model can therefore better cope with scenes in which the traffic sign is incomplete or partially of poor image quality, effectively improving the recognition effect, especially in application scenarios where the traffic sign image is partially missing or partially of poor quality.
A method, apparatus and electronic device for identifying traffic signs according to embodiments of the present application will be described in detail below with reference to fig. 1 to 10.
Fig. 1 illustrates an exemplary system architecture that may be applied to a method, apparatus, and electronic device for identifying traffic signs according to embodiments of the present application. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present application may be applied to help those skilled in the art understand the technical content of the present application, and does not mean that the embodiments of the present application may not be used in other devices, systems, environments, or scenarios.
Referring to FIG. 1, a system architecture 100 according to this embodiment may include mobile platforms 101, 102, 103, a network 104, and a cloud 105. The network 104 is the medium used to provide communication links between the mobile platforms 101, 102, 103 and the cloud 105, and may include various connection types, such as wired links, wireless communication links, or fiber optic cables. Devices such as cameras and lidars can be mounted on the mobile platforms 101, 102, 103 to realize functions such as identifying traffic signs, identifying obstacles, and shooting video.
The user may interact with other mobile platforms and cloud 105 over network 104 using mobile platforms 101, 102, 103 to receive or send information, etc., such as sending model training requests, model parameter download requests, and receiving trained model parameters, etc. The mobile platforms 101, 102, 103 may be installed with various communication client applications, such as, for example, driving assistance applications, autopilot applications, vehicle applications, web browser applications, database class applications, search class applications, instant messaging tools, mailbox clients, social platform software, and the like.
Mobile platforms 101, 102, 103 include, but are not limited to, automobiles, robots, tablet computers, laptop computers, and the like, which may support functions such as accessing the Internet, acquiring point cloud data, capturing video, and human-machine interaction.
The cloud 105 may receive model training requests, model parameter download requests, and the like, adjust model parameters to perform model training, issue model topologies and trained model parameters, and may also send road weather information, real-time traffic information, and the like to the mobile platforms 101, 102, 103. For example, the cloud 105 may be a background management server, a server cluster, an Internet-of-Vehicles platform, or the like.
It should be noted that the numbers of mobile platforms, networks, and cloud servers are merely illustrative. There may be any number of mobile platforms, networks, and clouds, as required by the implementation.
Fig. 2 schematically illustrates an application scenario diagram for identifying traffic signs according to an embodiment of the present application.
Referring to FIG. 2, the figure is a partial picture from an image captured by a photographing device (which may be installed at a fixed position or on a mobile platform). In the related art, clear road markings in the image can be recognized fairly well. However, if vehicle marks or other obstructions lie on the road marking, or part of its area is occluded, the accuracy of the recognition result cannot meet application requirements. For example, the turn marking at the position of circle 1 in FIG. 2 may be judged as two separate parts due to ruts or the like and therefore cannot be recognized correctly as a single turn marking. The marking at the position of circle 2 in FIG. 2 lies on a colored road surface whose color (gray value) is similar to that of the marking, with tire ruts superimposed on it, so recognizing the correct marking is even harder. The position of circle 3 in FIG. 2 is also very difficult: in addition to the difficulties at circle 2, part of the marking is occluded by a moving vehicle, so the traffic sign image in the picture is incomplete, which further increases the recognition difficulty.
In the embodiments of the present application, on the basis of the feature map extracted by a recognition model of the related art, a further information processing step is applied to the image to be identified and/or the feature map: part of the information is actively discarded to obtain an adjusted feature map, which simulates scenes, such as those in FIG. 2, in which the traffic sign image is partially missing or partially of poor quality in complex applications. The recognition result obtained from the concatenated feature map and adjusted feature map is therefore more accurate.
Fig. 3 schematically shows a flow chart of a method of identifying traffic signs according to an embodiment of the present application.
Referring to fig. 3, the embodiment provides a method for identifying traffic signs, which includes operations S310 to S320, as follows.
In operation S310, an image to be recognized is obtained, the image to be recognized including at least one traffic sign image.
In this embodiment, the image to be recognized may be one frame of a video captured by a photographing device provided on a mobile platform. The mobile platform includes, but is not limited to, a vehicle, a robot, a vessel, or an aircraft. For example, the image to be recognized may be captured by a photographing device mounted on a vehicle, such as a driving recorder.
The photographing device may be a monocular camera. A binocular photographing device or the like may also be used; in that case, the two captured images to be identified are fused, and traffic sign recognition is performed on the stitched image.
The image to be identified can include, but is not limited to, traffic sign images, such as turn signs, U-turn signs, and guideboard signs. The image to be identified may or may not include at least part of an image of the mobile platform itself. In addition, it may include various man-made and natural objects, such as buildings, vehicles, pedestrians, and trees.
In operation S320, the image to be recognized is processed using the trained traffic sign recognition model, resulting in a traffic sign.
In this embodiment, the traffic sign recognition model may be a pre-trained model capable of determining whether the input image to be identified includes a traffic sign image, or of segmenting the traffic sign image from the image to be identified. The traffic sign recognition model may be any of various types of neural networks.
Fig. 4 schematically shows a topology diagram of a traffic sign recognition model according to an embodiment of the present application. Referring to FIG. 4, the traffic sign recognition model includes a feature map extraction module and a fusion classification module.
The feature map extraction module is configured to extract a feature map from the image to be identified. In addition, it may be configured to process the image to be identified and/or at least part of the feature map to obtain an adjusted feature map in which part of the information is missing. For example, image information of a partial region of the image to be identified may be discarded according to a preset rule, preset algorithm, or the like; likewise, at least part of the feature information in the feature map may be discarded according to a preset rule, preset algorithm, or the like.
Fig. 5 schematically shows a structural schematic diagram of a feature map extraction module according to an embodiment of the present application.
Referring to fig. 5, the feature map extracting module may include a convolutional neural network, including an input layer and at least two convolutional layers sequentially connected in series, for performing a convolution operation on an image to be identified to obtain a feature map. Wherein the convolutional neural network may include at least one convolutional pair, each of which may include a pair of convolutional layers and a pooling layer.
In addition, the feature map extraction module may further include an image processing unit. The image processing unit comprises a plurality of processing subunits, respectively connected to the input layer or to a convolutional layer, configured to process the image to be identified and/or the feature maps output by at least some of the convolutional layers to obtain an image to be identified and/or feature maps in which part of the information is missing. For example, each convolutional layer may have a uniquely corresponding processing subunit that processes the feature map output by that layer to obtain a feature map with missing partial information.
It should be noted that the feature map extraction module may also have a more complex network structure. For example, the feature map extraction module may employ an encoding module whose coding blocks output feature maps of at least two different dimensions. The encoding module may comprise a plurality of cascaded coding blocks; starting from the coding block that receives the image to be identified, the dimension of the features extracted by each coding block increases from lower layers to higher layers.
A higher-layer coding block has a larger receptive field and strong feature characterization capability, but its feature map has low resolution and weak characterization of geometric information (spatial geometric detail is lacking). A lower-layer coding block has a smaller receptive field and strong characterization of geometric detail; although its resolution is high, its semantic characterization capability is weak. High-level feature maps help to accurately identify or segment the target. Therefore, fusing at least some features of different dimensions in deep learning can effectively improve recognition and segmentation.
Referring to FIG. 4, the fusion classification module is configured to fuse the feature map and the adjusted feature map, and to determine the traffic sign based on the fused result. For example, the fusion classification module may perform a fusion operation on feature maps of different channels and classify based on the fused feature maps. Fusion operations include, but are not limited to, addition (add) and concatenation (concat). Add fusion is equivalent to fusing the information of corresponding channels, whereas concat fusion combines the information of all channels together (through subsequent convolution kernels). The computation cost of add is lower than that of concat.
For example, the high-dimensional feature map may be restored step by step to the same size as the image to be identified by means of upsampling, where the segmented traffic sign and the identification result may be included.
In one particular embodiment, semantic information is extracted from a video frame by the convolutional layers. In particular, the convolutional neural network may include a plurality of convolution pairs and pooling layers disposed between some adjacent convolution pairs. For example, each convolution pair includes an adjacently disposed convolutional layer and activation layer; the convolution kernel size of the convolutional layer may be 3, the feature map padding width may be 1, and the stride may be 1. In addition, a normalization layer may be included between the convolutional layer and the activation layer to normalize the extracted features. The convolution kernel size of the pooling layer may be 2, its padding width 0, and its stride 2.
In addition, convolutional neural networks may also include more or fewer pooling layers. The video frames can be subjected to feature extraction through the convolution pairs and the pooling layer, and feature images can be output. For example, convolutional layer parameters: kernel size=3, padding=1, stride=1. Pooling layer parameters: kernel size=2, padding=0, stride=2.
Where padding=1 makes the resolution of the video frame (x+2) × (y+2), and after convolution with a convolution kernel of 3×3 size, the resolution of the output matrix is x×y. The above-described convolutional layer parameter setting makes the input image and the output matrix of the convolutional layer the same size.
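The effect of these parameter settings can be illustrated with a short sketch. The layer names, channel counts, and input size below are assumptions for illustration only, not the patent's actual implementation:

```python
# A minimal sketch (assumed PyTorch layers and shapes) showing that the convolution
# parameters described above (kernel_size=3, padding=1, stride=1) preserve spatial
# resolution, while the pooling parameters (kernel_size=2, padding=0, stride=2) halve it.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 480, 800)  # one RGB frame, H x W = 480 x 800 (assumed input size)

conv_pair = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, stride=1),  # keeps the spatial size
    nn.BatchNorm2d(16),                                     # optional normalization layer
    nn.ReLU(inplace=True),                                  # activation layer
)
pool = nn.MaxPool2d(kernel_size=2, padding=0, stride=2)     # halves H and W

feat = conv_pair(x)
print(feat.shape)        # torch.Size([1, 16, 480, 800]) -> resolution preserved
print(pool(feat).shape)  # torch.Size([1, 16, 240, 400]) -> resolution halved
```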
Pooling replaces the network's output at a location with an overall statistic of the neighboring outputs, which has the advantage that most pooled outputs remain unchanged when the input data undergoes small shifts. Pooling also compresses the picture: the larger the picture, the slower the processing and the greater the recognition difficulty, and pooling reduces the picture size while preserving most of the important information and lowering the dimensions of the individual feature maps. For example, when identifying whether an image includes a turn marking, if a polyline and a triangle at one end of the polyline are detected in the image to be identified, the exact location of the turn marking need not be known, and it is sufficient to pool the pixels of a region to obtain an overall statistical feature. Because the feature map becomes smaller after pooling, if it is followed by a fully connected layer, the number of neurons can be effectively reduced, saving storage space and improving computational efficiency.
At present, the main pooling methods are max pooling, average pooling, and sum pooling. Max pooling keeps the largest of the four pixels in a 2×2 grid and discards the other three. Pooling, i.e., spatial pooling, reduces dimensionality by aggregating statistics of different features while helping to avoid overfitting. Average pooling computes the mean of an image region and uses it as the pooled value of that region, whereas max pooling selects the maximum value of the region and uses it as the pooled value. In other words, a spatial neighborhood is defined, and the largest element, or the average, is taken from the rectified feature map.
The pooling operation gradually reduces the spatial scale of the input representation, lowers the feature dimension, and makes the number of parameters and computations in the network more controllable. It makes the network invariant to small variations, redundancies, and transformations in the input image, helping to obtain scale invariance of the image to the greatest extent.
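As a toy illustration of the pooling methods described above (the pixel values are assumed for illustration):

```python
# Max pooling keeps one pixel of each 2x2 grid and discards the other three;
# average pooling keeps the regional mean instead.
import numpy as np

patch = np.array([[1, 3],
                  [2, 8]])        # one 2x2 pixel grid

print(patch.max())    # 8    -> value kept by max pooling
print(patch.mean())   # 3.5  -> value kept by average pooling
```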
In some embodiments, the fusion classification module may include: a first feature fusion unit and a second feature fusion unit.
The first feature fusion unit is used for fusing the images to be identified and/or the feature images with partial information missing to obtain an adjustment feature image.
The second feature fusion unit is used for splicing the feature images and adjusting the feature images so as to determine traffic signs based on the spliced feature images and the adjustment feature images.
Fig. 6 schematically shows a schematic structural diagram of a traffic sign recognition model according to an embodiment of the present application.
Referring to FIG. 6, the first feature fusion unit is connected to the respective processing subunits of the image processing unit. After a processing subunit actively discards part of the information of the image in its channel, its output is passed to the first feature fusion unit. The first feature fusion unit fuses the feature maps output by the processing subunits and passes the fused result to the second feature fusion unit.
To reduce computation, the first feature fusion unit may use add fusion. To prevent the feature map output by the convolutional layers from overwriting the missing information in the adjusted feature map output by the first feature fusion unit, the second feature fusion unit may use concat fusion. In addition, concat fusion in the second feature fusion unit helps increase the number of channels, providing feature maps of more dimensions and improving the recognition accuracy of traffic signs.
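The difference between the two fusion modes can be sketched as follows; the tensor shapes are assumptions for illustration, not the model's actual dimensions:

```python
# add fusion keeps the channel count, concat fusion stacks the channels of both inputs.
import torch

feature_map = torch.randn(1, 64, 120, 200)    # feature map from the convolution branch (assumed shape)
adjusted_map = torch.randn(1, 64, 120, 200)   # adjusted feature map with information actively discarded

added = feature_map + adjusted_map                              # add: still 64 channels
concatenated = torch.cat([feature_map, adjusted_map], dim=1)    # concat: 128 channels

print(added.shape)         # torch.Size([1, 64, 120, 200])
print(concatenated.shape)  # torch.Size([1, 128, 120, 200])
```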
An exemplary implementation of actively discarding partial information of the image to be processed and/or of the feature map is described below.
In some embodiments, the active discarding of partial information may be implemented by a specific convolution algorithm. Specifically, the plurality of processing subunits each correspond to a convolution kernel having at least one element value of zero. Wherein the convolution kernel may employ a gaussian-like operator. For example, elements of the convolution kernel other than zero conform to a gaussian distribution.
Fig. 7 schematically shows an adjusted feature map according to an embodiment of the present application. In this embodiment, a convolution operation is performed on the image to be identified and/or the feature map using a specific convolution kernel in which a certain weight element is set to 0, so that missing image information at the position corresponding to the zero-weight element can be simulated.
Referring to FIG. 7, the case where the image to be identified is the processing object of a processing subunit is described as an example. The image to be identified contains traffic markings, such as lane lines, used to guide traffic order. The lane line at the position of the convolution kernel in FIG. 7 is missing (the area indicated by the thin broken line on the lane line), which easily leads to failure to recognize the lane line correctly. In this embodiment, a Gaussian-like operator is used for the convolution operation; the Gaussian-like operator is obtained from a processed Gaussian template, for example by normalizing the template. The weight at the lower-right corner of the processed Gaussian template is 0. Therefore, when model training is performed with the corresponding Gaussian-like operator, part of the lane line region information can be actively discarded to simulate real scenes in which part of the lane line is missing. When the trained traffic sign recognition model processes an image with a partially missing lane line in the real environment, it can cope with the scene better and obtain a more accurate recognition result.
The processed Gaussian template can be designed according to a preset information-discarding proportion. For example, if the confidence is preset to 0.9, i.e., about 90% of the information is retained, a 3×3 convolution kernel may be employed with one of its weights set to 0.
In addition, referring to FIG. 6, since there are multiple processing subunits, the Gaussian-like operators they use may be the same or different. For example, all processing subunits in FIG. 6 may use a Gaussian-like operator whose lower-left element has a weight of 0. Alternatively, in FIG. 6, counting from the left, the first processing subunit may use a Gaussian-like operator whose lower-left element is 0, while the second processing subunit uses one whose center element is 0.
It should be noted that, by implementing the active discarding of part of the information through a convolution operation based on the Gaussian-like operator, the processing subunit also removes at least part of the noise information in this step, which further helps improve the quality of the adjusted feature map.
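A minimal sketch of such a Gaussian-like operator is given below. The kernel values, the choice of which element is zeroed, and the single-channel input are assumptions for illustration, not the patent's exact implementation:

```python
# Build a 3x3 Gaussian template, set one weight to 0 (actively discarding roughly 1/9 of
# the local information, i.e., a confidence of about 0.9), renormalize, and apply it as
# a convolution to a feature map.
import torch
import torch.nn.functional as F

def gaussian_like_kernel(drop_row=2, drop_col=2, sigma=1.0):
    coords = torch.arange(3, dtype=torch.float32) - 1.0
    yy, xx = torch.meshgrid(coords, coords, indexing="ij")
    kernel = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))  # 3x3 Gaussian template
    kernel[drop_row, drop_col] = 0.0                              # discard one position
    return (kernel / kernel.sum()).view(1, 1, 3, 3)               # renormalize the rest

feature_map = torch.randn(1, 1, 120, 200)                         # one channel of a feature map
adjusted = F.conv2d(feature_map, gaussian_like_kernel(), padding=1)
print(adjusted.shape)   # torch.Size([1, 1, 120, 200])
```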
Furthermore, the fusion classification module may also include a decoder portion that performs upsampling. For example, if the input image of the encoder is 480×800, each downsampling layer doubles the number of channels while halving the height and width of the image; the upsampling operation is the inverse of downsampling.
In particular, the task of the decoder is to semantically map discriminable features learned by the encoder (feature maps, which have a lower resolution) to pixel space (higher resolution) to obtain dense classification.
The decoder may have a complex or a simple structure. A simple decoder may, for example, consist of a classification head (e.g., an MLP) followed by an activation layer (e.g., softmax). The convolutional neural network and the image processing unit described above act as the encoder that extracts features; the decoder decodes the feature map into the desired segmentation result and/or classification result.
For example, the decoder may upsample its lower-resolution input feature map using the pooling indices computed in the max pooling step of the corresponding encoder layer (non-linear upsampling). This approach eliminates the need to learn upsampling. Since the upsampled feature map is sparse, a convolution with trainable kernels is then applied to generate a dense feature map.
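This pooling-index upsampling can be sketched as follows (shapes are assumptions for illustration):

```python
# Max pooling records the index of each kept element; max unpooling places values back
# at those indices, so no upsampling weights need to be learned. The result is sparse
# and would then be densified by a trainable convolution.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

encoder_feat = torch.randn(1, 16, 120, 200)
pooled, indices = pool(encoder_feat)     # downsampled map plus pooling indices
upsampled = unpool(pooled, indices)      # sparse map at the original resolution

print(pooled.shape)     # torch.Size([1, 16, 60, 100])
print(upsampled.shape)  # torch.Size([1, 16, 120, 200])
```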
For example, the fusion classification module may consist of a decoding network following the second feature fusion unit, with a classification layer at the end. The decoding network may be composed of a plurality of convolutional layers, each decoder layer corresponding to an encoder layer. A multi-class softmax classifier follows the decoder output to generate class probabilities for each pixel.
For example, the encoding network may use the same convolutional layers as the first 13 layers of VGG16, with the weights obtained by training VGG16 on a large dataset used as its initial weights; the decoding network may correspondingly be composed of 13 convolutional layers so as to restore the highest-level feature maps output by the encoder to high resolution.
In this embodiment, the confidence (information retention rate) and the Gaussian-like operator are used to extract adjusted feature maps from the image to be identified and from the feature maps, fitting scenes in which traffic sign image information is missing and effectively improving recognition in scenes where part of the traffic sign image is missing. In addition, a suitable confidence avoids recognition errors caused by discarding too much information.
In some embodiments, since the image to be identified and the feature map may differ in size, the feature map needs to be processed so that its size matches that of the image to be identified before fusion. Specifically, the feature map may be interpolated so that the image to be identified and the feature map have the same size for feature fusion. For example, linear interpolation, bilinear interpolation, or nearest-neighbor interpolation may be employed.
For example, the image processing unit may also include an interpolation subunit, configured to perform bilinear interpolation on the feature maps with missing partial information to obtain an image to be identified and/or feature maps of the same size. Correspondingly, the first feature fusion unit is specifically configured to fuse the image to be identified and/or feature maps of the same size to obtain the adjusted feature map.
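A minimal sketch of this resizing step is given below; the sizes and the use of PyTorch's built-in bilinear interpolation are assumptions for illustration:

```python
# Resize a feature map with partially discarded information to the size of the image
# to be identified before fusion.
import torch
import torch.nn.functional as F

image = torch.randn(1, 3, 480, 800)        # image to be identified (assumed size)
small_feat = torch.randn(1, 3, 120, 200)   # feature map with part of the information missing

resized = F.interpolate(small_feat, size=image.shape[-2:], mode="bilinear", align_corners=False)
print(resized.shape)   # torch.Size([1, 3, 480, 800]) -> same spatial size as the image
```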
An exemplary description is given below of a bilinear interpolation process. Fig. 8 schematically illustrates a schematic diagram of a bilinear interpolation calculation process according to an embodiment of the present application.
The calculation of bilinear interpolation is similar to that of the nearest-neighbor method, except that instead of finding the single closest point according to the correspondence, the four closest points are used. Referring to FIG. 8, bilinear interpolation is computed as three single-axis linear interpolations (twice along the x-axis, once along the y-axis): as shown in FIG. 8, two linear interpolations in the x-direction yield the temporary points R1(x, y1) and R2(x, y2), and one linear interpolation in the y-direction then yields P(x, y) (exchanging the order of the two axes, interpolating along y first and then along x, gives the same result).
The weight of each corner value is related to the distance between the point to be interpolated and the corners; for example, the weight of f(Q11) depends on the coordinates of the diagonally opposite corner Q22, and the weight of f(Q12) depends on the coordinates of Q21.
In this embodiment, bilinear interpolation makes the sizes of the feature maps and the image to be identified consistent without degrading the quality of the feature maps, which facilitates feature fusion.
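The three single-axis interpolations described above can be written out directly; the corner coordinates and values below are assumed example numbers:

```python
# Bilinear interpolation at (x, y) from the four corner values q11=f(x1,y1), q21=f(x2,y1),
# q12=f(x1,y2), q22=f(x2,y2): two interpolations along x give R1 and R2, one along y gives P.
def bilinear(x, y, q11, q21, q12, q22, x1, x2, y1, y2):
    r1 = q11 * (x2 - x) / (x2 - x1) + q21 * (x - x1) / (x2 - x1)   # R1 = (x, y1)
    r2 = q12 * (x2 - x) / (x2 - x1) + q22 * (x - x1) / (x2 - x1)   # R2 = (x, y2)
    return r1 * (y2 - y) / (y2 - y1) + r2 * (y - y1) / (y2 - y1)   # P  = (x, y)

# example: the center of a unit cell with corner values 10, 20, 30, 40
print(bilinear(0.5, 0.5, 10, 20, 30, 40, 0, 1, 0, 1))   # 25.0
```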
In some embodiments, object edge information in the image may further be identified and, as an edge feature, fused with the feature map and the adjusted feature map of the image to be identified, thereby enhancing the recognition of traffic signs. Referring to FIG. 2, if a good edge recognition algorithm is applied to the turn marking in circle 1, the edge of the marking can be recognized accurately, which helps improve the recognition accuracy of the traffic sign.
Specifically, the edge feature can be extracted by finding the position in an image where the gray-scale intensity variation is strongest. The direction in which the gray scale intensity changes most strongly is referred to as the gradient direction.
In some embodiments, the feature map extracting module may include: an intensity gradient determination unit, a candidate pixel obtaining unit, and an edge feature determination unit.
The intensity gradient determining unit is used for determining gradient amplitude and gradient direction of each pixel in the gray level image of the image to be identified.
The candidate pixel obtaining unit is used for obtaining one or a plurality of (Top N) candidate pixels with highest gradient amplitude along the gradient direction.
The edge feature determining unit is used for taking the first type candidate pixels as edge features and deleting the second type candidate pixels, wherein the gradient amplitude of the first type candidate pixels is larger than or equal to an upper limit threshold value, the gradient amplitude of the second type candidate pixels is smaller than a lower limit threshold value, and the upper limit threshold value is larger than the lower limit threshold value.
In a specific embodiment, the gradient of each pixel in the image may be obtained with an operator (convolution kernel), such as the Laplacian operator or the Sobel operator. For example, image and video processing libraries in the computer vision field (e.g., OpenCV) provide encapsulated functions for computing the n-th derivative of each pixel in an image. First, the above convolution kernels are used to obtain the gradients G_x and G_y along the horizontal (x) and vertical (y) directions, respectively. The gradient amplitude of each pixel can then be obtained as G = sqrt(G_x^2 + G_y^2).
In addition, for simplicity of calculation, the sum of absolute values |G_x| + |G_y| may be used instead of the two-norm. Each pixel in the gray-scale map is replaced by G; in the resulting map, a larger gradient value G appears where the pixel brightness changes drastically (at the edges).
However, the edges in this map may be very thick, making it difficult to pinpoint their true position. To solve this problem, more accurate edge information is determined from the gradient direction information and the gradient value G. Specifically, only the maximum of the gradient intensity at each position is retained, while the other values are deleted. For example, the gradient intensity of each neighboring pixel along the gradient direction of a given pixel may be compared with that of the given pixel, and the Top n (e.g., Top 1) pixels by gradient intensity are kept as candidate pixels; the remaining pixels may be deleted, for example by setting their values to zero.
After this processing, some noise points may still exist in the image. They can be removed with a double-threshold scheme while avoiding the erroneous deletion of edge pixels. Specifically, an upper threshold and a lower threshold are set: a pixel whose gradient intensity is greater than the upper threshold is considered definitely an edge, and a pixel whose gradient intensity is less than the lower threshold is considered definitely not an edge.
In some embodiments, the gradient intensity of some pixels lies between the upper and lower thresholds, and such pixels may include some edge pixels. The present application judges whether these pixels are edge pixels by checking whether the line they form is connected to a line formed by strong-edge pixels.
Specifically, the feature map extraction module may further include a weak edge determination unit. The weak edge determining unit is used for taking a third type of candidate pixels as edge characteristics, wherein the gradient amplitude of the third type of candidate pixels is larger than or equal to a lower limit threshold value, the gradient amplitude of the third type of candidate pixels is smaller than an upper limit threshold value, and a line formed by the third type of candidate pixels is connected with a line formed by the first type of candidate pixels.
In this embodiment, edge information in the image can be extracted effectively. Fusing the edge information with the feature map and the adjusted feature map is equivalent to redrawing the edges in the feature map, making them more prominent and improving the recognition of traffic signs.
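The pipeline described above (Sobel gradients, keeping only gradient maxima, double thresholds with weak-edge linking) matches the classic Canny edge detector, so a minimal sketch can rely on OpenCV; the file name and threshold values are assumptions for illustration, and the fusion of the edge features with the feature maps is not shown:

```python
# Compute Sobel gradients and the gradient magnitude, then obtain edge features with
# non-maximum suppression, double thresholds, and hysteresis via cv2.Canny.
import cv2
import numpy as np

gray = cv2.imread("road_frame.jpg", cv2.IMREAD_GRAYSCALE)   # gray-scale image to be identified

gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)             # gradient along x
gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)             # gradient along y
magnitude = np.sqrt(gx ** 2 + gy ** 2)                      # G = sqrt(Gx^2 + Gy^2)

edges = cv2.Canny(gray, threshold1=50, threshold2=150)      # lower / upper thresholds
```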
In some embodiments, the traffic sign recognition model may be trained as follows. First, traffic sign image data and annotation data are associated to generate sample data. The sample data are then randomly grouped into training data and test data. After the traffic sign recognition model is trained with the training data, the test data are processed with the trained model to obtain a test result. The test result is then compared with the annotation data of the test data to determine the accuracy of the recognition result output by the model. For example, a recognition model with sufficiently high prediction accuracy can be trained via the back-propagation algorithm.
Dividing the annotated sample images into a training sample set and a test sample set according to a preset ratio can effectively improve the training effect: the training sample set is used for model training, and the test sample set is used to verify that the trained model achieves the expected recognition performance.
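A minimal sketch of associating images with annotations and randomly splitting the samples is given below; the data structures, ratio, and seed are assumptions for illustration:

```python
# Pair each traffic-sign image with its annotation, shuffle, and split into training
# and test sets according to a preset ratio.
import random

def build_samples(image_paths, annotations):
    # annotations: mapping from image path to its label (e.g., sign position and type)
    return [(path, annotations[path]) for path in image_paths if path in annotations]

def random_split(samples, train_ratio=0.8, seed=42):
    samples = samples[:]
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]      # training data, test data

# train_set, test_set = random_split(build_samples(paths, labels))
```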
It should be noted that the technical solution of the present application is well suited to video captured by a photographing device while the vehicle is moving. The vehicle may move at high speed in such a scene, and the position of a traffic sign in the captured video frames changes rapidly, so the traffic sign needs to be determined from the video frames quickly.
In embodiments of the present application, the photographing device may be a monocular camera, a binocular camera, a trinocular camera, or a device with more cameras. A multi-camera device includes a plurality of cameras with different shooting ranges; illustratively, a trinocular device may include a first camera, a second camera, and a third camera.
When training the model, a corresponding traffic sign recognition model may be trained for each camera of the trinocular device. For example, a first traffic sign recognition model for recognizing images acquired by the first camera is trained for the first camera, a second traffic sign recognition model for recognizing images acquired by the second camera is trained for the second camera, and a third traffic sign recognition model for recognizing images acquired by the third camera is trained for the third camera; the weighted result of the three models is taken as the final result. Alternatively, a general traffic sign recognition model may be trained for all cameras of the trinocular device, and this general model recognizes the images acquired by each camera.
In the embodiment of the application, the corresponding sample images can be acquired according to the traffic sign recognition model to be trained. For example, when the model to be trained is the first traffic sign recognition model corresponding to the first camera, the sample images are images acquired by the first camera; when it is the second traffic sign recognition model corresponding to the second camera, the sample images are images acquired by the second camera; and when it is the third traffic sign recognition model corresponding to the third camera, the sample images are images acquired by the third camera. For another example, when the model to be trained is the traffic sign recognition model common to all cameras of the trinocular camera, the sample images are images acquired by all three cameras.
In some embodiments, the acquired sample images may be time-synchronized, i.e., ordered by acquisition time. When the sample images are acquired by the trinocular camera, the images captured by the three cameras at the same moment can be treated as one group, and the groups are then ordered by acquisition time.
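A small sketch of this time synchronization is shown below; the (timestamp, camera_id, frame) record format is an assumed input layout for illustration.

from collections import defaultdict

def group_and_sort(frames):
    # Frames captured at the same moment by different cameras form one group.
    groups = defaultdict(dict)
    for timestamp, camera_id, frame in frames:
        groups[timestamp][camera_id] = frame
    # Order the groups by acquisition time.
    return [groups[t] for t in sorted(groups)]

# ordered_groups = group_and_sort(frame_records)
# ordered_groups[0] -> {"cam1": ..., "cam2": ..., "cam3": ...}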
In some embodiments, in order to use the acquired sample images for model training, the traffic signs in the sample images need to be annotated in advance, for example with the position and extent of each traffic sign and its color information; the specific annotations can be chosen according to the purpose of the model training.
In some embodiments, the traffic sign recognition model may be trained with a back propagation algorithm; common neural network training methods can be used as a reference. During training of the basic model, the position information of the traffic signs and similar annotations can be supplied externally.
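Below is a minimal back-propagation training sketch assuming a PyTorch model whose forward pass returns per-class scores; the network itself, the loss choice, and the optimizer settings are illustrative assumptions.

import torch
from torch import nn

def train(model, loader, epochs=10, lr=1e-3, device="cpu"):
    # loader yields (images, labels), where labels come from the annotations.
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()                    # back propagation
            optimizer.step()
    return model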
It should be noted that the traffic sign recognition model may be trained offline or online, and training may be performed in the cloud. The computing device of the mobile platform can download the model topology and model parameters of the trained traffic sign recognition model from the cloud so that traffic sign recognition runs locally on the mobile platform. Alternatively, the mobile platform can send the video stream to the cloud, the cloud processes the images to be identified with the trained model to obtain the traffic sign information, and the cloud then sends (or broadcasts) the traffic sign position information to the mobile platform.
In one embodiment for identifying traffic signs (see fig. 6), a confidence level of 0.9 is first set (90% of the information is retained). A 3×3 Gaussian-like operator is then used; each time information is extracted with this operator, one position of the kernel is randomly set to zero (imitating a scene in which image information is missing). Next, feature maps are extracted with the 3×3 Gaussian operator at layer 1 (the input layer), layer 2 (a convolution layer), and layer 3 (a convolution layer), yielding 3 feature maps. These 3 feature maps are added (add), and the summed result is then channel-concatenated (concat) with the feature map output by the network's convolution.
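The toy sketch below reconstructs this flow with single-channel feature maps purely for illustration; the real channel counts, strides, and the confidence-based retention step come from the network of fig. 6, which is not reproduced here.

import torch
import torch.nn.functional as F

def gaussian_kernel_with_dropout():
    # 3x3 Gaussian-like kernel with one randomly chosen element set to zero,
    # imitating missing image information.
    k = torch.tensor([[1., 2., 1.],
                      [2., 4., 2.],
                      [1., 2., 1.]]) / 16.0
    idx = torch.randint(0, 9, (1,)).item()
    k.view(-1)[idx] = 0.0
    return k.view(1, 1, 3, 3)

x = torch.rand(1, 1, 64, 64)                   # layer 1: input image (toy grayscale)
conv1 = torch.nn.Conv2d(1, 1, 3, padding=1)
conv2 = torch.nn.Conv2d(1, 1, 3, padding=1)
f1 = conv1(x)                                  # layer 2 output
f2 = conv2(f1)                                 # layer 3 output

# Extract a feature map with the Gaussian-like operator at each of the 3 layers.
g = [F.conv2d(t, gaussian_kernel_with_dropout(), padding=1) for t in (x, f1, f2)]
adjusted = g[0] + g[1] + g[2]                  # "add"

# Channel-concatenate with the feature map output by the network convolution.
fused = torch.cat([f2, adjusted], dim=1)       # "concat"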
One particular model training process may include the following operations. First, image data collected from road traffic is combined with JSON annotation data (such as the position and type of each marked traffic sign) to generate the required sample data. Data that does not meet the specification is then processed so that only conforming data remains. The resulting data is divided into test data and training data by random grouping, and the two sets are saved separately in an mdb database by the program. The stored mdb data is then read and parsed into 480×800×3 matrices, which are fed into the network for training to obtain a trained model. The trained model is then used for prediction, and the prediction results are compared with the ground-truth picture labels. The comparison shows a clear improvement in recognition accuracy, after which the test is extended to a larger range of real-world data.
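A hedged sketch of one possible way to persist and reload such samples is shown below, assuming that the "mdb database" refers to LMDB and that images are stored as raw uint8 bytes; the key names and layout are purely illustrative.

import lmdb
import numpy as np

H, W, C = 480, 800, 3

def save_samples(path, samples):
    # samples: iterable of (key, image) pairs, image shaped (480, 800, 3).
    env = lmdb.open(path, map_size=1 << 30)
    with env.begin(write=True) as txn:
        for key, image in samples:
            txn.put(key.encode(), np.ascontiguousarray(image, dtype=np.uint8).tobytes())
    env.close()

def load_sample(path, key):
    # Parse the stored bytes back into a 480x800x3 matrix for the network.
    env = lmdb.open(path, readonly=True, lock=False)
    with env.begin() as txn:
        buf = txn.get(key.encode())
    env.close()
    return np.frombuffer(buf, dtype=np.uint8).reshape(H, W, C)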
When the confidence level and the Gaussian-like operator are used to extract image features, the embodiment of the application actively discards part of the information of the image to be identified and/or part of the feature map, simulating scenes in which the traffic sign is incomplete. In this way, the adjustment feature map extracted by the trained traffic sign recognition model copes better with scenes where the traffic sign is damaged or part of the image is of poor quality, and the model is therefore suitable for a variety of complex scenes. In addition, a neural network of shallower depth can be used, which helps to reduce network complexity, improve response speed, and lower the consumption of computing resources.
Another aspect of the present application also provides an apparatus for identifying traffic signs.
Fig. 9 schematically shows a block diagram of an apparatus for identifying traffic signs according to an embodiment of the application.
Referring to fig. 9, the apparatus 900 for identifying traffic signs may include: an image acquisition module 910 and an image recognition module 920.
The image acquisition module 910 is configured to obtain an image to be identified.
The image recognition module 920 is configured to process the image to be recognized using the trained traffic sign recognition model to obtain a traffic sign. The traffic sign recognition model may include a feature map extraction module and a fusion classification module.
The feature map extracting module is used for extracting a feature map from the image to be identified, and processing the image to be identified and/or at least part of the feature map to obtain an adjustment feature map with part of information missing.
The fusion classification module is used for fusing the feature images and the adjustment feature images and determining traffic signs based on the fused feature images and the adjustment feature images.
In some embodiments, the feature map extraction module comprises: a convolutional neural network and an image processing unit.
The convolutional neural network comprises an input layer and at least two convolutional layers which are sequentially connected in series, and the convolutional neural network is used for carrying out convolutional operation on an image to be identified to obtain a feature map.
The image processing unit comprises a plurality of processing subunits which are respectively connected with the input layer or the convolution layer and are used for processing the images to be identified and/or the characteristic images output by at least part of the convolution layer to obtain the images to be identified and/or the characteristic images with partial information missing.
In some embodiments, the plurality of processing subunits each have a convolution kernel corresponding thereto, at least one element value of the convolution kernel being zero.
In some embodiments, elements of the convolution kernel other than zero conform to a gaussian distribution.
The specific manner in which the respective modules and units perform the operations in the apparatus 900 in the above embodiment has been described in detail in the embodiment related to the method, and will not be described in detail here.
Another aspect of the present application also provides an electronic device.
Fig. 10 schematically shows a block diagram of an electronic device according to an embodiment of the application.
Referring to fig. 10, an electronic device 1000 includes a memory 1010 and a processor 1020.
The processor 1020 may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 1020 or other modules of the computer. The persistent storage may be a readable and writable storage device, i.e., a non-volatile device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a diskette or an optical drive). The system memory may be a readable and writable memory device or a volatile readable and writable memory device, such as dynamic random access memory, and may store instructions and data needed by some or all of the processors at runtime. Furthermore, memory 1010 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic disks, and/or optical disks. In some implementations, memory 1010 may include a readable and/or writable removable storage device, such as a compact disc (CD), a digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., SD card, mini SD card, micro-SD card, etc.), a magnetic floppy disk, and the like. The computer-readable storage medium does not contain carrier waves or transient electronic signals transmitted wirelessly or by wire.
The memory 1010 has stored thereon executable code that, when processed by the processor 1020, can cause the processor 1020 to perform some or all of the methods described above.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing part or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a computer-readable storage medium (or non-transitory machine-readable storage medium or machine-readable storage medium) having stored thereon executable code (or a computer program or computer instruction code) which, when executed by a processor of an electronic device (or a server, etc.), causes the processor to perform part or all of the steps of the above-described methods according to the present application.
The embodiments of the present application have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method of identifying traffic signs, comprising:
obtaining an image to be identified;
processing the image to be identified by using the trained traffic sign identification model to obtain a traffic sign;
wherein the traffic sign recognition model comprises:
a feature map extracting module for extracting a feature map from the image to be identified, processing the image to be identified and/or at least part of the feature map to obtain an adjustment feature map with partial information missing,
and the fusion classification module is used for fusing the characteristic map and the adjustment characteristic map and determining the traffic sign based on the fused characteristic map and adjustment characteristic map.
2. The method of claim 1, wherein the feature map extraction module comprises:
the convolutional neural network comprises an input layer and at least two convolutional layers which are sequentially connected in series, and is used for carrying out convolutional operation on the image to be identified to obtain the feature map;
and the image processing unit comprises a plurality of processing subunits which are respectively connected with the input layer and/or at least part of the convolution layers and are used for processing the images to be identified and/or the characteristic images output by at least part of the convolution layers to obtain the images to be identified and/or the characteristic images with partial information missing.
3. The method of claim 2, wherein the fusion classification module comprises:
the first feature fusion unit is used for fusing the images to be identified and/or the feature images with the partial information missing to obtain the adjustment feature images;
and the second feature fusion unit is used for splicing the feature map and the adjustment feature map so as to determine the traffic sign based on the spliced feature map and the adjustment feature map.
4. The method of claim 2, wherein a plurality of the processing subunits each have a convolution kernel, at least a portion of the convolution kernels having at least one element value of zero.
5. The method of claim 4, wherein elements of the convolution kernel other than zero conform to a gaussian distribution.
6. A method according to claim 3, wherein the image processing unit further comprises:
the interpolation subunit is used for carrying out bilinear interpolation on the partial information-missing feature images to obtain images to be identified and feature images with the same size;
the first feature fusion unit is specifically configured to fuse the image to be identified and the feature map with the same size, and obtain the adjustment feature map.
7. The method according to any one of claims 1 to 6, further comprising:
associating the traffic sign image data with the annotation data to generate sample data;
randomly grouping the sample data to obtain training data and test data;
after the traffic sign recognition model is trained by the training data, the test data is processed by the trained traffic sign recognition model, and a test result is obtained;
and comparing the test result with the annotation data of the test data to determine the accuracy of the recognition results output by the traffic sign recognition model.
8. An apparatus for identifying traffic signs, comprising:
the image acquisition module is used for acquiring an image to be identified;
the image recognition module is used for processing the image to be recognized by utilizing the trained traffic sign recognition model to obtain a traffic sign; wherein the traffic sign recognition model comprises:
a feature map extracting module for extracting a feature map from the image to be identified, processing the image to be identified and/or at least part of the feature map to obtain an adjustment feature map with partial information missing,
and the fusion classification module is used for fusing the characteristic map and the adjustment characteristic map and determining the traffic sign based on the fused characteristic map and adjustment characteristic map.
9. An electronic device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method according to any of claims 1-8.
10. A computer storage medium, characterized in that executable code is stored, which when executed performs the method according to any of claims 1-8.
CN202310271198.6A 2023-03-20 2023-03-20 Method and device for identifying traffic sign and electronic equipment Pending CN116259040A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310271198.6A CN116259040A (en) 2023-03-20 2023-03-20 Method and device for identifying traffic sign and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310271198.6A CN116259040A (en) 2023-03-20 2023-03-20 Method and device for identifying traffic sign and electronic equipment

Publications (1)

Publication Number Publication Date
CN116259040A true CN116259040A (en) 2023-06-13

Family

ID=86679312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310271198.6A Pending CN116259040A (en) 2023-03-20 2023-03-20 Method and device for identifying traffic sign and electronic equipment

Country Status (1)

Country Link
CN (1) CN116259040A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524725A (en) * 2023-07-03 2023-08-01 钧捷科技(北京)有限公司 Intelligent driving traffic sign image data identification system
CN116524725B (en) * 2023-07-03 2023-09-01 钧捷科技(北京)有限公司 Intelligent driving traffic sign image data identification system

Similar Documents

Publication Publication Date Title
JP7430277B2 (en) Obstacle detection method and apparatus, computer device, and computer program
CN112912920B (en) Point cloud data conversion method and system for 2D convolutional neural network
US11348270B2 (en) Method for stereo matching using end-to-end convolutional neural network
CN111222395A (en) Target detection method and device and electronic equipment
CN112529015A (en) Three-dimensional point cloud processing method, device and equipment based on geometric unwrapping
CN111091023B (en) Vehicle detection method and device and electronic equipment
CN111027581A (en) 3D target detection method and system based on learnable codes
CN114419570A (en) Point cloud data identification method and device, electronic equipment and storage medium
CN112257668A (en) Main and auxiliary road judging method and device, electronic equipment and storage medium
CN116259040A (en) Method and device for identifying traffic sign and electronic equipment
Ji et al. An evaluation of conventional and deep learning‐based image‐matching methods on diverse datasets
CN112070077B (en) Deep learning-based food identification method and device
CN114972492A (en) Position and pose determination method and device based on aerial view and computer storage medium
CN115601730A (en) Method and device for identifying traffic light and electronic equipment
CN116259042A (en) Method and device for detecting circular image parking space based on image attention
Muresan et al. Stereo and mono depth estimation fusion for an improved and fault tolerant 3D reconstruction
CN115358981A (en) Glue defect determining method, device, equipment and storage medium
Meng et al. A GPU-accelerated deep stereo-LiDAR fusion for real-time high-precision dense depth sensing
CN112529917A (en) Three-dimensional target segmentation method, device, equipment and storage medium
Jain et al. Generating Bird’s Eye View from Egocentric RGB Videos
CN116453086A (en) Method and device for identifying traffic sign and electronic equipment
Chougule et al. Agd-net: Attention-guided dense inception u-net for single-image dehazing
CN116052132A (en) Method and device for identifying pavement marker and electronic equipment
CN115063594B (en) Feature extraction method and device based on automatic driving
Fu Application and Analysis of RGB-D Salient Object Detection in Photographic Camera Vision Processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination