CN112488083B - Identification method, device and medium of traffic signal lamp based on key point extraction of heatmap - Google Patents


Info

Publication number
CN112488083B
CN112488083B (application CN202011553049.1A)
Authority
CN
China
Prior art keywords
traffic light
rectangular frame
traffic
output
outsourcing rectangular
Prior art date
Legal status
Active
Application number
CN202011553049.1A
Other languages
Chinese (zh)
Other versions
CN112488083A (en)
Inventor
李万清
李华
刘俊
林永杰
袁友伟
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202011553049.1A priority Critical patent/CN112488083B/en
Publication of CN112488083A publication Critical patent/CN112488083A/en
Application granted granted Critical
Publication of CN112488083B publication Critical patent/CN112488083B/en


Classifications

    • G — PHYSICS
    • G06V20/584 — Recognition of moving objects or obstacles; recognition of traffic objects: of vehicle lights or traffic lights
    • G06N3/045 — Neural networks, architecture: combinations of networks
    • G06N3/047 — Neural networks, architecture: probabilistic or stochastic networks
    • G06N3/08 — Neural networks: learning methods
    • G06V10/462 — Salient features, e.g. scale invariant feature transforms (SIFT)
    • G06V2201/07 — Target detection


Abstract

The invention discloses a method, a device and a medium for identifying traffic signal lamps based on heatmap key point extraction. In the identification method provided by the invention, the input picture is subjected to feature extraction and then deconvolution to increase the resolution, obtaining a downsampled heatmap, so that the receptive field of the network becomes finer and the network is more sensitive even to small targets. Second, the assigned anchor points are simply placed at positions, without a size box and without manually setting a threshold to distinguish foreground from background. Finally, the target is extracted from the heatmap, and no non-maximum suppression (NMS) algorithm is needed at prediction time as in Yolo3, which further reduces the amount of calculation and improves the prediction speed. Prediction results on the same validation set show that Yolo3 achieves an accuracy of 87.67%, while the invention achieves 95.48%.

Description

Identification method, device and medium of traffic signal lamp based on key point extraction of heatmap
Technical Field
The invention belongs to the field of image recognition, and particularly relates to a method for recognizing traffic signal lamps based on heatmap key point extraction.
Background
How to correctly identify traffic signals is a vital link in advanced driving assistance, unmanned driving and intelligent traffic. In an actual urban driving environment, effects such as changing light intensity and the motion blur of vehicles and pedestrians bring new problems to signal lamp detection. Traditional digital image processing techniques mainly detect traffic lights through edge segmentation, histogram transformation and the like, and can perform well when the actual driving environment is good; but once the illumination environment is complex, such as strong light or backlight, their recognition accuracy is low and they lack generalization capability.
With the rapid development of artificial intelligence, deep learning has been widely applied across computer vision and has developed rapidly in picture classification, target positioning, image segmentation, image enhancement and other fields. Target detection algorithms based on deep learning are mainly divided into two-stage and single-stage network algorithms. A two-stage network first generates sparse candidate regions from an image, performs binary classification and preliminary positioning on them, and then sends them into a classification and regression network for further classification and positioning to obtain the final detection result. A single-stage network does not generate candidate regions; it directly classifies and locates anchor boxes (anchors) at fixed positions on the picture.
However, existing deep networks cannot complete traffic signal lamp extraction well. Taking Yolo3 as an example, the Yolo3 target recognition network is a classical single-stage network that typically downsamples the input training pictures 32 times; because traffic lights are usually of relatively small size, the direct consequence is that small targets may be lost and the predicted anchor frames may be wrong.
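The scale problem can be made concrete with a little arithmetic (the 20-pixel light size is an illustrative value, not a figure from the patent): a 512×512 input downsampled 32 times leaves a 16×16 grid, and a 20-pixel traffic light covers less than one grid cell, whereas at a 4-times downsampling it still spans several cells.

```python
# Illustrative arithmetic for the downsampling problem; the 20-px light size
# is an assumed example, not a figure from the patent.
def cells_covered(target_px: int, stride: int) -> float:
    """How many grid cells a target of `target_px` pixels spans after
    downsampling by `stride`."""
    return target_px / stride

grid_side = 512 // 32                  # Yolo3-style 32x head: a 16x16 grid
at_32x = cells_covered(20, 32)         # 0.625 cells -- smaller than one cell
at_4x = cells_covered(20, 4)           # 5.0 cells -- still resolvable
print(grid_side, at_32x, at_4x)
```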
Therefore, how to apply the artificial intelligence technology to accurately extract the traffic signal lamp in the image is a technical problem to be solved urgently at present.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provide a method for identifying traffic signal lamps based on heatmap key point extraction. The method belongs to the anchor-free class of target detection algorithms, and offers high identification accuracy and fast identification speed for small target detection (especially traffic lights).
The specific technical scheme adopted by the invention is as follows:
in a first aspect, the invention provides a method for identifying traffic lights based on heatmap key point extraction, which comprises the following steps:
s1, acquiring an image data set composed of picture samples containing traffic lights, wherein each picture sample is provided with a label, and the label comprises an outsourcing rectangular frame of the traffic lights and a traffic light indication direction category;
s2, constructing a traffic signal lamp identification model by taking a resnet 50 network as a skeleton network, outputting a second feature map after continuous three-layer deconvolution of a first feature map output by the resnet 50 network, and respectively inputting the second feature map into a first output network, a second output network and a third output network; the first output network continuously passes the second feature map through two convolution layers to obtain a hetmap, wherein the hetmap consists of heat maps of all traffic light indication direction categories; the second output network outputs the offset of the center point of each traffic light outsourcing rectangular frame after continuously passing through the two convolution layers; the third output network outputs the height and width of each traffic light outsourcing rectangular frame after continuously passing through the second feature map through the two convolution layers;
s3: training the traffic signal lamp identification model by minimizing a total loss function using the image dataset; total loss function L for single picture sample det The method comprises the following steps:
L det =L koff L offsize L size
wherein: alpha off And alpha size Is a weight coefficient;
$L_k$ is the loss of the center points of the outsourcing rectangular frames:

$L_k = -\frac{1}{N}\sum_{n=1}^{N} L_n$

$L_n$ is the loss for outsourcing rectangular frame n:

$L_n = \begin{cases}\left(1-\hat{Y}_n\right)^{\alpha}\log\hat{Y}_n, & Y_n = 1\\ \left(1-Y_n\right)^{\beta}\hat{Y}_n^{\alpha}\log\left(1-\hat{Y}_n\right), & \text{otherwise}\end{cases}$

wherein: N represents the number of traffic light outsourcing rectangular frames in the picture sample; $\hat{Y}_n$ is the predicted value at the center point position of traffic light outsourcing rectangular frame n in the heat map, as predicted by the first output network, and $Y_n$ is the actual value at that position; $\hat{Y}_n$ and $Y_n$ both take values in $[0,1]$, the value representing the probability that the point is the center point of a traffic light outsourcing rectangular frame (0 means the point is not the center point, 1 means it is); $\alpha$, $\beta$ are hyper-parameters;
$L_{off}$ is the outsourcing rectangular frame center point offset loss:

$L_{off} = \frac{1}{N}\sum_{n=1}^{N}\left|\hat{O}_n - o_n\right|$

wherein: $\hat{O}_n$ is the center point offset estimate of traffic light outsourcing rectangular frame n output by the second output network, and $o_n$ is the actual center point offset of traffic light outsourcing rectangular frame n; $\hat{O}\in\mathbb{R}^{\frac{W}{R}\times\frac{H}{R}\times 2}$, where W and H are the width and height of each picture sample, and R represents the downsampling multiple of the second feature map relative to the picture sample;
$L_{size}$ is the outsourcing rectangular frame size loss:

$L_{size} = \frac{1}{N}\sum_{n=1}^{N}\left|\hat{S}_n - s_n\right|$

wherein: $\hat{S}_n$ is the estimated height and width of traffic light outsourcing rectangular frame n after R-times downsampling, and $s_n$ is the actual height and width of traffic light outsourcing rectangular frame n after R-times downsampling;
s4, inputting a picture to be identified containing traffic signals into the traffic signal identification model, outputting traffic light outsourcing rectangular frame center points of different traffic light indication direction categories by the first output network, outputting offset of each traffic light outsourcing rectangular frame center point by the second output network, and outputting the height and the width of each traffic light outsourcing rectangular frame by the third output network; in the Heatmap, according to the central point, the offset, the height and the width of the output traffic light outsourcing rectangular frame, a calibration frame of the traffic light outsourcing rectangular frame is obtained, and the calibration frame is remapped to the picture to be identified.
Preferably, in S1, the picture sample is a picture containing a traffic light shot by a vehicle recorder during driving of the vehicle, or a picture containing a traffic light shot by an intersection monitoring camera.
Preferably, in S1, the labeling content of each picture sample is $(c, x_1, y_1, x_2, y_2)$, where c represents the traffic light indication direction category, and $(x_1, y_1)$ and $(x_2, y_2)$ respectively represent the upper-left and lower-right corner coordinates of the traffic light outsourcing rectangular frame.
Preferably, in the step S2, parameters of each network layer in the traffic light identification model are as follows:
the input picture size of the resnet_50 network is 512×512×3, and the output first feature map size is 16×16×2048;
the size of the second feature map output after three continuous deconvolutions of the first feature map is 128×128×64;
the size of the heatmap output by the first output network is 128×128×C, where C represents the number of traffic light indication direction category labels; the window size of the maximum pooling layer is 3×3;
the feature size of the center point offsets of the traffic light outsourcing rectangular frames output by the second output network is 128×128×2;
the feature size of the heights and widths of the traffic light outsourcing rectangular frames output by the third output network is 128×128×2.
Preferably, in the step S3, the calculation formula of the actual center point offset of traffic light outsourcing rectangular frame n is:

$o_n = \frac{p}{R} - \left\lfloor\frac{p}{R}\right\rfloor$

wherein: $\frac{p}{R}$ represents the floating point coordinates of the center point p of traffic light outsourcing rectangular frame n in the picture sample after R-times downsampling, and $\left\lfloor\frac{p}{R}\right\rfloor$ represents the integer coordinates of the center point p after R-times downsampling.
Preferably, in the step S4, the calibration frame of a traffic light outsourcing rectangular frame in the heatmap is:

$(x_1, y_1) = \left(\hat{x}+\delta\hat{x}-\frac{\hat{w}}{2},\ \hat{y}+\delta\hat{y}-\frac{\hat{h}}{2}\right)$

$(x_2, y_2) = \left(\hat{x}+\delta\hat{x}+\frac{\hat{w}}{2},\ \hat{y}+\delta\hat{y}+\frac{\hat{h}}{2}\right)$

wherein: $(x_1, y_1)$ and $(x_2, y_2)$ respectively denote the upper-left and lower-right corner coordinates of the calibration frame; $(\hat{x}, \hat{y})$ is the center point of the traffic light outsourcing rectangular frame output by the first output network in the heatmap; $(\delta\hat{x}, \delta\hat{y})$ is the center point offset of that frame output by the second output network; and $(\hat{h}, \hat{w})$ are the height and width of that frame output by the third output network.
Preferably, the weight coefficients $\alpha_{off}$ and $\alpha_{size}$ are set to 1 and 0.1, respectively.
Preferably, the hyper-parameters $\alpha$, $\beta$ are set to 2 and 4, respectively.
In a second aspect, the invention provides a traffic signal lamp identification device based on heatmap key point extraction, comprising a memory and a processor;
the memory is used for storing a computer program;
the processor is configured, when executing the computer program, to implement the method for identifying traffic signal lamps based on heatmap key point extraction according to any one of the first aspect.
In a third aspect, the invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method for identifying traffic signal lamps based on heatmap key point extraction according to any one of the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
according to the identifying method of the traffic signal lamp based on the key points extracted by the hematmap, which is provided by the invention, the input picture is subjected to feature extraction and then deconvolution to increase the resolution ratio so as to obtain the downsampled hematmap, so that the receptive field of the network becomes more accurate and even a small target is more sensitive. Second, the assigned anchor points are simply placed in position, without a size box, and without manually setting a threshold to distinguish between foreground and background. Finally, the target is extracted from the hetmap, and a non-maximum suppression algorithm (NMS) is not needed to be used in prediction like yolo3, so that the calculated amount is further reduced, and the prediction speed is improved. The prediction results on the same validation set show that the Yolo3 accuracy is 87.67% in terms of accuracy, and the invention is 95.48%.
Drawings
FIG. 1 is a flow chart of an identification method of the present invention;
FIG. 2 is a collected traffic light picture;
FIG. 3 is a drawing of a completed picture marked using a marking tool;
FIG. 4 is a training flow chart;
FIG. 5 is a graph of the result of distributing key points onto the feature map in a Gaussian kernel manner using the labeled target boxes;
FIG. 6 is a graph of predicted results in a rainy day environment;
FIG. 7 is a graph of predicted results in a night exposure environment;
fig. 8 is a graph of the prediction results in a good daytime environment.
Detailed Description
The invention is further illustrated and described below with reference to the drawings and detailed description. The technical features of the embodiments of the invention can be combined correspondingly on the premise of no mutual conflict.
As shown in fig. 1, in a preferred embodiment of the present invention, there is provided a method for identifying traffic lights based on heatmap key point extraction, which comprises the following steps:
s1, acquiring an image data set composed of picture samples containing traffic lights, wherein each picture sample is provided with a label, and the label comprises an outsourcing rectangular frame of the traffic lights and a traffic light indication direction category.
The picture sample can be a picture containing a traffic light shot by a vehicle recorder during driving of the vehicle, or a picture containing a traffic light shot by an intersection monitoring camera. The labeling content of each picture sample is $(c, x_1, y_1, x_2, y_2)$, where c represents the traffic light indication direction category, and $(x_1, y_1)$ and $(x_2, y_2)$ respectively represent the upper-left and lower-right corner coordinates of the traffic light outsourcing rectangular frame.
S2, constructing a traffic signal lamp identification model by taking a resnet_50 network as a skeleton network; the first feature map output by the resnet_50 network is passed through three continuous deconvolution layers to output a second feature map, and the second feature map is respectively input into a first output network, a second output network and a third output network. The first output network passes the second feature map through two successive convolution layers to obtain a heatmap, which consists of the heat maps of all traffic light indication direction categories. For the heat map of each traffic light indication direction category, it is judged whether the value at each position is not smaller than the values at the eight surrounding neighbouring positions; if so, the position is taken as a hot spot. Among the hot spots in the heat map of each traffic light indication direction category, those whose value is smaller than a set threshold are screened out, and the remaining hot spots are output as the center points of the traffic light outsourcing rectangular frames in the heat map of that traffic light indication direction category. The second output network passes the second feature map through two successive convolution layers and outputs the offset of the center point of each traffic light outsourcing rectangular frame. The third output network passes the second feature map through two successive convolution layers and outputs the height and width of each traffic light outsourcing rectangular frame.
In this embodiment, parameters of each network layer in the traffic signal identification model are as follows:
the input picture size of the resnet_50 network is 512×512×3, and the output first feature map size is 16×16×2048;
the size of the second feature map output after three continuous deconvolutions of the first feature map is 128×128×64;
the size of the heatmap output by the first output network is 128×128×C, where C represents the number of traffic light indication direction category labels; the window size of the maximum pooling layer is 3×3;
the feature size of the center point offsets of the traffic light outsourcing rectangular frames output by the second output network is 128×128×2;
the feature size of the heights and widths of the traffic light outsourcing rectangular frames output by the third output network is 128×128×2.
S3: training the traffic signal lamp identification model by minimizing a total loss function using the image dataset; the total loss function $L_{det}$ for a single picture sample is:

$L_{det} = L_k + \alpha_{off} L_{off} + \alpha_{size} L_{size}$

wherein $\alpha_{off}$ and $\alpha_{size}$ are weight coefficients;
$L_k$ is the loss of the center points of the outsourcing rectangular frames:

$L_k = -\frac{1}{N}\sum_{n=1}^{N} L_n$

$L_n$ is the loss for outsourcing rectangular frame n:

$L_n = \begin{cases}\left(1-\hat{Y}_n\right)^{\alpha}\log\hat{Y}_n, & Y_n = 1\\ \left(1-Y_n\right)^{\beta}\hat{Y}_n^{\alpha}\log\left(1-\hat{Y}_n\right), & \text{otherwise}\end{cases}$

wherein: N represents the number of traffic light outsourcing rectangular frames in the picture sample; $\hat{Y}_n$ is the predicted value at the center point position of traffic light outsourcing rectangular frame n in the heat map, as predicted by the first output network, and $Y_n$ is the actual value at that position; $\hat{Y}_n$ and $Y_n$ both take values in $[0,1]$, the value representing the probability that the point is the center point of a traffic light outsourcing rectangular frame (0 means the point is not the center point, 1 means it is); a value of $\hat{Y}_n$ closer to 1 indicates that the predicted center point position is closer to the actual center point; $\alpha$, $\beta$ are hyper-parameters;
$L_{off}$ is the outsourcing rectangular frame center point offset loss:

$L_{off} = \frac{1}{N}\sum_{n=1}^{N}\left|\hat{O}_n - o_n\right|$

wherein: $\hat{O}_n$ is the center point offset estimate of traffic light outsourcing rectangular frame n output by the second output network, and $o_n$ is the actual center point offset of traffic light outsourcing rectangular frame n; $\hat{O}\in\mathbb{R}^{\frac{W}{R}\times\frac{H}{R}\times 2}$, where W and H are the width and height of each picture sample, and R represents the downsampling multiple of the second feature map relative to the picture sample;
$L_{size}$ is the outsourcing rectangular frame size loss:

$L_{size} = \frac{1}{N}\sum_{n=1}^{N}\left|\hat{S}_n - s_n\right|$

wherein: $\hat{S}_n$ is the estimated height and width of traffic light outsourcing rectangular frame n after R-times downsampling, and $s_n$ is the actual height and width of traffic light outsourcing rectangular frame n after R-times downsampling.
When the loss function is calculated, the actual center point offset of traffic light outsourcing rectangular frame n can be calculated from the labeling values of the picture sample; the calculation formula is:

$o_n = \frac{p}{R} - \left\lfloor\frac{p}{R}\right\rfloor$

wherein: $\frac{p}{R}$ represents the floating point coordinates of the center point p of traffic light outsourcing rectangular frame n in the picture sample after R-times downsampling, $\left\lfloor\frac{p}{R}\right\rfloor$ represents the integer coordinates of the center point p after R-times downsampling, and $\lfloor\cdot\rfloor$ denotes rounding down.
S4, inputting a picture to be identified containing traffic signals into the traffic signal identification model; the first output network outputs the traffic light outsourcing rectangular frame center points of the different traffic light indication direction categories, the second output network outputs the offset of each traffic light outsourcing rectangular frame center point, and the third output network outputs the height and width of each traffic light outsourcing rectangular frame; in the heatmap, a calibration frame of each traffic light outsourcing rectangular frame is obtained from the output center point, offset, height and width, and the calibration frame is remapped to the picture to be identified.
The calculation formula for obtaining the calibration frame of a traffic light outsourcing rectangular frame in the heatmap from the output center point, offset, height and width is:

$(x_1, y_1) = \left(\hat{x}+\delta\hat{x}-\frac{\hat{w}}{2},\ \hat{y}+\delta\hat{y}-\frac{\hat{h}}{2}\right)$

$(x_2, y_2) = \left(\hat{x}+\delta\hat{x}+\frac{\hat{w}}{2},\ \hat{y}+\delta\hat{y}+\frac{\hat{h}}{2}\right)$

wherein: $(x_1, y_1)$ and $(x_2, y_2)$ respectively denote the upper-left and lower-right corner coordinates of the calibration frame; $(\hat{x}, \hat{y})$ is the center point of the traffic light outsourcing rectangular frame output by the first output network in the heatmap; $(\delta\hat{x}, \delta\hat{y})$ is the center point offset of that frame output by the second output network; and $(\hat{h}, \hat{w})$ are the height and width of that frame output by the third output network.
The above method is applied in one embodiment to show its specific implementation and technical effects.
Examples
In this embodiment, the basic flow of the method for identifying traffic lights based on heatmap key point extraction is to construct a network model, train it to obtain a network model with low loss, and then predict with that model. The specific steps are as follows:
1. data collection
20,000 picture samples containing traffic lights, shot by intersection monitoring cameras, were collected. One picture sample is shown in fig. 2; the traffic lights in the picture are small, and a common deep network cannot identify them directly and accurately.
2. Sample labeling
The picture samples are labeled using the labelme open-source software. The traffic light parts in the pictures are marked in the form of boxes and labeled by color and indication direction into the categories red light forward, red light left turn, red light right turn, round red light, green light forward, green light left turn, green light right turn, round green light, yellow light forward, yellow light left turn, yellow light right turn and round yellow light. Labeling generates an annotation file in JSON format, whose content generally comprises $(c, x_1, y_1, x_2, y_2)$, where c represents the traffic light indication direction category, and $(x_1, y_1)$ and $(x_2, y_2)$ respectively represent the upper-left and lower-right corner coordinates of the traffic light outsourcing rectangular frame (hereinafter also referred to as the target frame). Fig. 3 shows the labeled picture sample corresponding to fig. 2.
The pictures and the box annotation files are used as the image dataset, which is divided into a training set, a test set and a validation set in the ratio 8:1:1.
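An 8:1:1 split of the 20,000 labeled samples can be sketched as follows (hypothetical helper, seeded for reproducibility; the patent does not specify the splitting procedure):

```python
import random

def split_dataset(samples, seed=0):
    """Shuffle labelled samples and split them train:test:validation = 8:1:1."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n = len(samples)
    n_train = int(n * 0.8)
    n_test = int(n * 0.1)
    train = [samples[i] for i in idx[:n_train]]
    test = [samples[i] for i in idx[n_train:n_train + n_test]]
    val = [samples[i] for i in idx[n_train + n_test:]]
    return train, test, val
```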
3. Traffic signal lamp identification model construction
A traffic signal lamp identification model is constructed: training-set pictures are input into the resnet_50 network for feature extraction and training, the feature map is deconvoluted three times to obtain a feature map with higher resolution, and on this basis three networks are added to calculate the key points, the width and height of the frame corresponding to each key point, and the offset used to correct the positions of the key points and frames. The identification model is described in detail below.
Referring to fig. 4, the traffic light recognition model uses a resnet_50 network as the skeleton network; the structure of resnet_50 belongs to the prior art and is not described in detail. The output of the resnet_50 network is recorded as the first feature map; the first feature map is passed through three continuous deconvolution layers to output the second feature map, and the second feature map is respectively input into the first, second and third output networks. The first output network passes the second feature map through two successive convolution layers to obtain a heatmap, which consists of the heat maps of all traffic light indication direction categories; that is, the number of channels equals the number of traffic light indication direction categories. For the heat map of each traffic light indication direction category, the first output network judges whether the value at each position is not smaller than the values at the eight surrounding neighbouring positions. This can be realized by a 3×3 max-pooling operation (similar in effect to NMS in anchor-based detection): it is judged whether the central pixel of the pooling window is the maximum; if so, it is taken as a hot spot, otherwise it is skipped. In this way the heat map of each traffic light indication direction category generates some hot spots (more than C hot spots can be retained for the subsequent threshold judgment; 100 hot spots are taken in this embodiment); these hot spots may be the center points of signal lamp outsourcing rectangular frames and need further identification.
For further identification, the hot spots in the heat map of each traffic light indication direction category whose position value is smaller than a threshold G are screened out, and the remaining hot spots are output as the center points (predicted values) of the traffic light outsourcing rectangular frames in the heat map of that traffic light indication direction category. The threshold G in this embodiment is set to 0.3.
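The 3×3 max-pooling check and threshold screening described above amount to NMS-free peak extraction; a pure-Python sketch for a single category's heat map (function and variable names are illustrative):

```python
def extract_hotspots(heat, threshold=0.3):
    """NMS-free peak extraction from one category's heat map (2-D list):
    a cell is a hot spot if its value equals the maximum of its 3x3
    neighbourhood (the max-pooling check) and is not below `threshold`."""
    rows, cols = len(heat), len(heat[0])
    peaks = []
    for i in range(rows):
        for j in range(cols):
            v = heat[i][j]
            if v < threshold:
                continue                          # threshold G screening
            neighbourhood = [
                heat[a][b]
                for a in range(max(0, i - 1), min(rows, i + 2))
                for b in range(max(0, j - 1), min(cols, j + 2))
            ]
            if v >= max(neighbourhood):           # local-maximum check
                peaks.append((i, j, v))
    return peaks
```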
In addition, the second output network outputs the offset of the center point of each traffic light outsourcing rectangular frame after continuously passing through the two convolution layers.
Similarly, the third output network passes the second feature map through two consecutive convolution layers and outputs the height and width of each traffic light outsourcing rectangular frame.
In this embodiment, for an input picture $I \in R^{W \times H \times 3}$, where W is the picture width and H the height, the model outputs a keypoint heat map $\hat{Y} \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}$, where R is the downsampling multiple and C is the number of traffic light indication direction categories. In the present invention, W and H are 512 pixels, C is 12, and R is 4.
Referring to fig. 4, the parameters of each network layer are as follows: the input picture size of the resnet_50 network is 512×512×3 and its output first feature map size is 16×16×2048; the second feature map output after three consecutive deconvolutions of the first feature map has size 128×128×64; the hetmap output by the first output network has size 128×128×C, where C is the number of traffic light indication direction category labels, and the window size of the maximum pooling layer is 3×3; the centre point offset feature output by the second output network has size 128×128×2; the height-and-width feature output by the third output network has size 128×128×2. In summary, a 512×512×3 input picture is reduced by the resnet_50 network to a 16×16×2048 feature map, three deconvolution operations then produce a 128×128×64 feature map, and the three output networks applied to it yield a 128×128×C keypoint heat map, 128×128×2 width-and-height parameters and 128×128×2 offset adjustment parameters.
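As a sanity check on these dimensions, the stage-by-stage sizes can be traced with simple arithmetic (ResNet-50 backbone with overall stride 32 and 2048 output channels, three stride-2 deconvolutions, then the three heads). The helper below is purely illustrative shape bookkeeping, not part of the patented model:

```python
def trace_shapes(W=512, H=512, C=12):
    """Return the feature sizes of each stage named in the text above."""
    # resnet_50 backbone: overall stride 32, 2048 output channels
    first = (W // 32, H // 32, 2048)
    # three consecutive stride-2 deconvolutions: 16 -> 32 -> 64 -> 128, 64 channels
    second = (first[0] * 2 ** 3, first[1] * 2 ** 3, 64)
    heads = {
        "hetmap": (second[0], second[1], C),  # one channel per direction category
        "offset": (second[0], second[1], 2),  # (dx, dy) per centre point
        "size":   (second[0], second[1], 2),  # (width, height) per box
    }
    return first, second, heads

first, second, heads = trace_shapes()
```

With the embodiment's values (W = H = 512, C = 12) this reproduces the 16×16×2048, 128×128×64 and 128×128×{12, 2, 2} sizes listed above.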
4. Traffic signal lamp recognition model training
After the traffic signal lamp identification model is built, it can be trained on the labelled image data set by minimizing a total loss function. A reasonable total loss function must be set so that the model can converge during training.
1) The centre point of a target frame is not marked directly in the picture sample, so it must be computed from the geometric relation of the labelled corners: the centre coordinate is $p = \left(\frac{x_1+x_2}{2}, \frac{y_1+y_2}{2}\right)$. The centre point coordinates must be mapped onto the downsampled 128×128 feature map before the loss function can be computed, so equation (1) is used to distribute the labelled target boxes onto the 4×-downsampled 128×128 feature map as Gaussian kernels; the result is shown in fig. 5. Equation (1) is a Gaussian filtering process, giving the feature value at each position (x, y) of the mapped feature map:

$$Y_{xyc} = \exp\left(-\frac{(x-\tilde{p}_x)^2 + (y-\tilde{p}_y)^2}{2\sigma_p^2}\right) \qquad (1)$$

wherein $\sigma_p$ is a standard deviation associated with the target size (width and height), and $\tilde{p} = (\tilde{p}_x, \tilde{p}_y)$ is the integer coordinate of the centre point at 128×128 resolution. If two Gaussian distributions of the same class overlap, the maximum value is taken directly.
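The Gaussian splatting of equation (1), including the max-merge of overlapping kernels, can be sketched as below. This is an illustrative NumPy version (the function name and the toy centres are our own; $\sigma_p$ is simply passed in rather than derived from the box size):

```python
import numpy as np

def splat_gaussian(heatmap, cx, cy, sigma):
    """Place one keypoint on a single-class heatmap using equation (1):
    Y[y, x] = exp(-((x - cx)^2 + (y - cy)^2) / (2 * sigma^2)).
    Overlapping Gaussians of the same class keep the element-wise maximum."""
    H, W = heatmap.shape
    xs = np.arange(W)[None, :]
    ys = np.arange(H)[:, None]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)  # overlap resolved by max, not sum
    return heatmap

hm = np.zeros((128, 128))
splat_gaussian(hm, 98, 2, sigma=2.0)
splat_gaussian(hm, 100, 2, sigma=2.0)  # a second, overlapping centre
```

Both centres keep a value of exactly 1, and the cell between them holds the larger of the two Gaussian responses rather than their sum.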
2) Before training, the total loss function $L_{det}$ is defined as shown in equation (2) and comprises three parts:

$$L_{det} = L_k + \alpha_{off} L_{off} + \alpha_{size} L_{size} \qquad (2)$$

wherein $\alpha_{off}$ and $\alpha_{size}$ are weight coefficients, set to 1 and 0.1 respectively in this embodiment.
The first part of the total loss function is the target centre point loss, shown in equation (3). $L_k$ is the loss of the centre points of the outsourcing rectangular frames:

$$L_k = -\frac{1}{N}\sum_{n} L_n \qquad (3)$$

where the sum runs over the positions n of the heat map and $L_n$ is the loss at position n:

$$L_n = \begin{cases} \left(1-\hat{Y}_n\right)^{\alpha} \log\left(\hat{Y}_n\right), & Y_n = 1 \\[4pt] \left(1-Y_n\right)^{\beta} \left(\hat{Y}_n\right)^{\alpha} \log\left(1-\hat{Y}_n\right), & \text{otherwise} \end{cases} \qquad (4)$$

wherein N is the number of traffic light outsourcing rectangular frames in the picture sample; $\hat{Y}_n$ is the predicted value, output by the first output network, of the centre point position of the traffic light outsourcing rectangular frame in the heat map, and $Y_n$ is the actual (ground-truth) value of that centre point position in the heat map. Both $\hat{Y}_n$ and $Y_n$ take values in [0, 1], the value representing the probability that the point is the centre point of a traffic light outsourcing rectangular frame: 0 means the point is not the centre point and 1 means it is. α and β are hyper-parameters, set to 2 and 4 respectively in this embodiment.
In equation (3), N is the total number of labelled outsourcing rectangular frames in image I, used to normalise the positive focal loss. The invention modifies the Focal Loss so as to appropriately reduce the training proportion, i.e. the loss value, of key points that are easy to predict.
When $Y_n = 1$, the factor $\left(1-\hat{Y}_n\right)^{\alpha}$ acts as a corrector: if $\hat{Y}_n$ approaches 1 the point is relatively easy to detect and the factor is small, whereas if $\hat{Y}_n$ is near 0 the centre point has not yet been learned, so the training weight, and hence the factor, is large. When $Y_n \neq 1$, the factor $\left(1-Y_n\right)^{\beta}$ adjusts the training weight of the points neighbouring the actual centre point.

For points near the actual centre point, $Y_n$ is close to 1, e.g. $Y_n = 0.9$. If the predicted value $\hat{Y}_n$ at such a point is nevertheless close to 1, this is clearly wrong — it should be detected as 0 — so the factor $\left(\hat{Y}_n\right)^{\alpha}$ is used as a penalty to increase the loss weight; but because such a detected point is very close to the actual centre point, $\left(1-Y_n\right)^{\beta}$ softens the penalty and reduces the loss weight.

For points far from the actual centre point, $Y_n$ is close to 0, e.g. $Y_n = 0.1$. If the predicted value $\hat{Y}_n$ is close to 1, this is certainly wrong and $\left(\hat{Y}_n\right)^{\alpha}$ again penalises it (same principle as above); if the prediction is close to 0, the loss weight is made small. As for $\left(1-Y_n\right)^{\beta}$: the farther a point lies from the centre the larger this factor, and the nearer the smaller. This weakens the loss weight of the negative samples immediately surrounding the actual centre point, which amounts to handling the imbalance between positive and negative samples.
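The weighting behaviour described above can be checked with a small NumPy sketch of the penalty-reduced focal loss (function name, `eps` guard and toy values are our own; the formula follows equations (3)-(4) with α = 2, β = 4):

```python
import numpy as np

def center_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-12):
    """Penalty-reduced focal loss over heat-map positions.
    pred, gt: arrays in [0, 1]; gt == 1 marks annotated centre points,
    other positions hold the Gaussian-smoothed ground truth."""
    pos = gt == 1.0
    n = max(int(pos.sum()), 1)  # normalise by the number of centre points
    pos_loss = ((1 - pred) ** alpha * np.log(pred + eps))[pos].sum()
    neg_loss = ((1 - gt) ** beta * pred ** alpha * np.log(1 - pred + eps))[~pos].sum()
    return -(pos_loss + neg_loss) / n

easy_pos = center_focal_loss(np.array([0.95]), np.array([1.0]))  # well-learned centre
hard_pos = center_focal_loss(np.array([0.05]), np.array([1.0]))  # unlearned centre
near_neg = center_focal_loss(np.array([0.9]), np.array([0.9]))   # false 1 next to centre
far_neg  = center_focal_loss(np.array([0.9]), np.array([0.0]))   # false 1 far away
```

As the text argues, an easy positive contributes far less loss than an unlearned one, and a wrong high prediction near the centre is punished much more gently than the same prediction far from it.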
The second part is the outsourcing rectangular frame centre point offset loss, as shown in equation (5):

$$L_{off} = \frac{1}{N}\sum_{n}\left|\hat{O}_n - O_n\right| \qquad (5)$$

wherein N is the total number of marked instances in image I; $\hat{O}_n \in R^{\frac{W}{R} \times \frac{H}{R} \times 2}$ is the centre point offset estimate of traffic light outsourcing rectangular frame n output by the second output network, and $O_n$ is the actual value of the centre point offset of traffic light outsourcing rectangular frame n. W and H are the width and height of each picture sample, and R is the downsampling multiple of the second feature map relative to the picture sample; in this embodiment R = 4.
The actual value of the centre point offset of traffic light outsourcing rectangular frame n is computed as:

$$O_n = \frac{p}{R} - \tilde{p} \qquad (6)$$

wherein $\frac{p}{R}$ is the floating-point coordinate of the centre point p of traffic light outsourcing rectangular frame n in the picture sample after R-times downsampling, and $\tilde{p} = \left\lfloor \frac{p}{R} \right\rfloor$ is the integer coordinate of the centre point p after R-times downsampling.
Because the image is downsampled by R = 4 during training, remapping the feature map onto the original image introduces an accuracy error; therefore, for each centre point, an additional offset $\hat{O}$ is used to compensate for it.
The centre points of all classes share the same offset prediction, and this offset value is trained with an L1 loss. The so-called centre point offset actually arises during the mapping process from the difference between floating-point coordinates and integer coordinates. For a 512×512 image with R = 4, the downsampled image is a 128×128 feature map, and the labelled target boxes are distributed on it as Gaussian kernels. The real coordinates in the JSON file are likewise first converted to match the 128×128 image; because the real coordinates are floating-point numbers, the converted centre point is floating-point as well. Suppose the converted centre point is [98.97667, 2.3566666]. During prediction, an image of, say, 640×780 is first read in, deformed to 512×512, then 4×-downsampled to 128×128; suppose prediction on the 128×128 map yields the coordinates {x = 98, y = 2} with category c, meaning an object is present at that point. This differs from the actually marked centre point {x = 98.97667, y = 2.3566666}, and mapping directly back onto the 512×512 picture would certainly lose accuracy; hence the offset loss of equation (5) is introduced.
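The offset definition above reduces to taking the fractional part that truncation throws away. A minimal sketch (helper name is our own), reusing the worked example from the text — the centre whose 128×128-scale coordinate is [98.97667, 2.3566666]:

```python
def center_offset(px, py, R=4):
    """Ground-truth offset O_n = p/R - floor(p/R): the sub-pixel remainder
    lost when the downsampled centre is truncated to an integer heatmap cell."""
    fx, fy = px / R, py / R            # floating-point centre at 1/R scale
    ix, iy = int(fx), int(fy)          # integer cell that stores the keypoint
    return (ix, iy), (fx - ix, fy - iy)

# Centre at original-image coordinates (98.97667*4, 2.3566666*4) lands in
# heatmap cell (98, 2) with a sub-pixel offset the network must recover.
cell, offset = center_offset(98.97667 * 4, 2.3566666 * 4)
```

The cell (98, 2) is what the heat map can express; the pair (0.97667, 0.3566666) is exactly the ground-truth offset regressed by the second output network.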
The third part is the loss of the size (width and height) of the outsourcing rectangular frame, as shown in equation (7):

$$L_{size} = \frac{1}{N}\sum_{n=1}^{N}\left|\hat{S}_n - s_n\right| \qquad (7)$$

wherein $\hat{S}_n$ is the estimate of the height and width of traffic light outsourcing rectangular frame n after R-times downsampling, and $s_n$ is the actual height and width of traffic light outsourcing rectangular frame n after R-times downsampling (at the 128×128 low resolution).

Assume $\left(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)}\right)$ is the outsourcing rectangular frame of object k with category $c_k$; its centre point is $p_k = \left(\frac{x_1^{(k)}+x_2^{(k)}}{2}, \frac{y_1^{(k)}+y_2^{(k)}}{2}\right)$, and the width and height of target k are $s_k = \frac{1}{R}\left(x_2^{(k)}-x_1^{(k)},\ y_2^{(k)}-y_1^{(k)}\right)$, i.e. the width-height value after downsampling. To reduce the difficulty of regression, the invention uses the size estimate $\hat{S}_{p_k}$ at the centre point as the predicted value; the loss function is as shown in equation (7).
It should be noted that the aforementioned total loss function $L_{det}$ is the loss of a single picture sample; model training minimises the average loss over all samples and obtains the optimal model parameters by gradient descent.
5. Target prediction and extraction
A picture to be identified containing traffic lights is input into the trained traffic light identification model; the first output network outputs the centre points of the traffic light outsourcing rectangular frames of the different traffic light indication direction categories, the second output network outputs the offset of each centre point, and the third output network outputs the height and width of each traffic light outsourcing rectangular frame.
Finally, in the hetmap, a calibration frame of each traffic light outsourcing rectangular frame is obtained from the output centre point, offset, height and width, and the calibration frame is remapped onto the picture to be identified.
Assume $\hat{p} = (\hat{x}, \hat{y})$ is a detected centre point of class c. From the output centre point, offset and height-and-width of the traffic light outsourcing rectangular frame in the hetmap, the calibration frame is computed as shown in equation (8):

$$(x_1, y_1) = \left(\hat{x} + \delta\hat{x} - \frac{\hat{w}}{2},\ \hat{y} + \delta\hat{y} - \frac{\hat{h}}{2}\right), \qquad (x_2, y_2) = \left(\hat{x} + \delta\hat{x} + \frac{\hat{w}}{2},\ \hat{y} + \delta\hat{y} + \frac{\hat{h}}{2}\right) \qquad (8)$$

wherein $(x_1, y_1)$ and $(x_2, y_2)$ respectively represent the upper-left and lower-right corner coordinates of the calibration frame, $(\hat{x}, \hat{y})$ is the centre point of a traffic light outsourcing rectangular frame output by the first output network in the hetmap, $(\delta\hat{x}, \delta\hat{y})$ is the offset of that centre point output by the second output network, and $(\hat{h}, \hat{w})$ are the height and width of the traffic light outsourcing rectangular frame output by the third output network.
The calibration frame obtained above is a calibrated coordinate in the hetmap; it must be mapped back to the original picture according to the original downsampling multiple to obtain the traffic signal lamp calibration frame in the picture to be identified.
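The decoding plus remapping steps above can be sketched as a single helper (name and toy inputs are our own; it combines the calibration-frame formula with the multiply-by-R remap, reusing the offset from the earlier worked example):

```python
def decode_box(cx, cy, dx, dy, w, h, R=4):
    """Turn a detected centre (cx, cy), its predicted offset (dx, dy) and
    predicted size (w, h) -- all at heatmap resolution -- into a
    corner-format box, then remap to input-image coordinates by stride R."""
    x1 = (cx + dx - w / 2.0) * R
    y1 = (cy + dy - h / 2.0) * R
    x2 = (cx + dx + w / 2.0) * R
    y2 = (cy + dy + h / 2.0) * R
    return x1, y1, x2, y2

# A centre detected at cell (98, 2) with offset (0.97667, 0.3566666) and a
# 10x4 (w x h) box at heatmap scale, mapped back to the 512x512 input:
box = decode_box(98, 2, 0.97667, 0.3566666, 10, 4)
```

The sub-pixel offset moves the corners by up to R pixels at input resolution, which is exactly the accuracy error the offset head exists to remove.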
In this embodiment, to show the effect of the method described above, it was applied to the same validation set together with Yolo3; the prediction results show an accuracy of 87.67% for Yolo3 versus 95.48% for the present invention. In addition, the invention achieves high prediction accuracy under different weather conditions: fig. 6 is a prediction result diagram in a rainy environment, fig. 7 a prediction result diagram in a night exposure environment, and fig. 8 a prediction result diagram in a clear daytime environment.
In addition, in another embodiment, based on the above identification method, there may further be provided an identification device of traffic lights based on hetmap key point extraction, which includes a memory and a processor;

the memory is used for storing a computer program;

the processor is used for implementing the above identification method of the traffic signal lamp based on hetmap key point extraction when executing the computer program.

In addition, in another embodiment, based on the above identification method, a computer-readable storage medium may further be provided, on which a computer program is stored; when the computer program is executed by a processor, the above identification method of the traffic signal lamp based on hetmap key point extraction is implemented.
It should be noted that the Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one magnetic disk Memory. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processing, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. Of course, the apparatus should also have necessary components to implement the program operation, such as a power supply, a communication bus, and the like.
The above embodiment is only a preferred embodiment of the present invention, but it is not intended to limit the present invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, all the technical schemes obtained by adopting the equivalent substitution or equivalent transformation are within the protection scope of the invention.

Claims (10)

1. The identifying method of the traffic signal lamp based on the key point extraction of the hetmap is characterized by comprising the following steps:
s1, acquiring an image data set composed of picture samples containing traffic lights, wherein each picture sample is provided with a label, and the label comprises an outsourcing rectangular frame of the traffic lights and a traffic light indication direction category;
s2, constructing a traffic signal lamp identification model by taking a resnet_50 network as a skeleton network, outputting a second feature map after continuous three-layer deconvolution of a first feature map output by the resnet_50 network, and respectively inputting the second feature map into a first output network, a second output network and a third output network; the first output network continuously passes the second feature map through two convolution layers to obtain a heat map, the heat map consists of heat maps of all traffic light indication direction categories, the heat map of each traffic light indication direction category in the heat map is extracted, whether the value of each position in the heat map is not smaller than eight surrounding neighboring positions is judged, and if the value of each position in the heat map is not smaller than the value of each neighboring position, the heat map is used as a hot spot; aiming at the hot spots in the heat map of each traffic light indication direction category, screening out the hot spots with the hot spot position value smaller than the threshold value by setting the threshold value, and outputting the rest hot spots as the center points of the rectangle frame of the traffic light outer package in the heat map of the traffic light indication direction category; the second output network outputs the offset of the center point of each traffic light outsourcing rectangular frame after continuously passing through the two convolution layers; the third output network outputs the height and width of each traffic light outsourcing rectangular frame after continuously passing through the second feature map through the two convolution layers;
s3: training the traffic signal lamp identification model by minimizing a total loss function using the image dataset; total loss function L for single picture sample det The method comprises the following steps:
L det =L koff L offsize L size
wherein: alpha off And alpha size Is a weight coefficient;
$L_k$ is the loss of the centre points of the outsourcing rectangular frames:

$$L_k = -\frac{1}{N}\sum_{n} L_n$$

where the sum runs over the positions n of the heat map and $L_n$ is the loss at position n:

$$L_n = \begin{cases} \left(1-\hat{Y}_n\right)^{\alpha} \log\left(\hat{Y}_n\right), & Y_n = 1 \\[4pt] \left(1-Y_n\right)^{\beta} \left(\hat{Y}_n\right)^{\alpha} \log\left(1-\hat{Y}_n\right), & \text{otherwise} \end{cases}$$

wherein N represents the number of traffic light outsourcing rectangular frames in the picture sample; $\hat{Y}_n$ is the predicted value of the centre point position of the traffic light outsourcing rectangular frame obtained by the first output network in the heat map, and $Y_n$ is the actual value of that centre point position in the heat map; $\hat{Y}_n$ and $Y_n$ both take values in [0, 1], the value representing the probability that the point is the centre point of a traffic light outsourcing rectangular frame, 0 indicating that the point is not the centre point and 1 indicating that it is; α, β are hyper-parameters;
$L_{off}$ is the outsourcing rectangular frame centre point offset loss:

$$L_{off} = \frac{1}{N}\sum_{n}\left|\hat{O}_n - O_n\right|$$

wherein $\hat{O}_n \in R^{\frac{W}{R} \times \frac{H}{R} \times 2}$ is the centre point offset estimate of traffic light outsourcing rectangular frame n output by the second output network and $O_n$ is the actual value of the centre point offset of traffic light outsourcing rectangular frame n; W and H are the width and height of each picture sample, and R represents the downsampling multiple of the second feature map relative to the picture samples;
$L_{size}$ is the outsourcing rectangular frame size loss:

$$L_{size} = \frac{1}{N}\sum_{n=1}^{N}\left|\hat{S}_n - s_n\right|$$

wherein $\hat{S}_n$ is the estimate of the height and width of traffic light outsourcing rectangular frame n after R-times downsampling and $s_n$ is the actual height and width of traffic light outsourcing rectangular frame n after R-times downsampling;
s4, inputting a picture to be identified containing traffic signals into the traffic signal identification model, outputting traffic light outsourcing rectangular frame center points of different traffic light indication direction categories by the first output network, outputting offset of each traffic light outsourcing rectangular frame center point by the second output network, and outputting the height and the width of each traffic light outsourcing rectangular frame by the third output network; in the Heatmap, according to the central point, the offset, the height and the width of the output traffic light outsourcing rectangular frame, a calibration frame of the traffic light outsourcing rectangular frame is obtained, and the calibration frame is remapped to the picture to be identified.
2. The method for identifying traffic light based on the extraction of key points from a hematmap according to claim 1, wherein in S1, the picture sample is a picture including a traffic light captured by a vehicle recorder or a picture including a traffic light captured by an intersection monitoring camera during driving of an automobile.
3. The method for identifying traffic lights based on the extraction of key points from the hetmap as claimed in claim 1, wherein in S1, the labeling content of each picture sample is $(c, x_1, y_1, x_2, y_2)$, where c represents the traffic light indication direction category and $(x_1, y_1)$, $(x_2, y_2)$ respectively represent the upper-left and lower-right corner coordinates of the traffic light outsourcing rectangular frame.
4. The method for identifying traffic lights based on the extraction of key points from a hetmap according to claim 1, wherein in S2, parameters of each network layer in the traffic light identification model are as follows:
the input picture size of the resnet_50 network is 512×512×3, and the output first feature picture size is 16×16×2048;
the size of a second feature map output by the first feature map after continuous three-layer deconvolution is 128×128×64;
the size of the hetmap output by the first output network is 128×128×C, where C represents the number of the traffic light indication direction category labels, and the window size of the maximum pooling layer is 3×3;

the feature size of the centre point offset of each traffic light outsourcing rectangular frame output by the second output network is 128×128×2;

and the feature size of the height and width of each traffic light outsourcing rectangular frame output by the third output network is 128×128×2.
5. The method for identifying traffic signals based on the extraction of key points from the hetmap as claimed in claim 1, wherein in S3, the calculation formula of the actual value of the centre point offset of traffic light outsourcing rectangular frame n is:

$$O_n = \frac{p}{R} - \tilde{p}$$

wherein $\frac{p}{R}$ represents the floating-point coordinate of the centre point p of traffic light outsourcing rectangular frame n in the picture sample after R-times downsampling, and $\tilde{p} = \left\lfloor \frac{p}{R} \right\rfloor$ represents the integer coordinate of the centre point p after R-times downsampling.
6. The method for identifying traffic lights based on the extraction of key points from the hetmap as claimed in claim 1, wherein in S4, the calibration frame of the traffic light outsourcing rectangular frame in the hetmap is:

$$(x_1, y_1) = \left(\hat{x} + \delta\hat{x} - \frac{\hat{w}}{2},\ \hat{y} + \delta\hat{y} - \frac{\hat{h}}{2}\right), \qquad (x_2, y_2) = \left(\hat{x} + \delta\hat{x} + \frac{\hat{w}}{2},\ \hat{y} + \delta\hat{y} + \frac{\hat{h}}{2}\right)$$

wherein $(x_1, y_1)$ and $(x_2, y_2)$ respectively indicate the upper-left and lower-right corner coordinates of the calibration frame, $(\hat{x}, \hat{y})$ is the centre point coordinate of a traffic light outsourcing rectangular frame output by the first output network in the hetmap, $(\delta\hat{x}, \delta\hat{y})$ is the offset of that centre point output by the second output network, and $(\hat{h}, \hat{w})$ are the height and width of the traffic light outsourcing rectangular frame output by the third output network.
7. The method for identifying traffic lights based on a hectmap extraction key point as claimed in claim 1, wherein the weight coefficient α off And alpha size Set to 1 and 0.1, respectively.
8. The method for identifying traffic lights based on the extraction of key points from a hetmap according to claim 1, wherein the super parameters α, β are set to 2 and 4, respectively.
9. The identifying device of the traffic signal lamp based on the key points extracted by the hetmap is characterized by comprising a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to implement the method for identifying traffic lights based on the extraction of key points by the hemmap according to any one of claims 1 to 8 when the computer program is executed.
10. A computer-readable storage medium, wherein the storage medium has stored thereon a computer program which, when executed by a processor, implements the method for identifying traffic signals based on the extraction of key points by the hetmap according to any one of claims 1 to 8.
CN202011553049.1A 2020-12-24 2020-12-24 Identification method, device and medium of traffic signal lamp based on key point extraction of hetmap Active CN112488083B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011553049.1A CN112488083B (en) 2020-12-24 2020-12-24 Identification method, device and medium of traffic signal lamp based on key point extraction of hetmap

Publications (2)

Publication Number Publication Date
CN112488083A CN112488083A (en) 2021-03-12
CN112488083B true CN112488083B (en) 2024-04-05

Family

ID=74914351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011553049.1A Active CN112488083B (en) 2020-12-24 2020-12-24 Identification method, device and medium of traffic signal lamp based on key point extraction of hetmap

Country Status (1)

Country Link
CN (1) CN112488083B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392747A (en) * 2021-06-07 2021-09-14 北京优创新港科技股份有限公司 Goods packing box identification method and system for stereoscopic warehouse
CN113989774A (en) * 2021-10-27 2022-01-28 广州小鹏自动驾驶科技有限公司 Traffic light detection method and device, vehicle and readable storage medium
CN114549961B (en) * 2022-03-02 2023-04-07 北京百度网讯科技有限公司 Target object detection method, device, equipment and storage medium
CN114694123B (en) * 2022-05-30 2022-09-27 阿里巴巴达摩院(杭州)科技有限公司 Traffic signal lamp sensing method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069986A (en) * 2019-03-13 2019-07-30 北京联合大学 A kind of traffic lights recognition methods and system based on mixed model
CN111191566A (en) * 2019-12-26 2020-05-22 西北工业大学 Optical remote sensing image multi-target detection method based on pixel classification
CN111738212A (en) * 2020-07-20 2020-10-02 平安国际智慧城市科技股份有限公司 Traffic signal lamp identification method, device, equipment and medium based on artificial intelligence


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant