CN112215188A - Traffic police gesture recognition method, device, equipment and storage medium - Google Patents

Traffic police gesture recognition method, device, equipment and storage medium

Info

Publication number
CN112215188A
CN112215188A
Authority
CN
China
Prior art keywords
neural network
deep learning
network model
traffic police
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011132576.5A
Other languages
Chinese (zh)
Inventor
吴晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202011132576.5A
Publication of CN112215188A
Legal status: Pending

Classifications

    • G06V 40/20 — Image or video recognition: recognition of biometric, human-related or animal-related patterns; movements or behaviour, e.g. gesture recognition
    • G06F 18/214 — Pattern recognition: design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/23213 — Pattern recognition: clustering techniques; non-hierarchical techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/241 — Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045 — Computing arrangements based on biological models: neural networks; combinations of networks
    • G06N 3/08 — Computing arrangements based on biological models: neural networks; learning methods


Abstract

The invention discloses a traffic police gesture recognition method, device, equipment and storage medium, wherein the method comprises the following steps: collecting images under various traffic environments and preprocessing the images; clustering the preprocessed images based on a kmeans++ clustering algorithm to generate a plurality of anchor frames; constructing a YOLOv3 deep learning neural network model and training it by adopting a cross entropy loss function and an EIOU loss function to obtain an EIOU-YOLOv3 deep learning neural network model; performing traffic police posture feature extraction and detection on the preprocessed images by using the EIOU-YOLOv3 deep learning neural network model to obtain a plurality of feature maps with different scales; extracting and identifying traffic police posture features on the feature maps by using the anchor frames to obtain prediction frames; and removing redundant prediction frames by adopting a Soft-NMS algorithm to obtain target prediction frames and identification results. In this way, the overall accuracy and recall rate of traffic police gesture recognition can be effectively improved.

Description

Traffic police gesture recognition method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a traffic police gesture recognition method, device, equipment and storage medium.
Background
Traffic police are public officers who maintain urban road traffic order and ensure smooth, safe road transportation, and they play an important role in urban road traffic safety. Automatic detection and recognition of traffic police gestures has therefore become an important link in intelligent traffic safety monitoring systems. Commonly used traffic police posture detection methods mainly include methods that build deformable part models combined with classifiers based on features such as LBP, Haar and HOG, and methods based on deep learning. Recognition methods based on deformable part models need to build local models of multiple pedestrians, which requires a large amount of computation and is not robust in complex road environments. Deep-learning-based methods can use convolutional neural networks with shared weights to effectively extract the implicit, essential features of the data, and offer better robustness and recognition accuracy for traffic police postures in road environments. Among them, the YOLOv3-based deep learning method has become one of the popular traffic police gesture recognition algorithms in the industry due to its fast detection speed.
The YOLOv3-based deep learning method achieves high recognition accuracy in simple scenes such as sunny days and daytime, but its accuracy and recall rate are relatively low in difficult scenes such as haze, rain and night, leaving considerable room for improvement.
Disclosure of Invention
The invention provides a traffic police gesture recognition method, device, equipment and storage medium, which can effectively improve the overall accuracy and recall rate of traffic police gesture recognition.
In order to solve the technical problems, the invention adopts a technical scheme that: a traffic police gesture recognition method is provided, which comprises the following steps:
collecting images under various traffic environments and preprocessing the images;
clustering the preprocessed images based on a kmeans++ clustering algorithm to generate a plurality of anchor frames;
constructing a YOLOv3 deep learning neural network model, and training the YOLOv3 deep learning neural network model by adopting a cross entropy loss function and an EIOU loss function to obtain an EIOU-YOLOv3 deep learning neural network model;
performing traffic police attitude feature extraction and detection on the preprocessed image by using the EIOU-YOLOv3 deep learning neural network model to obtain a plurality of feature maps with different scales;
extracting and identifying the traffic police attitude feature on the feature map by using the anchor frame to obtain a prediction frame;
and removing redundant prediction frames by adopting a Soft-NMS algorithm to obtain a target prediction frame and an identification result.
According to one embodiment of the invention, the step of acquiring images in various traffic environments and preprocessing the images comprises the following steps:
collecting image sets under various traffic environments in real time;
selecting an image to be identified from the image set;
marking the traffic police in the image to be identified by using a marking tool to obtain a marking frame;
and randomly dividing the image to be identified after the labeling processing into a training set and a testing set according to a preset proportion.
According to an embodiment of the present invention, the step of clustering the preprocessed images based on the kmeans++ clustering algorithm to generate a plurality of anchor frames includes:
randomly selecting one marking frame from the training set as an initial clustering center;
calculating the distance between each labeling frame and the initial clustering center according to a preset distance formula, and selecting the next clustering center according to the distance calculation result;
and taking the next clustering center as an initial clustering center, repeatedly executing the step of calculating the distance between each marking frame and the initial clustering center according to a preset distance formula, and selecting the next clustering center according to the distance calculation result until nine initial clustering centers are selected, and taking the initial clustering centers as anchor frames.
According to an embodiment of the invention, the step of constructing a YOLOv3 deep learning neural network model and training the YOLOv3 deep learning neural network model by using a cross entropy loss function and an EIOU loss function to obtain an EIOU-YOLOv3 deep learning neural network model includes:
constructing a YOLOv3 deep learning neural network model, and improving the residual connection structure of the YOLOv3 deep learning neural network model from two splicing operations to three weighted summation operations;
calculating the sum of the cross entropy loss function and the EIOU loss function to obtain a total loss function;
and training the improved YOLOv3 deep learning neural network model by adopting the total loss function to obtain an EIOU-YOLOv3 deep learning neural network model.
According to an embodiment of the invention, the step of extracting and detecting the traffic police pose features of the preprocessed image by using the EIOU-YOLOv3 deep learning neural network model to obtain a plurality of feature maps with different scales includes:
converting the size of the preprocessed image into a preset size;
performing traffic police attitude feature extraction on the image after the size conversion by adopting a DarkNet53 network;
and performing up-sampling processing, weighted summation processing and multiple convolution processing on the extracted result of the attitude feature of the traffic police to obtain a plurality of feature maps with different scales.
According to an embodiment of the present invention, the step of performing upsampling, weighted summation and multiple convolution on the extracted result of the traffic police pose feature to obtain a plurality of feature maps with different scales includes:
performing upsampling processing on a first feature matrix output by the DarkNet53 network to obtain a second feature matrix;
performing first weighted summation processing and multiple convolution processing on the first feature matrix and the second feature matrix to obtain a first scale feature map;
performing upsampling processing on the first weighted summation processing result to obtain a third feature matrix;
performing second weighted summation processing and multiple convolution processing on the first feature matrix and the third feature matrix to obtain a second scale feature map;
performing upsampling processing on the second weighted summation processing result to obtain a fourth feature matrix;
and performing third weighted summation processing and multiple convolution processing on the first feature matrix and the fourth feature matrix to obtain a third scale feature map.
According to an embodiment of the present invention, the step of removing the redundant prediction box by using Soft-NMS algorithm to obtain the target prediction box and the recognition result comprises:
calculating a score according to the confidence degree of the prediction frame, and selecting the prediction frame with the highest confidence degree score in all the prediction frames;
traversing the rest of the prediction boxes, and calculating the IOU values of the current prediction box and the prediction box with the highest confidence score;
comparing the IOU value with an IOU preset threshold value, and updating the confidence score of the current prediction frame according to the comparison result;
and comparing the updated confidence score of each prediction frame with a confidence score threshold, and reserving and determining the prediction frame higher than the confidence score threshold as a target prediction frame.
In order to solve the technical problem, the invention adopts another technical scheme that: provided is a traffic police gesture recognition device, including:
the acquisition and preprocessing module is used for acquiring images under various traffic environments and preprocessing the images;
the clustering module is used for clustering the preprocessed images based on a kmeans++ clustering algorithm to generate a plurality of anchor frames;
the building and training module is used for building a YOLOv3 deep learning neural network model and training the YOLOv3 deep learning neural network model by adopting a cross entropy loss function and an EIOU loss function to obtain an EIOU-YOLOv3 deep learning neural network model;
the characteristic extraction and detection module is used for extracting and detecting the traffic police attitude characteristic of the preprocessed image by utilizing the EIOU-YOLOv3 deep learning neural network model to obtain a plurality of characteristic graphs with different scales;
the prediction module is used for extracting and identifying the traffic police attitude feature on the feature map by utilizing the anchor frame to obtain a prediction frame;
and the screening module is used for removing the redundant prediction box by adopting a Soft-NMS algorithm to obtain a target prediction box and an identification result.
In order to solve the technical problems, the invention adopts another technical scheme that: there is provided a computer device comprising a memory and a processor connected to the memory, wherein the memory stores a computer program operable on the processor, and the processor implements the traffic police gesture recognition method when executing the computer program.
In order to solve the technical problems, the invention adopts another technical scheme that: there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described traffic police gesture recognition method.
The invention has the beneficial effects that: the traditional kmeans clustering algorithm is improved to the kmeans++ clustering algorithm, which improves the quality of anchor frame generation and the positioning precision of the traffic police detection frame; when training the YOLOv3 deep learning neural network model, the traditional regression loss is improved from a square loss to the EIOU loss, which greatly improves the regression precision of the prediction frame coordinates; and the NMS algorithm for removing duplicate prediction frames is improved to the Soft-NMS algorithm, which effectively alleviates missed detections and false detections of the traffic police detection frame. Together, these improvements effectively increase the overall accuracy and recall rate of traffic police gesture recognition.
Drawings
Fig. 1 is a flow chart illustrating a method for recognizing a traffic police gesture according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating step S101 according to the first embodiment of the present invention;
FIG. 3 is a flowchart illustrating step S102 according to the first embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an EIOU-YOLOv3 deep learning neural network model according to a first embodiment of the present invention;
FIG. 5 is a distribution diagram of the prediction box, the label box, and the minimum bounding rectangle enclosing both the prediction box and the label box according to the embodiment of the present invention;
FIG. 6 is a flowchart illustrating step S106 according to the first embodiment of the present invention;
FIG. 7 is a flow chart of a method for recognizing a traffic police gesture according to a second embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an EIOU-YOLOv3 deep learning neural network model according to a second embodiment of the present invention;
FIG. 9 is a flowchart illustrating step S706 according to the second embodiment of the present invention;
fig. 10 is a schematic structural diagram of a traffic police gesture recognition apparatus according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first", "second" and "third" in the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one of the feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise. All directional indicators (such as up, down, left, right, front, and rear … …) in the embodiments of the present invention are only used to explain the relative positional relationship between the components, the movement, and the like in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indicator is changed accordingly. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Fig. 1 is a flowchart illustrating a traffic police gesture recognition method according to a first embodiment of the present invention. It should be noted that the method of the present invention is not limited to the flow sequence shown in fig. 1 if the results are substantially the same. As shown in fig. 1, the method comprises the steps of:
step S101: and collecting images in various traffic environments and preprocessing the images.
In step S101, the traffic environment refers to traffic conditions in scenes such as sunny days, daytime, haze, rainy days and nighttime that include traffic police directing traffic on the road and maintaining traffic order; therefore, each image should at least contain the posture features of a traffic police officer. Preprocessing the images includes targeted screening of the images in different scenes: for example, since the traffic police posture recognition method mainly recognizes images in scenes such as haze, rain and night, images from these scenes are screened in a targeted manner. Preprocessing also includes labeling the traffic police in the images and classifying the images. In other preferred embodiments, the image preprocessing further includes denoising, sharpening, and screening for image quality.
Further, referring to fig. 2, step S101 further includes the following steps:
step S201: and collecting image sets under various traffic environments in real time.
Step S202: and selecting an image to be identified from the image set.
In step S202, the image set is screened, and images in the scenes of haze, rain and night are selected.
Step S203: and (4) carrying out labeling processing on the traffic police in the image to be recognized by using a labeling tool to obtain a labeling frame.
In step S203, a rectangular labeling frame is labeled on the positions of all traffic police in the image to be recognized by using a labeling tool.
Step S204: and randomly dividing the image to be identified after the labeling processing into a training set and a testing set according to a preset proportion.
In step S204, the training set is used for clustering and model training in the subsequent steps, and the test set is used for traffic police pose feature extraction and detection in the subsequent steps.
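For illustration, the random division of step S204 might be sketched as follows in Python; the 0.8 training proportion and the file-name list are assumed examples, since the preset proportion is left open here.

```python
import random

def split_dataset(labeled_images, train_ratio=0.8):
    """Randomly divide labeled images into a training set and a test set
    according to a preset proportion (0.8 is an assumed example value)."""
    shuffled = labeled_images[:]   # copy so the original list is untouched
    random.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = split_dataset(["img_%03d.jpg" % i for i in range(100)])
```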
Step S102: clustering is carried out on the preprocessed images based on a kmeans + + clustering algorithm to generate a plurality of anchor frames.
In step S102, compared with the traditional kmeans clustering algorithm, the kmeans++ clustering algorithm of the present embodiment selects the initial clustering centers in a roulette-wheel manner, which improves the quality of anchor frame generation and the positioning accuracy of the traffic police detection frame, thereby improving the overall accuracy and recall rate of traffic police gesture recognition.
Further, referring to fig. 3, step S102 further includes the following steps:
step S301: and randomly selecting a marking frame from the training set as an initial clustering center.
Step S302: and calculating the distance between each marking frame and the initial clustering center according to a preset distance formula, and selecting the next clustering center according to the distance calculation result.
In step S302, the preset distance formula is Dis = 1 − IOU, where Dis is the distance between the labeling frame and the initial clustering center and IOU is the intersection-over-union between the labeling frame and the initial clustering center:

IOU = I/U

where I is the area of the intersection of the labeling frame and the initial clustering center, and U is the area of their union. In this embodiment, when the IOU value is not lower than 0.5, the labeling frame is used as the next clustering center.
Step S303: and taking the next clustering center as an initial clustering center and repeatedly executing the step S302 until nine initial clustering centers are selected, and taking the initial clustering centers as anchor frames.
Step S103: and constructing a YOLOv3 deep learning neural network model, and training the YOLOv3 deep learning neural network model by adopting a cross entropy loss function and an EIOU loss function to obtain an EIOU-YOLOv3 deep learning neural network model.
In step S103, referring to fig. 4, the EIOU-YOLOv3 deep learning neural network model includes an input module 41, a feature extraction module 42 connected to the input module 41, and a first generation module 43, a second generation module 44 and a third generation module 45; the first generation module 43, the second generation module 44 and the third generation module 45 are sequentially connected and are all connected to the feature extraction module 42, and they respectively output three feature maps of different scales. When the YOLOv3 deep learning neural network model is trained, the regression loss is improved from the square loss to the EIOU loss, which greatly improves the regression precision of the prediction frame coordinates and thereby improves the overall accuracy and recall rate of traffic police posture recognition.
Specifically, the total loss function for training the YOLOv3 deep learning neural network model is composed of a regression loss and a classification loss, and is calculated according to the following formula: loss = loss_reg + loss_cls, where loss is the total loss function, loss_reg is the regression loss, and loss_cls is the classification loss. Further, loss_reg is calculated according to the following formulas:

loss_reg = 1 − EIOU

EIOU = IOU − α·(d²/c²)

IOU = I/U

where IOU is the intersection-over-union of the prediction frame and the labeling frame, I is the area of the intersection between the prediction frame and the labeling frame, U is the area of the union between the prediction frame and the labeling frame, α is an attenuation coefficient with a value range of 0.5–1 (preferably, α = 0.9), d is the distance between the center points of the prediction frame and the labeling frame, and c is the length of the diagonal of the minimum bounding rectangle enclosing both the prediction frame and the labeling frame. As shown in FIG. 5, P denotes the prediction frame, T denotes the labeling frame, and C denotes the minimum bounding rectangle enclosing both.
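For illustration, the regression loss above might be computed as in the following Python sketch; the (x1, y1, x2, y2) corner format of the boxes is an assumed convention, and the sketch follows the reconstruction EIOU = IOU − α·(d²/c²) given above.

```python
def eiou_loss(pred, target, alpha=0.9):
    """loss_reg = 1 - EIOU, with EIOU = IOU - alpha * d^2 / c^2.
    pred and target are boxes in (x1, y1, x2, y2) corner format."""
    # I and U: intersection and union areas of the two boxes
    iw = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    ih = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    iou = inter / (area_p + area_t - inter)

    # d^2: squared distance between the two box centers
    dx = (pred[0] + pred[2] - target[0] - target[2]) / 2.0
    dy = (pred[1] + pred[3] - target[1] - target[3]) / 2.0
    d2 = dx * dx + dy * dy

    # c^2: squared diagonal of the minimum rectangle enclosing both boxes
    cw = max(pred[2], target[2]) - min(pred[0], target[0])
    ch = max(pred[3], target[3]) - min(pred[1], target[1])
    c2 = cw * cw + ch * ch

    return 1.0 - (iou - alpha * d2 / c2)

print(eiou_loss((10, 10, 50, 90), (12, 8, 55, 95)))
```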
Step S104: and performing traffic police attitude feature extraction and detection on the preprocessed image by using an EIOU-YOLOv3 deep learning neural network model to obtain a plurality of feature maps with different scales.
In step S104, the preprocessed image is converted to a preset size and input into a DarkNet53 network, which performs traffic police posture feature extraction on the size-converted image; a plurality of feature maps with different scales are then obtained based on the extraction result.
The preprocessed image may be of any size: for an image of arbitrary size P × Q, the image is resized before input, scaling it to the preset size M × N while keeping the aspect ratio unchanged. The DarkNet53 network comprises 52 convolutional layers and 1 fully connected layer; the 52 convolutional layers perform traffic police posture feature extraction on the input image, the fully connected layer outputs a feature map matrix, and feature maps with different scales are obtained based on the output of the DarkNet53 network.
Step S105: and extracting and identifying the posture features of the traffic police on the feature map by using the anchor frame to obtain a prediction frame.
In step S105, the nine anchor frames detect and identify traffic police postures on the three obtained feature maps of different scales; each feature map predicts the coordinates (i.e., the traffic police coordinates) and categories (i.e., whether the target is a traffic police officer) of three different anchor frames, and the prediction results are the prediction frames. The prediction frame information includes the prediction frame coordinates and a confidence.
Step S106: and removing redundant prediction frames by adopting a Soft-NMS algorithm to obtain a target prediction frame and an identification result.
In step S106, a Soft-NMS algorithm is used to perform the non-maximum suppression operation: when two targets are close and the intersection-over-union of their prediction frames is greater than or equal to a preset threshold, the score of the prediction frame with the lower confidence score is reduced rather than discarded, so that it remains in the sorted list for secondary screening; finally, the prediction frames with scores higher than the confidence score threshold are determined to be target prediction frames.
Further, referring to fig. 6, step S106 further includes the following steps:
step S601: and calculating a score according to the confidence degree of the prediction frames, and selecting the prediction frame with the highest confidence degree score in all the prediction frames.
Step S602: and traversing the rest of the prediction boxes, and calculating the IOU values of the current prediction box and the prediction box with the highest confidence score.
In step S602, IOU is equal to I/U, I indicates the area of the intersection of the current prediction frame and the prediction frame with the highest confidence score, and U indicates the area of the union of the current prediction frame and the prediction frame with the highest confidence score.
Step S603: and comparing the IOU value with a preset threshold of the IOU, and updating the confidence score of the current prediction frame according to the comparison result.
In step S603, when the IOU is smaller than the IOU threshold, the current prediction box and its confidence score are retained, and when the IOU is greater than or equal to the IOU threshold, the confidence score of the current prediction box is updated. More specifically, the step of updating the confidence score of the current prediction box according to the comparison result is performed according to the following formula:
score = score, if IOU < IOU_threshold

score = score × (1 − IOU), if IOU ≥ IOU_threshold

where IOU is the intersection ratio of the current prediction box and the prediction box with the highest confidence score, IOU_threshold is the IOU threshold (preferably 0.5), and score is the confidence score of the current prediction box. In this step, steps S601–S603 are repeatedly executed until the confidence scores of all the prediction boxes have been updated; when step S601 is repeatedly executed, "all the prediction boxes" refers to the set of current prediction boxes and their confidence scores retained when the IOU was smaller than the IOU threshold.
Step S604: and comparing the updated confidence score of each prediction frame with a confidence score threshold, and reserving and determining the prediction frame higher than the confidence score threshold as the target prediction frame.
In step S604, the confidence score threshold is preferably 0.45, and the prediction frame after updating the confidence score is further filtered, so that the overall accuracy and recall rate of the traffic police gesture recognition can be further improved.
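For illustration, steps S601–S604 might be sketched in Python as follows; the linear decay mirrors the update formula above, the thresholds use the preferred values (IOU threshold 0.5, confidence score threshold 0.45), and the corner-format boxes and function names are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, iou_threshold=0.5, score_threshold=0.45):
    """Linear Soft-NMS over steps S601-S604; returns the target prediction boxes."""
    candidates = sorted(zip(boxes, scores), key=lambda bs: bs[1], reverse=True)
    kept = []
    while candidates:
        best, best_score = candidates.pop(0)   # S601: highest-confidence box
        kept.append((best, best_score))
        survivors = []
        for box, score in candidates:          # S602: traverse the rest
            overlap = iou(best, box)
            if overlap >= iou_threshold:       # S603: decay instead of discarding
                score *= 1.0 - overlap
            if score > score_threshold:        # S604: keep boxes above threshold
                survivors.append((box, score))
        # surviving boxes are re-sorted and screened again in the next pass
        candidates = sorted(survivors, key=lambda bs: bs[1], reverse=True)
    return kept
```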
According to the traffic police gesture recognition method of the first embodiment, the traditional kmeans clustering algorithm is improved to the kmeans++ clustering algorithm, improving the quality of anchor frame generation and the positioning precision of the traffic police detection frame; when training the YOLOv3 deep learning neural network model, the traditional regression loss is improved from a square loss to the EIOU loss, greatly improving the regression precision of the prediction frame coordinates; and the NMS algorithm for removing duplicate prediction frames is improved to the Soft-NMS algorithm, effectively alleviating missed detections and false detections of the traffic police detection frame. The overall accuracy and recall rate of traffic police gesture recognition are thereby effectively improved.
Fig. 7 is a flowchart illustrating a traffic police gesture recognition method according to a second embodiment of the present invention. It should be noted that the method of the present invention is not limited to the flow sequence shown in fig. 7 if the results are substantially the same. As shown in fig. 7, the method includes the steps of:
step S701: and collecting images in various traffic environments and preprocessing the images.
In this embodiment, step S701 in fig. 7 is similar to step S101 in fig. 1, and for brevity, is not described herein again.
Step S702: clustering is carried out on the preprocessed images based on a kmeans + + clustering algorithm to generate a plurality of anchor frames.
In this embodiment, step S702 in fig. 7 is similar to step S102 in fig. 1, and for brevity, is not repeated herein.
Step S703: and constructing a YOLOv3 deep learning neural network model, and improving the residual connecting structure of the YOLOv3 deep learning neural network model into triple weighted summation processing from two times of splicing.
In step S703, the residual connection structure of the YOLOv3 deep learning neural network model is improved from two splicing operations to three weighted summation operations, which effectively reduces information loss and improves the integrity of the extracted features, thereby improving the overall accuracy and recall rate of traffic police gesture recognition.
Step S704: and calculating the sum of the cross entropy loss function and the EIOU loss function to obtain a total loss function.
In step S704, the total loss function is composed of two parts, a regression loss and a classification loss; specifically, loss = loss_reg + loss_cls, where loss is the total loss function, loss_reg is the regression loss, and loss_cls is the classification loss. Further, loss_reg is calculated according to the following formulas:

loss_reg = 1 − EIOU

EIOU = IOU − α·(d²/c²)

where IOU is the intersection-over-union of the prediction frame and the labeling frame (IOU = I/U, I being the area of the intersection between the prediction frame and the labeling frame and U the area of their union), α is an attenuation coefficient with a value range of 0.5–1 (preferably, α = 0.9), d is the distance between the center points of the prediction frame and the labeling frame, and c is the length of the diagonal of the minimum bounding rectangle enclosing both the prediction frame and the labeling frame, as shown in FIG. 5.
Step S705: and training the improved Yolov3 deep learning neural network model by adopting a total loss function to obtain an EIOU-Yolov3 deep learning neural network model.
In step S705, referring to fig. 8, the EIOU-YOLOv3 deep learning neural network model includes an input module 81, a feature extraction module 82 connected to the input module 81, and a first generation module 83, a second generation module 84 and a third generation module 85, where the first generation module 83, the second generation module 84 and the third generation module 85 are sequentially connected and are all connected to the feature extraction module 82. Each generation module comprises a plurality of convolutional layers, an upsampling layer, a weighted summation layer and an output layer. Specifically, the first generation module 83 includes a first convolution layer 831, a first upsampling layer 832, a first weighted summation layer 833, a second convolution layer 834 and a first output layer 835 connected in sequence; the second generation module 84 includes a third convolution layer 841, a second upsampling layer 842, a second weighted summation layer 843, a fourth convolution layer 844 and a second output layer 845 connected in sequence; and the third generation module 85 includes a fifth convolution layer 851, a third upsampling layer 852, a third weighted summation layer 853, a sixth convolution layer 854 and a third output layer 855 connected in sequence. The first weighted summation layer 833, the second weighted summation layer 843 and the third weighted summation layer 853 are further respectively connected to the feature extraction module 82, the second convolution layer 834 is connected to the third convolution layer 841, the fourth convolution layer 844 is connected to the fifth convolution layer 851, and the first output layer 835, the second output layer 845 and the third output layer 855 respectively output feature maps of different scales.
Step S706: and performing traffic police attitude feature extraction and detection on the preprocessed image by using an EIOU-YOLOv3 deep learning neural network model to obtain a plurality of feature maps with different scales.
In step S706, please refer to fig. 9, which further includes the following steps:
step S901: and converting the size of the preprocessed image into a preset size.
In step S901, the preprocessed image may be of any size; an image of arbitrary size P × Q is resized before input, scaling it to the preset size M × N while keeping the aspect ratio unchanged.
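For illustration, such an aspect-ratio-preserving resize (letterboxing) might be sketched as follows; the 416 × 416 target size and the gray padding value are conventional YOLOv3 choices assumed here, not values fixed by this embodiment.

```python
import cv2
import numpy as np

def letterbox(image, target_h=416, target_w=416, pad_value=128):
    """Scale a P x Q image to the preset M x N size while keeping the
    aspect ratio unchanged; the leftover border is filled with padding."""
    h, w = image.shape[:2]
    scale = min(target_h / h, target_w / w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(image, (nw, nh))   # cv2.resize takes (width, height)
    canvas = np.full((target_h, target_w, 3), pad_value, dtype=np.uint8)
    top, left = (target_h - nh) // 2, (target_w - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    return canvas
```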
Step S902: and performing traffic police posture feature extraction on the image after the size conversion by adopting a DarkNet53 network.
In step S902, the DarkNet53 network includes 52 convolutional layers and 1 fully connected layer, where the 52 convolutional layers perform traffic police posture feature extraction on the image to obtain a feature map matrix, and the fully connected layer outputs the feature map matrix.
Step S903: and performing up-sampling processing, weighted summation processing and multiple convolution processing on the extracted result of the attitude feature of the traffic police to obtain a plurality of feature maps with different scales.
In step S903, the upsampling process doubles the size of the traffic police posture feature extraction result while the number of channels remains unchanged. The weighted summation process requires that the two input matrices have the same size, and it changes neither the size nor the number of channels. Multiple convolutions further extract features and improve feature precision. In the weighted summation process, the upsampled output matrix is weighted and summed with the corresponding Block in the DarkNet53 network. A Block is the feature map matrix of a certain intermediate layer of the DarkNet53 network, chosen mainly to match the size of the upsampled matrix; otherwise the weighted summation operation cannot be carried out. For example, suppose the upsampled output matrix (i.e., one input of the weighted summation) has size 26 × 26 × 128, and suppose the output matrices of the 120th, 130th and 140th layers of the DarkNet53 network have sizes 13 × 13 × 128, 26 × 26 × 128 and 26 × 26 × 256, respectively; then the other input of the weighted summation can only be the output matrix of the 130th layer (the two feature map matrices must have the same size, both 26 × 26 × 128), not the feature map matrices of the 120th or 140th layers.
In addition, in the weighted summation process, the weight of upsampling is preferably 0.6, the weight of the feature extraction result is preferably 0.4, and in other embodiments, other weights may be configured.
Specifically, first, a first feature matrix output by a DarkNet53 network is subjected to upsampling processing to obtain a second feature matrix; then, carrying out first weighted summation processing and multiple convolution processing on the first characteristic matrix and the second characteristic matrix to obtain a first scale characteristic diagram; then, performing upsampling processing on the first weighted summation processing result to obtain a third feature matrix; then, carrying out second weighted summation processing and multiple convolution processing on the first characteristic matrix and the third characteristic matrix to obtain a second scale characteristic diagram; then, performing upsampling processing on the result of the second weighted summation processing to obtain a fourth feature matrix; and finally, carrying out third weighted summation processing and multiple convolution processing on the first feature matrix and the fourth feature matrix to obtain a third scale feature map.
More specifically, as shown in fig. 8, the first feature matrix output by the DarkNet53 network undergoes a CBL operation (3 × 3 convolution + batch normalization + Leaky ReLU activation), outputting a 3 × 3 × 10 matrix (size 3 × 3, 10 channels); an Upsample operation then doubles its size, giving a 6 × 6 × 10 matrix; a matrix in the DarkNet53 network whose size matches the upsampled output is then selected for Sum (weighted summation), outputting a matrix with unchanged size and channel count; this matrix is processed by CBL five times to further extract features, and then by Conv (1 × 1 convolution) to obtain the first scale feature map y1. The processing flows for generating the second scale feature map y2 and the third scale feature map y3 are similar to that of the first scale feature map; note that when generating the second scale feature map, the input of the upsampling is the weighted summation result used to generate the first scale feature map, and when generating the third scale feature map, the input of the upsampling is the weighted summation result used to generate the second scale feature map. The remaining steps are the same and are not repeated here. In this embodiment, the first, second and third scale feature maps y1, y2 and y3 are 6 × 6 × 10, 12 × 12 × 10 and 24 × 24 × 10, respectively.
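For illustration, the Upsample and Sum operations described above might be sketched as follows; nearest-neighbour upsampling and the NumPy array layout are assumptions, while the 0.6/0.4 weights follow the preferred values given earlier.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour upsampling: doubles height and width, keeps channels."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def weighted_sum(up_feat, block_feat, w_up=0.6, w_block=0.4):
    """Sum layer: weighted summation of the upsampled matrix with a
    same-sized DarkNet53 Block; size and channel count are unchanged."""
    assert up_feat.shape == block_feat.shape, "inputs must match in size"
    return w_up * up_feat + w_block * block_feat

head = np.random.rand(13, 13, 128).astype(np.float32)   # upsampling input
block = np.random.rand(26, 26, 128).astype(np.float32)  # matching Block output
fused = weighted_sum(upsample2x(head), block)           # result: 26 x 26 x 128
```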
Step S707: and extracting and identifying the posture features of the traffic police on the feature map by using the anchor frame to obtain a prediction frame.
In this embodiment, step S707 in fig. 7 is similar to step S105 in fig. 1, and for brevity, is not described herein again.
Step S708: and removing redundant prediction frames by adopting a Soft-NMS algorithm to obtain a target prediction frame and an identification result.
In this embodiment, step S708 in fig. 7 is similar to step S106 in fig. 1, and for brevity, is not described herein again.
On the basis of the first embodiment, the traffic police gesture recognition method of the second embodiment of the invention improves the original residual connection structure from two splicing operations to three weighted summation operations, effectively reducing information loss, improving the integrity of the extracted features, and further improving the overall accuracy and recall rate of traffic police gesture recognition.
Fig. 10 is a schematic structural diagram of a traffic police gesture recognition apparatus according to an embodiment of the present invention. As shown in fig. 10, the apparatus 100 includes an acquisition and preprocessing module 101, a clustering module 102, a construction and training module 103, a feature extraction and detection module 104, a prediction module 105, and a screening module 106.
The acquisition and preprocessing module 101 is used for acquiring images in various traffic environments and preprocessing the images;
the clustering module 102 is configured to perform clustering on the preprocessed images based on a kmeans++ clustering algorithm to generate a plurality of anchor frames;
the building and training module 103 is used for building a YOLOv3 deep learning neural network model and training the YOLOv3 deep learning neural network model by adopting a cross entropy loss function and an EIOU loss function to obtain an EIOU-YOLOv3 deep learning neural network model;
the feature extraction and detection module 104 is configured to perform traffic police posture feature extraction and detection on the preprocessed image by using the EIOU-YOLOv3 deep learning neural network model to obtain a plurality of feature maps with different scales;
the prediction module 105 is used for extracting and identifying the traffic police attitude feature on the feature map by using the anchor frame to obtain a prediction frame;
and the screening module 106 is used for removing redundant prediction frames by adopting a Soft-NMS algorithm to obtain target prediction frames and identification results.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 11, the computer device 110 includes a processor 111 and a memory 112 coupled to the processor 111.
The memory 112 stores program instructions for implementing the method of traffic police gesture recognition as described in any of the above embodiments.
The processor 111 is operative to execute the program instructions stored in the memory 112 to recognize traffic police gestures.
The processor 111 may also be referred to as a Central Processing Unit (CPU). The processor 111 may be an integrated circuit chip having signal processing capabilities. The processor 111 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention. The computer-readable storage medium of the embodiment of the present invention stores a program file 121 capable of implementing all the methods described above; the program file 121 may be stored in the computer-readable storage medium in the form of a software product and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned computer-readable storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, as well as terminal devices such as a computer, a server, a mobile phone or a tablet.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A traffic police gesture recognition method is characterized by comprising the following steps:
collecting images under various traffic environments and preprocessing the images;
clustering the preprocessed images based on a kmeans++ clustering algorithm to generate a plurality of anchor frames;
constructing a YOLOv3 deep learning neural network model, and training the YOLOv3 deep learning neural network model by adopting a cross entropy loss function and an EIOU loss function to obtain an EIOU-YOLOv3 deep learning neural network model;
performing traffic police attitude feature extraction and detection on the preprocessed image by using the EIOU-YOLOv3 deep learning neural network model to obtain a plurality of feature maps with different scales;
extracting and identifying the traffic police attitude feature on the feature map by using the anchor frame to obtain a prediction frame;
and removing the redundant prediction box by adopting a Soft-NMS algorithm to obtain a target prediction box and an identification result.
2. The method of claim 1, wherein the step of collecting and pre-processing images of multiple traffic environments comprises:
collecting image sets under various traffic environments in real time;
selecting an image to be identified from the image set;
marking the traffic police in the image to be identified by using a marking tool to obtain a marking frame;
and randomly dividing the image to be identified after the labeling processing into a training set and a testing set according to a preset proportion.
3. The method according to claim 2, wherein the step of clustering the preprocessed images based on a kmeans++ clustering algorithm to generate a plurality of anchor boxes comprises:
randomly selecting one marking frame from the training set as an initial clustering center;
calculating the distance between each labeling frame and the initial clustering center according to a preset distance formula, and selecting the next clustering center according to the distance calculation result;
and taking the next clustering center as an initial clustering center, repeatedly executing the step of calculating the distance between each marking frame and the initial clustering center according to a preset distance formula, and selecting the next clustering center according to the distance calculation result until nine initial clustering centers are selected, and taking the initial clustering centers as anchor frames.
4. The method for recognizing the traffic police gesture according to claim 1, wherein the step of constructing a YOLOv3 deep learning neural network model and training the YOLOv3 deep learning neural network model by using a cross entropy loss function and an EIOU loss function to obtain the EIOU-YOLOv3 deep learning neural network model comprises:
constructing a YOLOv3 deep learning neural network model, and improving the residual connection structure of the YOLOv3 deep learning neural network model from two splicing operations to three weighted summation operations;
calculating the sum of the cross entropy loss function and the EIOU loss function to obtain a total loss function;
and training the improved YOLOv3 deep learning neural network model by adopting the total loss function to obtain an EIOU-YOLOv3 deep learning neural network model.
5. The method of claim 4, wherein the step of performing traffic police pose feature extraction and detection on the preprocessed image by using the EIOU-YOLOv3 deep learning neural network model to obtain a plurality of feature maps with different scales comprises:
converting the size of the preprocessed image into a preset size;
performing traffic police attitude feature extraction on the image after the size conversion by adopting a DarkNet53 network;
and performing up-sampling processing, weighted summation processing and multiple convolution processing on the extracted result of the attitude feature of the traffic police to obtain a plurality of feature maps with different scales.
6. The method according to claim 5, wherein the step of performing up-sampling, weighted summation, and multiple convolution on the extracted features of the traffic posture to obtain a plurality of feature maps with different scales comprises:
performing upsampling processing on a first feature matrix output by the DarkNet53 network to obtain a second feature matrix;
performing first weighted summation processing and multiple convolution processing on the first feature matrix and the second feature matrix to obtain a first scale feature map;
performing upsampling processing on the first weighted summation processing result to obtain a third feature matrix;
performing second weighted summation processing and multiple convolution processing on the first feature matrix and the third feature matrix to obtain a second scale feature map;
performing upsampling processing on the second weighted summation processing result to obtain a fourth feature matrix;
and performing third weighted summation processing and multiple convolution processing on the first feature matrix and the fourth feature matrix to obtain a third scale feature map.
7. The method according to claim 1, wherein the step of removing redundant prediction boxes by using Soft-NMS algorithm to obtain target prediction boxes and recognition results comprises:
calculating a score according to the confidence degree of the prediction frame, and selecting the prediction frame with the highest confidence degree score in all the prediction frames;
traversing the rest of the prediction boxes, and calculating the IOU values of the current prediction box and the prediction box with the highest confidence score;
comparing the IOU value with an IOU preset threshold value, and updating the confidence score of the current prediction frame according to the comparison result;
and comparing the updated confidence score of each prediction frame with a confidence score threshold, and reserving and determining the prediction frame higher than the confidence score threshold as a target prediction frame.
8. A traffic police gesture recognition apparatus, comprising:
an acquisition and preprocessing module, configured to acquire images in various traffic environments and preprocess the images;
a clustering module, configured to cluster the preprocessed images based on the kmeans++ clustering algorithm to generate a plurality of anchor boxes (a minimal sketch of this step follows the claim);
a construction and training module, configured to construct a YOLOv3 deep learning neural network model and train it with a cross-entropy loss function and an EIOU loss function to obtain an EIOU-YOLOv3 deep learning neural network model;
a feature extraction and detection module, configured to perform traffic police posture feature extraction and detection on the preprocessed image using the EIOU-YOLOv3 deep learning neural network model to obtain a plurality of feature maps of different scales;
a prediction module, configured to perform traffic police posture feature extraction and recognition on the feature maps using the anchor boxes to obtain prediction boxes;
and a screening module, configured to remove redundant prediction boxes using the Soft-NMS algorithm to obtain a target prediction box and a recognition result.
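As a companion to the clustering module above, a minimal sketch of anchor-box generation with k-means++ seeding via scikit-learn. The anchor count `k=9` (three per detection scale, the usual YOLOv3 convention), the Euclidean distance metric, and the helper name are assumptions; YOLO implementations often cluster with an IoU-based distance instead.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_anchor_boxes(box_wh, k=9, seed=0):
    # box_wh: (N, 2) array of ground-truth (width, height) pairs in pixels.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=seed)
    km.fit(box_wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]  # sorted by box area

# Stand-in data in place of widths/heights measured from the annotated images:
wh = np.random.default_rng(0).uniform(10, 300, size=(500, 2))
print(generate_anchor_boxes(wh, k=9))
```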
9. A computer device comprising a memory and a processor connected to the memory, the memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the traffic police gesture recognition method according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the traffic police gesture recognition method according to any one of claims 1 to 7.
CN202011132576.5A 2020-10-21 2020-10-21 Traffic police gesture recognition method, device, equipment and storage medium Pending CN112215188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011132576.5A CN112215188A (en) 2020-10-21 2020-10-21 Traffic police gesture recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011132576.5A CN112215188A (en) 2020-10-21 2020-10-21 Traffic police gesture recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112215188A (en) 2021-01-12

Family

ID=74056312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011132576.5A Pending CN112215188A (en) 2020-10-21 2020-10-21 Traffic police gesture recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112215188A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113191335A (en) * 2021-05-31 2021-07-30 Jingdezhen Ceramic University Household ceramic type identification method based on deep learning
CN113553936A (en) * 2021-07-19 2021-10-26 Hebei University of Engineering Mask wearing detection method based on improved YOLOv3

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020173036A1 (en) * 2019-02-26 2020-09-03 Bozhon Precision Industry Technology Co., Ltd. Localization method and system based on deep learning
CN110135267A (en) * 2019-04-17 2019-08-16 University of Electronic Science and Technology of China A fine target detection method for large-scene SAR images
CN111222474A (en) * 2020-01-09 2020-06-02 University of Electronic Science and Technology of China Method for detecting small targets in high-resolution images at arbitrary scales
CN111652321A (en) * 2020-06-10 2020-09-11 Jiangsu University of Science and Technology Offshore ship detection method based on an improved YOLOV3 algorithm

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Yao Wanye; Feng Taoming: "Research on transformer positioning detection based on improved YOLOv3", Electric Power Science and Engineering, no. 08 *
Xu Jindou: "Object detection in aerial images based on deep learning", China Master's Theses Full-text Database, Information Science and Technology, no. 7, pages 138-935 *
Ma Jian; Shi Wenxu; Bao Shengli: "Ship target detection in remote sensing images based on feature-fusion SSD", Journal of Computer Applications, no. 2 *

Similar Documents

Publication Title
CN107944450B (en) License plate recognition method and device
CN110378297B (en) Remote sensing image target detection method and device based on deep learning and storage medium
CN112380921A (en) Road detection method based on Internet of vehicles
CN110348384B (en) Small target vehicle attribute identification method based on feature fusion
CN113936256A (en) Image target detection method, device, equipment and storage medium
CN112016467B (en) Traffic sign recognition model training method, recognition method, system, device and medium
CN114359851A (en) Unmanned target detection method, device, equipment and medium
WO2022141962A1 (en) Invasion detection method and apparatus, device, storage medium, and program product
WO2020258077A1 (en) Pedestrian detection method and device
CN112287983B (en) Remote sensing image target extraction system and method based on deep learning
CN111985374A (en) Face positioning method and device, electronic equipment and storage medium
CN114495029A (en) Traffic target detection method and system based on improved YOLOv4
CN112215188A (en) Traffic police gesture recognition method, device, equipment and storage medium
JP2022166799A (en) Method for identifying edge based on deep learning
CN113158954B (en) Automatic detection method for zebra crossing region based on AI technology in traffic offsite
CN113205510B (en) Railway intrusion foreign matter detection method, device and terminal
CN111753610A (en) Weather identification method and device
CN112241736A (en) Text detection method and device
CN113378837A (en) License plate shielding identification method and device, electronic equipment and storage medium
US20240037911A1 (en) Image classification method, electronic device, and storage medium
CN112288702A (en) Road image detection method based on Internet of vehicles
CN111709377A (en) Feature extraction method, target re-identification method and device and electronic equipment
CN116543333A (en) Target recognition method, training method, device, equipment and medium of power system
CN114724128B (en) License plate recognition method, device, equipment and medium
CN114927236A (en) Detection method and system for multiple target images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination