CN115601717B - Deep learning-based traffic offence behavior classification detection method and SoC chip


Info

Publication number
CN115601717B
CN115601717B (application CN202211280838.1A)
Authority
CN
China
Prior art keywords
feature map
traffic
feature
module
model
Prior art date
Legal status
Active
Application number
CN202211280838.1A
Other languages
Chinese (zh)
Other versions
CN115601717A (en)
Inventor
王嘉诚
张少仲
张栩
Current Assignee
Zhongcheng Hualong Computer Technology Co Ltd
Original Assignee
Zhongcheng Hualong Computer Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhongcheng Hualong Computer Technology Co Ltd
Priority to CN202211280838.1A
Publication of CN115601717A
Application granted
Publication of CN115601717B
Status: Active

Classifications

    • G06V20/54 Surveillance or monitoring of activities of traffic, e.g. cars on the road, trains or boats
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/764 Recognition or understanding using classification, e.g. of video objects
    • G06V10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 Recognition or understanding using neural networks
    • G06V10/955 Image or video understanding using specific electronic processors
    • G06V20/40 Scenes; scene-specific elements in video content
    • G06V20/625 License plates
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V2201/07 Target detection
    • G06V2201/08 Detecting or categorising vehicles
    • Y02T10/40 Engine management systems (climate change mitigation in road transport)

Abstract

The application discloses a deep learning-based traffic offence classification detection method and an SoC chip, belonging to the technical field of computer vision. A general-purpose processor and a plurality of neural network processors of different types are integrated on the SoC chip. Real-time video is collected, and the road surface, traffic indication signs, signal lights and traffic participants in the video are marked by a classification neural network model; the video frames are then classified according to the type of traffic participant and fed into different traffic violation detection models for violation detection and identification of the offending object. Compared with classifying, labeling and identifying all types of traffic participants with a single neural network model, training a different algorithm model for each type of road traffic participant, with a neural network model suited to the task of each stage, effectively reduces the algorithm complexity of the traffic violation detection models, improves the overall detection efficiency, and fully meets the real-time requirements of road traffic violation detection.

Description

Deep learning-based traffic offence behavior classification detection method and SoC chip
Technical Field
The application belongs to the technical field of computer vision, and particularly relates to a traffic offence behavior classification detection method based on deep learning and an SoC chip.
Background
In recent years, with rising living standards, car ownership in China has grown steadily and more and more vehicles travel on the roads. In complex road traffic environments, vehicle violations occur frequently and seriously threaten road safety and the personal and property safety of road users. For motor vehicle traffic violations, fixed electronic monitoring snapshot systems and mobile snapshot road traffic monitoring devices already exist, but the captured images are screened manually, which achieves poor efficiency and increases labor costs. With the development of computer multimedia and image processing technology in recent years, video-based violation judgment plays an ever larger role in intelligent transportation, and research investment from all quarters keeps growing. As road monitoring receives increasing attention in China, video detection has become the most important information acquisition means in the intelligent transportation field, and comprehensive evaluation shows that applying it to road traffic violation detection is highly feasible.
In order to solve the problems of low efficiency and high labor cost in the manual screening of vehicle traffic violations, Chinese patent application CN113177443A discloses a method for intelligently identifying road traffic violations based on image vision, comprising: modeling and pre-training; lane line detection and classification and lane number determination; detection of pedestrians, zebra crossings, traffic lights and bus lane markings; vehicle detection and tracking; and traffic violation judgment. Using a monocular plane camera and edge computing, a deep learning method performs target detection and classification on a series of captured frame images, and logic rules then judge whether a detected vehicle has committed a traffic violation.
Meanwhile, pedestrian traffic violations are also serious. Compared with vehicle traffic violations, pedestrian violations clearly lack supervision measures; pedestrians walk with great randomness, change direction frequently, and easily get into danger, so supervision of pedestrians participating in road traffic is also urgent. Traffic violation detection cannot be limited to vehicles and should also cover pedestrians.
Chinese patent application CN112528759A discloses a computer-vision-based traffic offence detection method that, in addition to vehicle violations, detects the violations of other road traffic participants. The scheme is as follows: computer vision technology performs real-time target recognition of pedestrians, different types of motor vehicles, traffic lights, traffic lanes, zebra crossings, license plates, car logos and other objects in the traffic scene, while detecting and collecting statistics in real time on information such as vehicle speed and traffic flow, assisting traffic supervisors in monitoring behaviors that violate road traffic regulations, such as pedestrians running red lights, motor vehicles running red lights, pedestrians entering roads closed to pedestrian traffic, and motor vehicles speeding. However, this scheme detects the violations of all traffic participants in the scene with a single deep learning algorithm model, and suffers from high algorithm complexity, insufficient detection precision and low detection efficiency.
Disclosure of Invention
The application provides a deep learning-based traffic violation classification detection method and an SoC chip, aiming to solve the problems of high algorithm complexity, insufficient detection precision and low detection efficiency.
To solve these technical problems, the method first classifies the pictures acquired in the traffic scene according to the different traffic participants, and then inputs the classified pictures into the corresponding neural network models to detect and judge traffic violations. The specific scheme is as follows:
The deep learning-based traffic violation classification detection method comprises the following steps:
S1: acquire traffic scene video through imaging equipment and decompose it into a number of consecutive video frames.
S2: input the video frames into a trained classification neural network model and mark the road surface, traffic indication signs, signal lights and traffic participants in the video frames. The classification neural network model adopts an improved YOLOv5s neural network model, which reduces the number of neural network layers and the number of channels per layer, and improves the spatial pyramid pooling structure so that the feature map passes through three maximum pooling layers in sequence.
S3: the classification neural network model classifies the video frames according to the traffic participants, labeling video frames containing vehicles as vehicle feature maps and video frames containing pedestrians as pedestrian feature maps.
S4: input the vehicle feature map into a trained vehicle traffic violation detection model and judge vehicle traffic violations; if a traffic violation is found, perform license plate recognition and generate a vehicle traffic violation image for subsequent processing. The vehicle traffic violation detection model comprises a vehicle violation judgment algorithm and a license plate recognition model: the violation judgment algorithm judges whether the vehicle is in violation according to the vehicle's driving trajectory, the road surface markings, the traffic indication signs and the signal light state, and the license plate recognition model adopts an SVM model that recognizes the license plate through a license plate positioning module, a license plate character segmentation module and a license plate character recognition module.
S5: input the pedestrian feature map into a trained pedestrian traffic violation detection model and judge pedestrian traffic violations; if a traffic violation is found, perform face recognition and generate a pedestrian traffic violation image for subsequent processing. The pedestrian traffic violation detection model comprises a violation judgment algorithm and a face recognition model: the violation judgment algorithm judges whether the pedestrian is in violation according to the pedestrian's movement trajectory, the road surface markings, the traffic indication signs and the signal light state, and the face recognition model adopts a simplified RetinaFace model, which further comprises: a trunk feature extraction network, an FPN feature pyramid network, an SSH feature enhancement network and a face target output end.
Preferably, the improved YOLOv5s neural network model comprises:
an input end, which uses the Mosaic method, an adaptive anchor frame calculation method and an adaptive picture scaling method to output a three-channel RGB feature map A with a resolution of 640×640;
a feature extraction network, which receives the feature map A and extracts features from it through convolutional networks for subsequent target detection; the feature map A passes in sequence through a Focus module, a CBL module, a CSP1 module, a CBL module and a CSP1 module of the feature extraction network and outputs a feature map B; the feature map B passes in sequence through a CBL module and a CSP1 module and outputs a feature map C; the feature map C passes in sequence through a CBL module, an SPPF module, a CSP2 module and a CBL module and outputs a feature map D;
a feature fusion network, which mixes and combines the feature maps of different stages and transmits the fused feature maps to the output end for prediction; in the feature fusion network, the feature maps B, C and D are each fused with feature maps from different stages of the network to generate feature maps F, G and H of different sizes for target detection at the output end;
an output end, which applies a 3×3 convolution to each of the feature maps F, G and H and outputs three target detection results of different sizes through a classification loss function and a regression loss function, wherein the classification loss function adopts a cross entropy loss function and the regression loss function adopts CIOU_Loss.
Preferably, the training of the improved YOLOv5s neural network model specifically comprises the following steps:
S2-1: establish a traffic scene data set and divide it into a training set and a verification set according to a set proportion.
S2-2: input the training set in sequence into the input end, the feature extraction network, the feature fusion network and the output end of the improved YOLOv5s neural network, predict the positions and classifications of the road surface, traffic indication signs, signal lights and traffic participants in the training set, and output the prediction results, which are compared against the verification set for verification.
S2-3: repeat step S2-2 until the set number of training iterations is reached, and save the last trained model as the trained classification neural network model.
Preferably, the Mosaic data enhancement operation at the input end splices four pictures by random scaling, random cropping and random arrangement; the adaptive anchor frame calculation adaptively computes the optimal anchor frame values for different training sets, computes the best possible recall for the default anchor frames and, if the best possible recall is below 0.98, recalculates the anchor frames; the adaptive picture scaling method calculates the width and height scaling ratios from the input image size and the output feature map size, calculates the actual size after scaling from these ratios, and calculates the gray padding values from the actual size so as to align the output feature map A to 640×640.
Preferably, the feature extraction network further comprises three groups of convolutions, and a CSP module is added to each group; the CSP module splits the feature map into two parts, one part undergoes a convolution operation, and its result is then fused with the other part in a feature fusion operation that increases the number of channels; the third group of convolutions adopts the improved spatial pyramid pooling structure.
Preferably, in the improved spatial pyramid pooling structure, the input feature map is split into two paths after passing through a CBL module once: one path passes through three 5×5 maximum pooling layers in series, each maximum pooling layer outputs a feature map for a tensor splicing operation with the other path, and the spliced feature map is output to the CSP2 module after passing through a CBL module.
Preferably, the feature fusion network adopts a feature pyramid network and a path aggregation network to aggregate the image features of this stage, and the feature maps are fused as follows:
the feature map D undergoes an upsampling operation and a tensor splicing operation with the feature map C, and the spliced feature map passes in sequence through a CSP2 module and a CBL module to output a feature map E; the feature map E is upsampled again and spliced with the feature map B, and the spliced feature map passes through a CSP2 module to output a feature map F; the feature map F passes through a CBL module and is spliced with the feature map E, and the spliced feature map passes through a CSP2 module to output a feature map G; the feature map G passes through a CBL module and is spliced with the feature map D, and the spliced feature map passes through a CSP2 module to output a feature map H; the feature maps F, G and H serve as the final output of the feature fusion network and are input to the output end of the improved YOLOv5s neural network model.
Preferably, the CIOU_Loss algorithm measures the degree of coincidence between the real frame and the predicted frame. It sets the smallest rectangle that can enclose both the real frame and the predicted frame in order to evaluate the distance between the two frames, the diagonal length of this rectangle being c; it introduces the distance d between the center points of the real frame and the predicted frame to evaluate the case where one frame encloses the other, and introduces the aspect ratio term v of the real frame and the predicted frame to evaluate whether the center points of the two frames coincide. The formula of the CIOU_Loss algorithm is as follows:

CIOU_Loss = 1 - CIoU = 1 - (IoU - d²/c² - αv)

v = (4/π²) · (arctan(w_gt/h_gt) - arctan(w/h))²

α = v / ((1 - IoU) + v)

wherein CIoU is the degree of coincidence of the real frame and the predicted frame, IoU is the intersection ratio of the real frame and the predicted frame, α is the weight of the aspect ratio term, w is the width of the predicted frame, h is the height of the predicted frame, w_gt is the width of the real frame, and h_gt is the height of the real frame.
Preferably, the improved YOLOv5s neural network model inputs the marked feature maps classified as vehicles into the vehicle traffic violation detection model, which performs violation judgment and license plate recognition on them: the vehicle is judged for traffic violations based on its driving trajectory, the road surface markings, the traffic indication signs and the signal light state, and if a traffic violation is established, license plate recognition is further performed; the license plate recognition adopts a trained SVM model.
Preferably, the improved YOLOv5s neural network model inputs the marked feature maps classified as pedestrians into the pedestrian traffic violation detection model, which performs violation judgment and face recognition on them: the pedestrian is judged for traffic violations based on the pedestrian's movement trajectory, the road surface markings, the traffic indication signs and the signal light state, and if a traffic violation is established, face recognition is further performed; the face recognition adopts a trained simplified RetinaFace model, which further comprises a trunk feature extraction network, an FPN feature pyramid network, an SSH feature enhancement network and a face target output end arranged in sequence.
The application further provides a deep learning-based traffic offence classification detection SoC chip. The SoC chip comprises a general-purpose processor and a plurality of neural network processors; the general-purpose processor controls the operation of the neural network processors through custom instructions, and the neural network processors further comprise: a neural network processor for image classification, a neural network processor for vehicle traffic violation detection, and a neural network processor for pedestrian traffic violation detection, for performing the above method.
Compared with the prior art, the application has the following technical effects:
1. The improved YOLOv5s neural network model marks and classifies traffic images acquired in real time according to the different traffic participants. It makes full use of the model's shallow network depth and narrow feature maps, reducing computational complexity so that the neural network model stays lightweight while remaining accurate. Applied to target detection and classification in traffic scenes, where large targets dominate, it effectively improves detection and classification efficiency.
2. The improved YOLOv5s neural network model is very small and uses little memory, so it can easily be deployed in embedded devices, which effectively reduces the complexity of the detection equipment and broadens its usage scenarios.
3. In the improved YOLOv5s neural network model, an SPPF module replaces the SPP module: tensor splicing is performed after the feature map passes through three 5×5 maximum pooling layers in series. Compared with the SPP module, where the feature map passes through the maximum pooling layers in parallel before tensor splicing, the two produce the same result, but the SPPF module computes about 2.5 times faster, improving feature extraction efficiency.
4. By training different algorithm models to detect the violations of different road traffic participants, and adopting a neural network model suited to the task of each stage, the method effectively reduces the algorithm complexity of the traffic violation detection models, improves overall detection efficiency, and fully meets the real-time requirements of road traffic violation detection, compared with classifying, labeling and identifying all traffic participants with a single neural network model.
Drawings
FIG. 1 is a flow chart of a deep learning-based traffic offence classification detection method of the present application;
FIG. 2 is a schematic diagram of an improved YOLOv5s model structure of the deep learning-based traffic offence classification detection method of the present application;
FIG. 3 is a schematic diagram of the frame regression algorithm CIoU of the deep learning-based traffic offence classification detection method of the present application.
In the figure: 1. input end; 2. feature extraction network; 3. feature fusion network; 4. output end.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings in conjunction with specific embodiments of the present application.
Referring to FIG. 1 to FIG. 3, the deep learning-based traffic offence classification detection method includes the following steps:
S1: traffic scene video is collected through imaging equipment, which includes vehicle-mounted mobile photographing equipment and fixed monitoring equipment. The collected traffic scene video is decomposed into a number of consecutive video frames according to the video frame rate, and the generated images are used by the classification neural network model for target detection and marking.
S2: the video frames are input into the trained classification neural network model, and the road surface, traffic indication signs, signal lights and traffic participants in the video frames are marked; the frames are then classified according to the traffic participants to generate vehicle feature maps and pedestrian feature maps respectively. The classification neural network model adopts an improved YOLOv5s neural network model, which, compared with the YOLOv5 neural network model, reduces the number of neural network layers and the number of channels per layer and improves the spatial pyramid pooling structure so that the feature map passes through three maximum pooling layers in sequence. This makes full use of the small number of layers and narrow feature maps of the improved YOLOv5s model, reduces computational complexity, and keeps the model lightweight while remaining accurate; applied to target detection and classification in traffic scenes dominated by large targets, it effectively improves detection and classification efficiency.
S3: the classification neural network model classifies the video frames according to the traffic participants, labeling video frames containing vehicles as vehicle feature maps and video frames containing pedestrians as pedestrian feature maps. By classifying the video frames, the traffic violation pictures of different traffic participants can be sent to the corresponding neural network models for violation detection and participant identification, which effectively improves the accuracy and efficiency of traffic violation detection.
S4: the vehicle feature map is input into the trained vehicle traffic violation detection model to judge vehicle traffic violations; if a traffic violation is found, a vehicle traffic violation image is generated for subsequent processing. The vehicle traffic violation detection model comprises a violation judgment algorithm and a license plate recognition model: the violation judgment algorithm judges whether the vehicle is in violation according to the vehicle's driving trajectory, the road surface markings, the traffic indication signs and the signal light state, and the license plate recognition model adopts an SVM model that recognizes the license plate through a license plate positioning module, a license plate character segmentation module and a license plate character recognition module.
S5: the pedestrian feature map is input into the trained pedestrian traffic violation detection model to judge pedestrian traffic violations; if a traffic violation is found, a pedestrian traffic violation image is generated for subsequent processing. The pedestrian traffic violation detection model comprises a violation judgment algorithm and a face recognition model: the violation judgment algorithm judges whether the pedestrian is in violation according to the pedestrian's movement trajectory, the road surface markings, the traffic indication signs and the signal light state, and the face recognition model adopts a simplified RetinaFace model, which further comprises: a trunk feature extraction network, an FPN feature pyramid network, an SSH feature enhancement network and a face target output end.
The improved YOLOv5s neural network model structure comprises an input end 1, a feature extraction network 2, a feature fusion network 3 and an output end 4.
The input end 1 (Input) uses the Mosaic data enhancement operation to improve the training speed and network precision of the model, and provides adaptive anchor frame calculation and adaptive picture scaling methods; the input end 1 outputs a three-channel RGB feature map A with a resolution of 640×640.
The Mosaic data enhancement operation splices four pictures by random scaling, random cropping and random arrangement, which enriches the background of the detection targets and also improves the detection of small targets. The adaptive anchor frame calculation adaptively computes the optimal anchor frame values for different training sets; anchor frames of specific widths and heights need to be set for different data sets, and during network training the model outputs predicted frames on the basis of the initial anchor frames, computes the difference between the predicted frames and the real frames, and performs a reverse update operation to update the parameters of the whole network. The adaptive picture scaling method calculates the width and height scaling ratios from the input image size and the output feature map size, calculates the actual size after scaling from these ratios, and finally calculates the gray padding values from the actual size to output the feature map A aligned to 640×640 with three RGB channels.
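As a concrete illustration of this adaptive picture scaling, the following is a minimal sketch in Python with OpenCV; the function name and the gray value 114 are illustrative assumptions, not taken from the patent:

```python
import cv2

def letterbox(img, out_size=640, pad_value=114):
    h, w = img.shape[:2]
    r = min(out_size / w, out_size / h)          # width/height scaling ratio
    new_w, new_h = round(w * r), round(h * r)    # actual size after scaling
    resized = cv2.resize(img, (new_w, new_h))
    pad_w, pad_h = out_size - new_w, out_size - new_h  # total gray padding
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    return cv2.copyMakeBorder(resized, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=(pad_value,) * 3)
```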
The feature extraction network 2 (Backbone) receives the feature map A and extracts features from it through convolutional networks for subsequent target detection. The feature map A passes in sequence through a Focus module, a CBL module, a CSP1 module, a CBL module and a CSP1 module of the feature extraction network and outputs a feature map B; this is the first group of convolutions. The feature map B passes in sequence through a CBL module and a CSP1 module and outputs a feature map C; this is the second group of convolutions. The feature map C passes in sequence through a CBL module, an SPPF module, a CSP2 module and a CBL module and outputs a feature map D; this is the third group of convolutions.
The Focus module slices the input feature map A: four slicing operations are performed on the feature map A in parallel to generate four 320×320×3 sub-maps, which are tensor-spliced into a 320×320×12 intermediate feature map I; the feature map I then passes through a CBL module with 32 convolution kernels to finally generate a 320×320×32 intermediate feature map II.
the CBL module is the most basic module in YOLOv5s, in order Conv convolution, BN (Batch Normalization ), and LeakyRelu activation functions; continuing to convolve the intermediate feature map II generated by the Focus module in the CBL module to generate an intermediate feature map III of 160 multiplied by 64;
the CSP module in this embodiment has two structures, the CSP1 structure is applied to the feature extraction network 2 (backhaul), and the CSP2 structure is applied to the feature fusion network 3 (Neck); the CSP1 module divides the three middle feature images into two paths in parallel, one path sequentially passes through the CBL module, n residual error components (ResUnit) and one convolution, tensor splicing operation is carried out on the other path after one convolution, the generated feature images pass through the one-time BN batch normalization module, the activation function LeakyRelu and the CBL module to generate 80 multiplied by 128 middle feature images, and the middle feature images pass through the CSP1 module again to generate 80 multiplied by 128 feature images B; the residual structure is added, so that the gradient value of back propagation between layers can be increased, gradient disappearance caused by deepening is avoided, and therefore, finer granularity characteristics can be extracted without worrying about network degradation;
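A corresponding sketch of the CSP1 module under the same assumptions, reusing the CBL class from the previous sketch; n is the number of residual components, and the channel handling is illustrative:

```python
class ResUnit(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.cbl1 = CBL(c, c)
        self.cbl2 = CBL(c, c, k=3)

    def forward(self, x):
        return x + self.cbl2(self.cbl1(x))      # residual connection

class CSP1(nn.Module):
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_half = c_out // 2
        # Path 1: CBL, n residual components, then one convolution
        self.path1 = nn.Sequential(CBL(c_in, c_half),
                                   *[ResUnit(c_half) for _ in range(n)],
                                   nn.Conv2d(c_half, c_half, 1, bias=False))
        # Path 2: a single convolution
        self.path2 = nn.Conv2d(c_in, c_half, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(0.1)
        self.cbl_out = CBL(c_out, c_out)

    def forward(self, x):
        y = torch.cat([self.path1(x), self.path2(x)], dim=1)  # tensor splicing
        return self.cbl_out(self.act(self.bn(y)))             # BN, LeakyReLU, CBL
```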
The feature map B generates the feature map C through the second group of convolutions, which comprises a CBL module and a CSP1 module: the feature map B becomes 40×40×256 after the CBL module, and the CSP1 module outputs the feature map C of the same size.
The feature map C generates the feature map D through the third group of convolutions, which comprises a CBL module, an SPPF module, a CSP2 module and a CBL module: the feature map becomes 20×20×512 after the first CBL module, and the second CBL module generates the feature map D with a size of 20×20×256. The SPPF module splits its input into two paths after one CBL module: one path passes through three 5×5 maximum pooling layers in series, each maximum pooling layer outputs a feature map that is tensor-spliced with the other path, and the spliced feature map passes through a CBL module before being output to the CSP2 module. The CSP2 module has a structure similar to the CSP1 module, except that the n residual components are replaced with 2n CBL modules.
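The SPPF module described above can be sketched the same way; the serial 5×5 poolings reproduce the 5×5, 9×9 and 13×13 receptive fields that the SPP module obtains with parallel pooling layers, which is why the results match while the computation is cheaper. The sketch again reuses the CBL class, and the channel choices are illustrative:

```python
class SPPF(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        hidden = c_in // 2
        self.cbl1 = CBL(c_in, hidden)              # CBL before the pooling chain
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.cbl2 = CBL(hidden * 4, c_out)         # CBL after tensor splicing

    def forward(self, x):
        x = self.cbl1(x)
        y1 = self.pool(x)                          # one 5x5 max pooling
        y2 = self.pool(y1)                         # equivalent to a 9x9 pooling
        y3 = self.pool(y2)                         # equivalent to a 13x13 pooling
        return self.cbl2(torch.cat([x, y1, y2, y3], dim=1))
```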
The feature fusion network 3 (Neck) mixes and combines the feature maps of different stages, enhancing the robustness of the network and its object detection capability, and transmits the fused features to the output end for prediction. In the feature fusion network, the feature maps B, C and D are fused at different layers to generate feature maps F, G and H of different sizes for target detection at the output end. The specific fusion process is as follows:
The feature map D undergoes an upsampling operation and is tensor-spliced with the feature map C; the spliced intermediate feature map has a size of 40×40×512 and, after passing in sequence through a CSP2 module and a CBL module, outputs the feature map E with a size of 40×40×128. The feature map E is upsampled again and tensor-spliced with the feature map B; the spliced intermediate feature map has a size of 80×80×256 and, after a CSP2 module, outputs the feature map F with a size of 80×80×128. The feature map F passes through a CBL module and is tensor-spliced with the feature map E; the spliced feature map has a size of 40×40×256 and, after a CSP2 module, outputs the feature map G with a size of 40×40×128. The feature map G passes through a CBL module and is tensor-spliced with the feature map D; the spliced feature map has a size of 20×20×512 and, after a CSP2 module, outputs the feature map H with a size of 20×20×128. The feature maps F, G and H serve as the final output of the feature fusion network and are input to the output end 4 of the improved YOLOv5s neural network model.
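To make the fusion path easier to follow, the sketch below walks through the tensor splicing with the sizes stated above; csp2 and cbl are placeholder dictionaries of CSP2 and CBL modules with the appropriate channel counts and strides (illustrative, not the trained network):

```python
import torch
import torch.nn.functional as F

def fuse(B, C, D, csp2, cbl):
    d_up = F.interpolate(D, scale_factor=2)            # 20x20x256 -> 40x40x256
    E = cbl["E"](csp2["E"](torch.cat([d_up, C], 1)))   # splice: 40x40x512 -> E 40x40x128
    e_up = F.interpolate(E, scale_factor=2)            # 40x40x128 -> 80x80x128
    Fm = csp2["F"](torch.cat([e_up, B], 1))            # splice: 80x80x256 -> F 80x80x128
    G = csp2["G"](torch.cat([cbl["F"](Fm), E], 1))     # downsample F, splice -> G 40x40x128
    H = csp2["H"](torch.cat([cbl["G"](G), D], 1))      # downsample G, splice -> H 20x20x128
    return Fm, G, H                                    # final outputs F, G, H
```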
The output end 4 (Output) applies a 3×3 convolution to each of the feature maps F, G and H and outputs three target detection results of different sizes through a classification loss function, a regression loss function and a confidence loss function. The classification loss function adopts a cross entropy loss function and is used to calculate whether the anchor frames and their corresponding labeled classifications are correct. The regression loss function adopts CIOU_Loss and is used to measure the error between the predicted frames and the real frames. The confidence loss is calculated from the sample pairs obtained by positive sample matching: the target confidence score inside a predicted frame and the IoU value between the predicted frame and its corresponding target frame (taken as the real frame) are used to compute a binary cross entropy, which yields the final target confidence loss and thus the confidence of the network.
The output end 4 (Output) adopts the CIoU_Loss algorithm to measure the degree of coincidence between the real frame and the predicted frame. The CIoU_Loss algorithm sets the smallest rectangle that can enclose both the real frame and the predicted frame in order to evaluate the distance between the two frames, the diagonal length of this rectangle being c; it introduces the distance d between the center points of the real frame and the predicted frame to evaluate the case where one frame encloses the other, and introduces the aspect ratio term v of the real frame and the predicted frame to evaluate whether the center points of the two frames coincide. The formula of the CIoU_Loss algorithm is as follows:

CIoU_Loss = 1 - CIoU = 1 - (IoU - d²/c² - αv)

v = (4/π²) · (arctan(w_gt/h_gt) - arctan(w/h))²

α = v / ((1 - IoU) + v)

wherein CIoU is the degree of coincidence of the real frame and the predicted frame, IoU is the intersection ratio of the real frame and the predicted frame, α is the weight of the aspect ratio term, w is the width of the predicted frame, h is the height of the predicted frame, w_gt is the width of the real frame, and h_gt is the height of the real frame.
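The loss can be computed directly from this formula; below is an illustrative PyTorch sketch for boxes given as (center x, center y, width, height), with a small epsilon added for numerical stability (an assumption, not from the patent):

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU_Loss for boxes given as (cx, cy, w, h) tensors of shape (..., 4)."""
    px, py, pw, ph = pred.unbind(-1)
    tx, ty, tw, th = target.unbind(-1)
    # IoU: intersection ratio of the predicted frame and the real frame
    iw = (torch.min(px + pw / 2, tx + tw / 2) -
          torch.max(px - pw / 2, tx - tw / 2)).clamp(0)
    ih = (torch.min(py + ph / 2, ty + th / 2) -
          torch.max(py - ph / 2, ty - th / 2)).clamp(0)
    inter = iw * ih
    iou = inter / (pw * ph + tw * th - inter + eps)
    # c^2: squared diagonal of the smallest enclosing rectangle
    cw = torch.max(px + pw / 2, tx + tw / 2) - torch.min(px - pw / 2, tx - tw / 2)
    ch = torch.max(py + ph / 2, ty + th / 2) - torch.min(py - ph / 2, ty - th / 2)
    c2 = cw ** 2 + ch ** 2 + eps
    # d^2: squared distance between the two center points
    d2 = (px - tx) ** 2 + (py - ty) ** 2
    # v: aspect ratio term; alpha: its trade-off weight
    v = (4 / math.pi ** 2) * (torch.atan(tw / (th + eps)) -
                              torch.atan(pw / (ph + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - (iou - d2 / c2 - alpha * v)
```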
The training of the improved YOLOv5s neural network model specifically comprises the following steps:
S2-1: establish a traffic scene data set and divide it into a training set and a verification set at a ratio of 8:2. The traffic scene data set can adopt the UA-DETRAC data set, which was captured mainly from road overpasses in Beijing and Tianjin and is manually annotated with 8,250 vehicles and about 1.21 million object bounding boxes. Vehicles are divided into four classes, namely cars, buses, vans and other vehicles; weather conditions fall into four categories, namely cloudy, night, sunny and rainy.
S2-2: input the training set in sequence into the input end 1, the feature extraction network 2, the feature fusion network 3 and the output end 4 of the improved YOLOv5s neural network, predict the positions and classifications of the road surface, traffic indication signs, signal lights and traffic participants in the training set, and output the prediction results, which are compared against the verification set for verification.
S2-3: repeat step S2-2 until the set number of training iterations is reached, and save the last trained model as the trained classification neural network model.
The improved YOLOv5s neural network model inputs the feature maps classified as vehicles into the vehicle traffic violation detection model, which performs violation judgment and license plate recognition on them. The vehicle is judged for traffic violations based on its driving trajectory, the road surface markings, the traffic indication signs and the signal light state; if a traffic violation is established, license plate recognition is further performed using a trained SVM model. License plate recognition with the SVM model comprises the following steps:
License plate positioning: convert to gray scale, turning the color picture into a gray-scale image, commonly by taking the average of the R, G and B values of each pixel; remove noise with Gaussian smoothing and median filtering; apply binarization to convert the image to black and white, setting a pixel's gray value to 255 if it is greater than 127 and to 0 otherwise; perform Canny edge detection, then closing and opening operations to eliminate small regions and keep large ones, thereby locating the license plate position; dilate and erode to amplify the image contours and convert them into regions that contain the license plate; select the proper license plate position algorithmically, filtering out small regions or searching for blue-background regions; mark the license plate position and extract the license plate.
License plate character segmentation: since the image contains only black and white pixels, the characters can be segmented by the white and black pixels of the image, locating the characters by examining the black-and-white pixel values of each row and each column.
License plate character recognition: two SVM models are trained to recognize, respectively, the Chinese province abbreviation and the subsequent letters and digits on the license plate.
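The positioning steps above translate naturally into OpenCV; the following hedged sketch uses the thresholds mentioned in the text, while the structuring-element size and area threshold are illustrative assumptions:

```python
import cv2

def locate_plate_candidates(img_bgr):
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)          # gray scale conversion
    gray = cv2.GaussianBlur(gray, (5, 5), 0)                  # Gaussian smoothing
    gray = cv2.medianBlur(gray, 5)                            # median filtering
    _, bw = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)  # binarization at 127
    edges = cv2.Canny(bw, 100, 200)                           # Canny edge detection
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (17, 5))
    closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)  # closing operation
    opened = cv2.morphologyEx(closed, cv2.MORPH_OPEN, kernel)  # opening operation
    contours, _ = cv2.findContours(opened, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Keep only large regions; blue-background filtering would follow here.
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 2000]
```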
The improved YOLOv5s neural network model inputs the marked feature maps classified as pedestrians into the pedestrian traffic violation detection model, which performs violation judgment and face recognition on them. The pedestrian is judged for traffic violations based on the pedestrian's movement trajectory, the road surface markings, the traffic indication signs and the signal light state; if a traffic violation is established, face recognition is further performed. The face recognition adopts a trained simplified RetinaFace model, which further comprises a trunk feature extraction network, an FPN feature pyramid network, an SSH feature enhancement network and a face target output end arranged in sequence.
The trunk feature extraction network adopts a lightweight MobileNet-based backbone for faster detection. The backbone retains three levels of feature maps for the feature pyramid network and generates detection frames on three different scales, introducing anchor frames of different sizes on each scale so that faces of different sizes can all be detected. The FPN feature pyramid network first adjusts the channel numbers of the three effective feature maps with 1×1 convolutions, then performs upsampling feature fusion through upsampling and tensor addition, and outputs three feature maps C1, C2 and C3. Three effective feature layers P1, P2 and P3 are obtained through the FPN feature pyramid network; to further strengthen feature extraction, an SSH module is used to enlarge the receptive field (the region of the input image that each pixel of a layer's output feature map corresponds to). The SSH module comprises three detection modules: Detection Module M3 for detecting large faces, Detection Module M2 for detecting medium faces, and Detection Module M1 for detecting small faces. The SSH module improves small-face detection by introducing context information into the feature maps and outputs three effective feature layers. The face target output end obtains prediction results from the three effective feature layers S1, S2 and S3; the predictions are of three kinds: classification prediction, face frame prediction and face key point prediction.
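As a rough sketch of the SSH feature enhancement idea (context branches with effective 5×5 and 7×7 receptive fields built from stacked 3×3 convolutions, then concatenation), assuming PyTorch; this follows the published SSH design and is illustrative rather than the exact module of the patent:

```python
import torch
import torch.nn as nn

def conv_bn(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
                         nn.BatchNorm2d(c_out))

class SSH(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        assert c_out % 4 == 0
        self.conv3 = conv_bn(c_in, c_out // 2)         # plain 3x3 branch
        self.stem = conv_bn(c_in, c_out // 4)          # shared context stem
        self.conv5 = conv_bn(c_out // 4, c_out // 4)   # stem + 3x3 ~ 5x5 field
        self.conv7a = conv_bn(c_out // 4, c_out // 4)
        self.conv7b = conv_bn(c_out // 4, c_out // 4)  # stem + two 3x3 ~ 7x7 field

    def forward(self, x):
        ctx = torch.relu(self.stem(x))
        out = torch.cat([self.conv3(x),
                         self.conv5(ctx),
                         self.conv7b(torch.relu(self.conv7a(ctx)))], dim=1)
        return torch.relu(out)  # concatenated context-enhanced feature map
```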
The deep learning-based traffic offence classification detection SoC chip comprises a general-purpose processor and a plurality of neural network processors. The general-purpose processor controls the operation of the neural network processors through custom instructions, and the neural network processors further comprise: a neural network processor for image classification, a neural network processor for vehicle traffic violation detection, and a neural network processor for pedestrian traffic violation detection, for performing the above method.
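The division of labor among the three neural network processors mirrors steps S1 to S5. Below is a minimal host-side sketch of this dispatch in Python; classify, is_violation, recognize_plate and recognize_face are purely illustrative names standing in for work offloaded to the respective processors:

```python
import cv2

def detect_violations(video_path, cls_npu, veh_npu, ped_npu):
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()                     # S1: next video frame
        if not ok:
            break
        for det in cls_npu.classify(frame):        # S2/S3: mark and classify
            if det.label == "vehicle":             # S4: vehicle branch
                if veh_npu.is_violation(det, frame):
                    yield "vehicle", veh_npu.recognize_plate(det.crop), frame
            elif det.label == "pedestrian":        # S5: pedestrian branch
                if ped_npu.is_violation(det, frame):
                    yield "pedestrian", ped_npu.recognize_face(det.crop), frame
    cap.release()
```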
The foregoing is merely a preferred embodiment of the present application, and it should be noted that modifications and improvements could be made by those skilled in the art without departing from the inventive concept, which falls within the scope of the present application.

Claims (7)

1. The traffic illegal behavior classification detection method based on deep learning is characterized by comprising the following steps of:
s1: collecting traffic scene videos through imaging equipment, and decomposing the traffic scene videos into a plurality of continuous video frames;
s2: inputting the video frames into a trained classified neural network model, and marking road surfaces, traffic indication marks, signal lamps and traffic participants in the video frames, wherein the classified neural network model adopts an improved YOLOv5s neural network model; the improved YOLOv5s neural network model reduces the number of layers and the number of channels of each layer of the neural network of the model, and improves the flow of the spatial pyramid pooling structure into a characteristic diagram which sequentially passes through three maximum pooling layers;
s3: the classification neural network model classifies video frames according to traffic participants, classifies the video frames with vehicle labels as vehicle feature graphs, and classifies the video frames with pedestrian labels as pedestrian feature graphs;
s4: inputting the vehicle feature map into a trained vehicle traffic violation detection model, judging vehicle traffic violation behaviors, and if the vehicle feature map is judged to be traffic violation behaviors, generating a vehicle traffic violation image for subsequent processing after license plate recognition; the vehicle traffic violation detection model comprises a vehicle violation judging algorithm and a license plate recognition model, wherein the vehicle violation judging algorithm judges whether the vehicle is illegal or not according to the running track of the vehicle, road surface identification, traffic indication identification and signal lamp state, the license plate recognition model adopts an SVM model, and a license plate is recognized through a license plate positioning module, a license plate character segmentation module and a license plate character recognition module;
s5: inputting the pedestrian feature map into a trained pedestrian traffic violation detection model, judging pedestrian traffic violation behaviors, and if the pedestrian feature map is judged to show a traffic violation behavior, generating a pedestrian traffic violation image for subsequent processing after face recognition; the pedestrian traffic violation detection model comprises a pedestrian violation judgment algorithm and a face recognition model, wherein the pedestrian violation judgment algorithm judges whether a pedestrian is in violation according to the action track of the pedestrian, road surface identification, traffic indication identification and signal lamp state, the face recognition model adopts a simplified RetinaFace model, and the simplified RetinaFace model further comprises: a trunk feature extraction network, an FPN feature pyramid network, an SSH feature enhancement network and a face target output end;
the training of the improved YOLOv5s neural network model specifically comprises the following steps:
s2-1: establishing a traffic scene data set, wherein the traffic scene data set is divided into a training set and a verification set according to a set proportion;
s2-2: sequentially inputting the training set into an input end, a feature extraction network, a feature fusion network and an output end of an improved YOLOv5s neural network, predicting the positions and classifications of the road surface, traffic indication marks, signal lamps and traffic participants in the training set, and outputting a prediction result, wherein the prediction result is compared with the verification set for verification;
s2-3: repeating the step S2-2 until the set training times are reached, and storing the last trained model as a trained classified neural network model;
the improved YOLOv5s neural network model includes:
the input end uses a Mosaic method, a self-adaptive anchor frame calculation and a self-adaptive picture scaling method to output a characteristic diagram A of three channels of RGB with the resolution of 640 multiplied by 640;
the characteristic extraction network is used for receiving the characteristic image A, extracting the characteristic image A through a convolution network and detecting a subsequent target; the feature map A sequentially passes through a Focus module, a CBL module, a CSP1 module, a CBL module and a CSP1 module of the feature extraction network and then outputs a feature map B; the feature map B sequentially passes through a CBL module and a CSP1 module of the feature extraction network and then outputs a feature map C; the feature map C sequentially passes through a CBL module, an SPPF module, a CSP2 module and a CBL module of the feature extraction network and then outputs a feature map D;
the feature fusion network is used for mixing and combining the feature images at different stages and transmitting the mixed and combined feature images to an output end for prediction; in the feature fusion network, a feature map B, a feature map C and a feature map D are respectively fused with feature maps of different stages of the feature fusion network to generate feature maps F, G and H with different sizes to an output end for target detection;
the output end is used for respectively carrying out 3×3 convolution on the feature map F, the feature map G and the feature map H, and then outputting three target detection results with different sizes through a classification Loss function and a regression Loss function, wherein the classification Loss function adopts a cross entropy Loss function, and the regression Loss function adopts CIOU_Loss;
the CIoU_Loss algorithm measures the degree of coincidence between a real frame and a predicted frame: a minimum rectangle wrapping both the real frame and the predicted frame is constructed, and its diagonal length c evaluates the distance between the two frames; the distance d between the centre points of the real frame and the predicted frame evaluates the case where the two frames enclose each other; and the aspect-ratio term v of the real frame and the predicted frame evaluates the case where the centre points of the two frames coincide. The CIoU_Loss algorithm is:

CIoU = IoU − d²/c² − αv
v = (4/π²) · (arctan(w_gt/h_gt) − arctan(w/h))²
α = v / ((1 − IoU) + v)
CIoU_Loss = 1 − CIoU

wherein CIoU is the degree of coincidence between the real frame and the predicted frame, IoU is the intersection-over-union of the real frame and the predicted frame, d is the centre-point distance, c is the diagonal length of the enclosing rectangle, α is the trade-off weight of the aspect-ratio term, w and h are the width and height of the predicted frame, and w_gt and h_gt are the width and height of the real frame;
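A plain-Python sketch of the CIoU_Loss just defined may make the terms concrete; the (x1, y1, x2, y2) box convention and the sample boxes are illustrative assumptions.

```python
import math

def ciou_loss(pred, gt):
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    w, h = px2 - px1, py2 - py1            # predicted-frame width and height
    w_gt, h_gt = gx2 - gx1, gy2 - gy1      # real-frame width and height

    # IoU: intersection over union of the two frames.
    ix = max(0.0, min(px2, gx2) - max(px1, gx1))
    iy = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = ix * iy
    iou = inter / (w * h + w_gt * h_gt - inter)

    # d^2: squared centre-point distance; c^2: squared diagonal of the
    # minimum rectangle wrapping both frames.
    d2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2

    # v: aspect-ratio consistency term; alpha: its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v)

    ciou = iou - d2 / c2 - alpha * v
    return 1 - ciou

print(round(ciou_loss((0, 0, 4, 4), (1, 1, 5, 5)), 4))  # ~0.6487
```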
the Mosaic data enhancement operation at the input end stitches four pictures together by random scaling, random cropping and random arrangement; the adaptive anchor box calculation adaptively computes optimal anchor box values for different training sets: the best possible recall is computed for the default anchors, and the anchors are recalculated if that recall falls below 0.98; the adaptive image scaling method computes the width and height scale ratios from the input image size and the output feature map size, derives the actual size after scaling from those ratios, and computes the gray filling values from the actual size so as to align the image to the 640×640 size of feature map A.
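The adaptive scaling step can be sketched as a standard letterbox operation using OpenCV; the gray value 114 is the common YOLOv5 default and is assumed here for illustration, as the claim specifies only that gray filling values are computed.

```python
import cv2
import numpy as np

def adaptive_letterbox(image, out_size=640, pad_value=114):
    h, w = image.shape[:2]
    r = min(out_size / w, out_size / h)        # width/height scale ratios; keep the smaller one
    new_w, new_h = round(w * r), round(h * r)  # actual size after scaling
    resized = cv2.resize(image, (new_w, new_h))
    # gray padding that aligns the image to the 640x640 feature map A size
    top = (out_size - new_h) // 2
    left = (out_size - new_w) // 2
    return cv2.copyMakeBorder(resized, top, out_size - new_h - top,
                              left, out_size - new_w - left,
                              cv2.BORDER_CONSTANT, value=(pad_value,) * 3)

canvas = adaptive_letterbox(np.zeros((720, 1280, 3), dtype=np.uint8))
print(canvas.shape)  # (640, 640, 3)
```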
2. The deep learning-based traffic offence classification detection method of claim 1, wherein the feature extraction network further comprises three groups of convolutions, a CSP module being added to each group; the CSP module splits a feature map into two parts, one part undergoes a convolution operation, and its result is fused with the other part by a channel-increasing feature fusion operation; the third group of convolutions employs an improved spatial pyramid pooling structure.
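A minimal sketch of the CSP split-and-fuse idea in claim 2 follows: the feature map is split along channels, one half is convolved, and the halves are re-fused by channel concatenation (which increases the channel count relative to each branch). The exact convolution stack inside the branch is an assumption.

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(              # the part that performs the convolution
            nn.Conv2d(half, half, 3, padding=1, bias=False),
            nn.BatchNorm2d(half),
            nn.LeakyReLU(0.1, inplace=True),
        )
    def forward(self, x):
        a, b = x.chunk(2, dim=1)                  # break the feature map into two parts
        return torch.cat([a, self.branch(b)], dim=1)  # fuse by stacking channels

y = CSPBlock(64)(torch.randn(1, 64, 80, 80))
print(y.shape)  # torch.Size([1, 64, 80, 80])
```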
3. The deep learning-based traffic offence classification detection method of claim 2, wherein in the improved spatial pyramid pooling structure the input feature map, after once passing through a CBL module, is output along two paths; one path passes through three serial 5×5 maximum pooling layers, the feature map output by each maximum pooling layer is tensor-spliced with the other path, and the spliced feature map is output to the CSP2 module after passing through a CBL module.
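A PyTorch sketch of this improved pooling structure follows; the channel widths are assumptions, while the claimed topology (one CBL, three serial 5×5 max-pooling layers, tensor splicing with the un-pooled path, then a CBL feeding the CSP2 module) is reproduced as stated.

```python
import torch
import torch.nn as nn

class ImprovedSPP(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_in // 2
        self.cbl_in = nn.Sequential(nn.Conv2d(c_in, c_mid, 1, bias=False),
                                    nn.BatchNorm2d(c_mid), nn.LeakyReLU(0.1, inplace=True))
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.cbl_out = nn.Sequential(nn.Conv2d(4 * c_mid, c_out, 1, bias=False),
                                     nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1, inplace=True))
    def forward(self, x):
        x = self.cbl_in(x)        # one pass through the CBL module
        p1 = self.pool(x)         # three 5x5 maximum pooling layers in series
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.cbl_out(torch.cat([x, p1, p2, p3], dim=1))  # tensor splicing

y = ImprovedSPP(1024, 1024)(torch.randn(1, 1024, 20, 20))
print(y.shape)  # torch.Size([1, 1024, 20, 20])
```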
4. The deep learning-based traffic offence classification detection method as claimed in claim 1, wherein the feature fusion network adopts a feature pyramid network and a path aggregation network to aggregate the image features of this stage, the feature maps being fused specifically as follows:
feature map D, after an up-sampling operation, is tensor-spliced with feature map C; the spliced feature map passes sequentially through a CSP2 module and a CBL module to output feature map E; feature map E, after being up-sampled again, is tensor-spliced with feature map B, and the spliced feature map outputs feature map F through one CSP2 module; feature map F, after passing through a CBL module, is tensor-spliced with feature map E, and the spliced feature map outputs feature map G through a CSP2 module; feature map G, after passing through a CBL module, is tensor-spliced with feature map D, and the spliced feature map outputs feature map H through a CSP2 module; feature maps F, G and H serve as the final output of the feature fusion network and are fed to the output end of the improved YOLOv5s neural network model.
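A shape-level sketch of this FPN + PAN wiring follows. The CSP2 and CBL modules are reduced to thin stand-ins (a 1×1 convolution and a conv-BN-LeakyReLU block), and the channel widths are assumptions; only the tensor-splicing topology the claim specifies is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cbl(c_in, c_out, s=1):
    """CBL stand-in; s=2 gives the downsampling used on the PAN path."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, s, 1, bias=False),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1, inplace=True))

def csp2(c_in, c_out):
    """Placeholder CSP2: a 1x1 convolution standing in for the real block."""
    return nn.Conv2d(c_in, c_out, 1)

def up(t):
    return F.interpolate(t, scale_factor=2, mode="nearest")

# backbone outputs B, C, D with assumed YOLOv5s channel widths
B = torch.randn(1, 256, 80, 80)
C = torch.randn(1, 512, 40, 40)
D = torch.randn(1, 1024, 20, 20)

E = cbl(512, 256)(csp2(1024 + 512, 512)(torch.cat([up(D), C], 1)))    # D up + C -> CSP2 -> CBL -> E
Fm = csp2(256 + 256, 256)(torch.cat([up(E), B], 1))                   # E up + B -> CSP2 -> F
G = csp2(256 + 256, 512)(torch.cat([cbl(256, 256, s=2)(Fm), E], 1))   # F -> CBL, + E -> CSP2 -> G
H = csp2(512 + 1024, 1024)(torch.cat([cbl(512, 512, s=2)(G), D], 1))  # G -> CBL, + D -> CSP2 -> H
print(Fm.shape, G.shape, H.shape)   # 80x80, 40x40 and 20x20 detection inputs
```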
5. The deep learning-based traffic offence classification detection method of claim 1, wherein the improved YOLOv5s neural network model inputs the labelled feature maps classified as vehicles into a vehicle traffic violation detection model, which performs violation judgment and license plate recognition: the vehicle is judged for traffic violations based on its running track, road surface markings, traffic signs and signal lamp states, and if a traffic violation is established, license plate recognition is further performed; the license plate recognition adopts a trained SVM model.
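A hypothetical sketch of the SVM-based plate recognition in claim 5 follows: segmented character images are described with HOG features and classified by a trained SVM. The HOG parameters, the 20×20 character crops and the character segmentation are illustrative assumptions; the claim states only that a trained SVM model is used.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def char_features(char_img_20x20):
    # HOG descriptor of one segmented plate character (assumed 20x20 grayscale)
    return hog(char_img_20x20, orientations=9, pixels_per_cell=(5, 5),
               cells_per_block=(2, 2))

# train on labelled character crops; random stand-in data for the demo
X = np.stack([char_features(np.random.rand(20, 20)) for _ in range(100)])
y = np.random.randint(0, 10, size=100)          # stand-in labels "0"-"9"
svm = SVC(kernel="rbf").fit(X, y)

plate_chars = [np.random.rand(20, 20) for _ in range(7)]  # segmented plate characters
plate = "".join(str(svm.predict(char_features(c)[None])[0]) for c in plate_chars)
print(plate)
```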
6. The deep learning-based traffic offence classification detection method of claim 1, wherein the improved YOLOv5s neural network model inputs the labelled feature maps classified as pedestrians into a pedestrian traffic violation detection model, which performs violation judgment and face recognition: the pedestrian is judged for traffic violations based on the pedestrian's motion track, road surface markings, traffic signs and signal lamp states, and if a traffic violation is established, face recognition is further performed; the face recognition adopts a trained simplified Retinaface model, which further comprises, arranged in sequence, a trunk feature extraction network, an FPN feature pyramid network, an SSH feature enhancement network and a face target output end.
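A sketch of the SSH feature enhancement stage named in claim 6 follows, modelled on the published RetinaFace context module (parallel 3×3, 5×5 and 7×7 receptive fields built from stacked 3×3 convolutions); the channel widths, the single FPN level and the two-anchor output head are illustrative assumptions, since the claim fixes only the trunk → FPN → SSH → face output order.

```python
import torch
import torch.nn as nn

class SSH(nn.Module):
    """Context module: parallel 3x3, 5x5 (two 3x3) and 7x7 (three 3x3) paths."""
    def __init__(self, c):
        super().__init__()
        self.c3 = nn.Conv2d(c, c // 2, 3, padding=1)
        self.c5a = nn.Conv2d(c, c // 4, 3, padding=1)
        self.c5b = nn.Conv2d(c // 4, c // 4, 3, padding=1)
        self.c7 = nn.Conv2d(c // 4, c // 4, 3, padding=1)
    def forward(self, x):
        a = self.c3(x)
        b = self.c5b(torch.relu(self.c5a(x)))
        c = self.c7(torch.relu(b))
        return torch.relu(torch.cat([a, b, c], dim=1))  # back to c channels

feat = torch.randn(1, 64, 80, 80)     # one FPN level from the trunk network
enhanced = SSH(64)(feat)
head = nn.Conv2d(64, 2 * 2, 1)        # face target output: 2 anchors x (face / not-face)
print(head(enhanced).shape)           # torch.Size([1, 4, 80, 80])
```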
7. A deep learning-based traffic offence behavior classification detection SoC chip, characterized in that the SoC chip comprises a general-purpose processor and a plurality of neural network processors, the general-purpose processor controlling the neural network processors through custom instructions, the neural network processors further comprising: a neural network processor for image classification, a neural network processor for vehicle traffic violation detection, and a neural network processor for pedestrian traffic violation detection, for performing the method of any of claims 1-6.
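A purely illustrative host-side sketch of claim 7's division of labour follows: the general-purpose processor runs classification on one neural network processor, then routes vehicle and pedestrian detections to their dedicated processors. The queue-based dispatch and the NPU class are software assumptions for illustration; the claim specifies control through custom instructions, not a software API.

```python
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class NPU:
    """Stand-in for one on-chip neural network processor."""
    name: str
    queue: Queue = field(default_factory=Queue)

    def run(self, frame):
        # stub detections; on the real chip this is driven by custom instructions
        return [{"label": "vehicle", "crop": frame},
                {"label": "pedestrian", "crop": frame}]

def dispatch(frame, classify_npu, vehicle_npu, pedestrian_npu):
    """General-processor role: classify the frame, then route work to the dedicated NPUs."""
    for det in classify_npu.run(frame):       # improved YOLOv5s classification
        if det["label"] == "vehicle":
            vehicle_npu.queue.put(det)        # violation judgment + SVM plate recognition
        elif det["label"] == "pedestrian":
            pedestrian_npu.queue.put(det)     # violation judgment + simplified Retinaface

dispatch("frame_0", NPU("classifier"), NPU("vehicle"), NPU("pedestrian"))
```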
CN202211280838.1A 2022-10-19 2022-10-19 Deep learning-based traffic offence behavior classification detection method and SoC chip Active CN115601717B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211280838.1A CN115601717B (en) 2022-10-19 2022-10-19 Deep learning-based traffic offence behavior classification detection method and SoC chip


Publications (2)

Publication Number Publication Date
CN115601717A CN115601717A (en) 2023-01-13
CN115601717B true CN115601717B (en) 2023-10-10

Family

ID=84849432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211280838.1A Active CN115601717B (en) 2022-10-19 2022-10-19 Deep learning-based traffic offence behavior classification detection method and SoC chip

Country Status (1)

Country Link
CN (1) CN115601717B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117671572A (en) * 2024-02-02 2024-03-08 深邦智能科技集团(青岛)有限公司 Multi-platform linkage road image model processing system and method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528759A (en) * 2020-11-24 2021-03-19 河海大学 Traffic violation behavior detection method based on computer vision
CN113177443A (en) * 2021-04-13 2021-07-27 深圳市天双科技有限公司 Method for intelligently identifying road traffic violation based on image vision
CN113239858A (en) * 2021-05-28 2021-08-10 西安建筑科技大学 Face detection model training method, face recognition method, terminal and storage medium
CN114120405A (en) * 2021-11-19 2022-03-01 重庆科技学院 Intelligent identification method for unsafe behaviors of oil and gas laboratory
CN114898416A (en) * 2022-01-21 2022-08-12 北方工业大学 Face recognition method and device, electronic equipment and readable storage medium
CN115116032A (en) * 2022-06-15 2022-09-27 南京信息工程大学 Traffic sign detection method based on improved YOLOv5

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on a lightweight forest fire detection algorithm based on YOLOv5s; Pi Jun et al.; Journal of Graphics (图学学报); full text *

Also Published As

Publication number Publication date
CN115601717A (en) 2023-01-13

Similar Documents

Publication Publication Date Title
CN111368687B (en) Sidewalk vehicle illegal parking detection method based on target detection and semantic segmentation
CN107729801B (en) Vehicle color recognition system based on multitask deep convolution neural network
Fang et al. Road-sign detection and tracking
CN110717387B (en) Real-time vehicle detection method based on unmanned aerial vehicle platform
Lin et al. A license plate recognition system for severe tilt angles using mask R-CNN
CN113723377B (en) Traffic sign detection method based on LD-SSD network
CN102855500A (en) Haar and HoG characteristic based preceding car detection method
CN111274942A (en) Traffic cone identification method and device based on cascade network
Shi et al. A vision system for traffic sign detection and recognition
Yonetsu et al. Two-stage YOLOv2 for accurate license-plate detection in complex scenes
CN106919939B (en) A kind of traffic signboard tracks and identifies method and system
CN102163278B (en) Illegal vehicle intruding detection method for bus lane
CN112381101B (en) Infrared road scene segmentation method based on category prototype regression
Zhang et al. Automatic detection of road traffic signs from natural scene images based on pixel vector and central projected shape feature
CN115601717B (en) Deep learning-based traffic offence behavior classification detection method and SoC chip
Mammeri et al. North-American speed limit sign detection and recognition for smart cars
CN115376108A (en) Obstacle detection method and device in complex weather
Omidi et al. An embedded deep learning-based package for traffic law enforcement
Nguwi et al. Number plate recognition in noisy image
CN115376082A (en) Lane line detection method integrating traditional feature extraction and deep neural network
Rahmani et al. IR-LPR: A Large Scale Iranian License Plate Recognition Dataset
Al Khafaji et al. Traffic Signs Detection and Recognition Using A combination of YOLO and CNN
Chaki et al. A framework for LED signboard recognition for the autonomous vehicle management system
CN113111859A (en) License plate deblurring detection method based on deep learning
CN110490116A (en) A kind of far infrared pedestrian detection method of selective search and machine learning classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant