CN110969130A - Driver dangerous action identification method and system based on YOLOV3


Info

Publication number
CN110969130A
CN110969130A (application CN201911220885.5A)
Authority
CN
China
Prior art keywords
face
roi
driver
dangerous
yolov3
Prior art date
Legal status
Granted
Application number
CN201911220885.5A
Other languages
Chinese (zh)
Other versions
CN110969130B (en)
Inventor
袁嘉言
Current Assignee
Xiamen Ruiwei Information Technology Co., Ltd.
Original Assignee
Xiamen Ruiwei Information Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Xiamen Ruiwei Information Technology Co., Ltd.
Priority to CN201911220885.5A
Publication of CN110969130A
Application granted
Publication of CN110969130B
Legal status: Active

Classifications

    • G06V 20/597: Recognising the driver's state or behaviour, e.g. attention or drowsiness (context of the image inside a vehicle)
    • G06F 18/23213: Non-hierarchical clustering using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/24: Classification techniques
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/08: Neural network learning methods
    • G06V 40/166: Human face detection, localisation or normalisation using acquisition arrangements
    • Y02T 10/40: Engine management systems (climate change mitigation in road transport)


Abstract

The invention provides a YOLOV3-based method for recognizing dangerous driver actions. An infrared image of the driver is acquired, the face position is located by a face detection algorithm, and the region to be analyzed for dangerous driver actions is selected according to the face position. The YOLOV3 algorithm then rapidly detects whether a dangerous driver action occurs within this region. If YOLOV3 detects a dangerous action, the detected action region is extracted and classified by a deep learning network to determine which dangerous action the driver is performing. Detections are accumulated over a period of time; if the driver repeatedly performs a dangerous behavior, the driver is reminded to drive safely and the behavior is uploaded to the cloud. The invention also provides a YOLOV3-based driver dangerous action recognition system. The two-stage design makes the prediction result more accurate and greatly reduces false recognition when alarming on dangerous behaviors.

Description

Driver dangerous action identification method and system based on YOLOV3
Technical Field
The invention relates to a driver dangerous action identification method and system based on YOLOV3.
Background
As domestic road infrastructure continues to improve, the number of commercial vehicles on the road keeps growing, and the state pays increasing attention to road safety. Meanwhile, deep learning has developed rapidly in recent years, producing many effective new algorithms. Advances in technology and the demands of application scenarios are driving many high-tech solutions into production, and driver-assistance safety is one direction in which deep learning vision is being deployed. Most traffic accidents are caused by human factors during driving: fatigue, drunk driving, or improper operation can lead to serious accidents and heavy economic losses. Detecting a driver's dangerous actions is therefore very important for effectively reducing traffic accidents. Several representative methods for detecting dangerous driver behavior are described below:
(1) Phone-call recognition based on classification with traditional image algorithms: after a detection algorithm locates the face, a large region around the face is cropped and analyzed directly to decide whether the driver is making a call. A typical traditional machine learning pipeline is: detect the face with the AdaBoost face detection algorithm (AdaBoost, short for Adaptive Boosting, is a machine learning method commonly used for fast face detection); crop a large region around the detected face to build positive and negative phone-call samples; and train an SVM (Support Vector Machine, a traditional machine learning algorithm) on the constructed sample library. At prediction time, the trained SVM model is loaded, the cropped region to be tested is fed in, and the probability of a phone call is output. Advantage of the algorithm: it is fast. Drawback of the algorithm: the learning capacity of traditional machine learning classification is insufficient, so the final accuracy is not high.
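To make this traditional pipeline concrete, a minimal Python sketch follows, using OpenCV's Haar cascade in place of the AdaBoost face detector and scikit-learn's SVC as the SVM; this is an illustration only (the crop factor, feature flattening, and function names are assumptions, not from the patent):

```python
import cv2
import numpy as np
from sklearn.svm import SVC

# Haar cascade as a stand-in for the AdaBoost face detector described above.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_call_region(gray, face, scale=2):
    """Crop an enlarged region around the detected face (the factor is assumed)."""
    x, y, w, h = face
    x0 = max(0, x - w // 2)
    region = gray[y:y + scale * h, x0:x0 + scale * w]
    return cv2.resize(region, (64, 64))

def predict_call_probability(gray, svm: SVC) -> float:
    """Return the SVM's probability that the cropped region shows a phone call.
    The SVM must have been trained with probability=True on flattened crops."""
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return 0.0
    features = crop_call_region(gray, faces[0]).flatten().reshape(1, -1) / 255.0
    return svm.predict_proba(features)[0, 1]
```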
(2) Phone-call recognition based on deep learning classification: given the insufficient learning capacity of the traditional SVM method, which often falls short in complex real-world application scenarios, the classifier is replaced with a CNN (convolutional neural network). When the network is deep enough, a CNN has much stronger feature-extraction capability and can learn phone-call behavior under the varied, complex lighting of actual driving, fitting the data well. Advantages of the algorithm: deep learning can fit large amounts of complex data, learns data features more strongly, and predicts better than an SVM in a big-data setting. Drawback: because classification is performed on a large face-based region, much irrelevant background information is introduced, and unexplained false recognitions easily occur on backgrounds the algorithm was not trained on.
(3) Phone-call recognition based on deep learning detection: classifying an entire input image can misidentify a pure background as a driver making a call even when no hand is near the face, which makes the false recognition hard to understand. Engineers therefore proposed detection-based methods to identify the driver's call action, for example a modified SSD (Single Shot MultiBox Detector, an end-to-end object detection algorithm) that detects the hand-holding-phone action. With a detection method, not only the classification result but also the specific position where the call action occurs is known, so the method learns more concrete action features in the image. Advantage of the algorithm: false detections on pure background are effectively eliminated. Drawback of the algorithm: when a hand is held up, the hand's features are more salient than the phone's, so a hand raised to the ear is easily misrecognized as a call.
With advances in algorithms and hardware, the above methods have been updated and iterated, each generation performing better than the last. However, serious problems remain in real, complex scenes. First, the detection algorithm in (3) can detect the call action, but a hand raised near the ear without a phone is still likely to be misidentified as a call, so the algorithm design is clearly imperfect. Second, drivers perform many dangerous hand actions while driving; detecting only the single behavior of making a call provides insufficient safety.
Disclosure of Invention
The invention aims to provide a method for predicting dangerous behaviors that makes the prediction result more accurate and greatly reduces false recognition of dangerous behaviors.
A first aspect of the invention is realized as follows: a YOLOV3-based driver dangerous action recognition method, comprising the following steps:
step 1, acquiring an infrared image of the driver, detecting the driver's face position with a face detection algorithm, and selecting the region to be analyzed for dangerous driver behavior according to the face position;
step 2, using a YOLOV3 model to detect whether a dangerous driver action occurs in the region to be analyzed; if not, the driver is considered to be driving normally and the process ends; if so, obtaining the YOLOV3 recognition result and the YOLOV3 detection region, and proceeding to step 3;
step 3, classifying the region detected by YOLOV3 with a lightcnn model to judge whether a dangerous driver action occurs; if not, the driver is considered to be driving normally; if so, obtaining the lightcnn recognition result; if the lightcnn recognition result is the same as the YOLOV3 recognition result and this agreement persists for a first set number of consecutive frames, an alarm is raised; if the lightcnn recognition result differs from the YOLOV3 recognition result and this disagreement persists for a second set number of consecutive frames, an alarm is raised.
Further, the method also comprises step 4: uploading the YOLOV3 recognition result and the lightcnn recognition result to the cloud.
Further, step 1 is more specifically: acquiring an infrared image of the driver, detecting the driver's face position with a face detection algorithm, fixing the upper edge of the face and extending a square downward enlarged by a set multiple, cropping that region, and normalizing its size to a set size to obtain the region to be analyzed.
Further, the YOLOV3 algorithm is trained as follows:
collecting driver pictures from real application scenarios; the collected pictures are labeled with 5 dangerous-behavior classes: normal behavior, making a call, chatting on WeChat, drinking water, and touching the ear/face; rectangular boxes are annotated on the occurrence regions of the 4 action classes (calling, WeChat chatting, drinking, and ear/face touching), recording the position and class of each action in the full image; normal behavior requires no box;
performing face detection on each sample-library picture to record the face position (face_x, face_y, face_w, face_h); fixing the upper edge of the face and expanding downward by a set multiple to form the dangerous-behavior analysis region (roi_x, roi_y, roi_w, roi_h), where roi_x = face_x - face_w × (2.5 - 1)/2, roi_y = face_y, roi_w = 2.5 × face_w, and roi_h = 2.5 × face_h; training-sample preprocessing crops (roi_x, roi_y, roi_w, roi_h) directly from the image and normalizes it to a set size; for label preprocessing, the labels learned by YOLOV3 are relative offsets of the box, so the true training labels of a target box are (label_x, label_y, label_w, label_h), where label_x = (box_x - roi_x)/roi_w, label_y = (box_y - roi_y)/roi_h, label_w = box_w/roi_w, and label_h = box_h/roi_h; the class label label_class is 0 for normal behavior, 1 for calling, 2 for WeChat chatting, 3 for drinking, and 4 for ear/face touching;
training the clustering selection of candidate boxes: the ratios of the 6 candidate boxes are clustered with the k-means algorithm, and the 6 boxes are divided between 2 network output scales, i.e. each output scale carries 3 candidate-box ratios; the backbone network is a heavily pruned VGG-MobileNet; the model is trained with a deep learning framework, and the training task is: given an input image, predict whether a dangerous driving behavior is present and, if so, which behavior it is (calling, WeChat chatting, drinking, or ear/face touching) and the precise position where it occurs.
Further, the lightcnn model is trained as follows:
sample collection: sample regions are cropped from the annotated boxes for calling, WeChat chatting, drinking, and ear/face touching, and background samples are randomly cropped from image regions without dangerous behavior; the sample size is normalized to 128 × 128;
model selection and training: the classification model uses the lightcnn network structure with 5 output classes, namely 0 for normal behavior, 1 for calling, 2 for WeChat chatting, 3 for drinking, and 4 for ear/face touching; the training objective is Softmax cross-entropy loss, whose implementation is provided by the caffe framework; model training is complete when the Softmax cross-entropy loss converges stably to a small value.
A second aspect of the invention is realized as follows: a YOLOV3-based driver dangerous action recognition system, comprising:
a recognition-region determining module, which acquires an infrared image of the driver, detects the driver's face position with a face detection algorithm, and selects the region to be analyzed for dangerous driver behavior according to the face position;
a first recognition module, which uses a YOLOV3 model to detect whether a dangerous driver action occurs in the region to be analyzed; if not, the driver is considered to be driving normally and the process ends; if so, the YOLOV3 recognition result and the YOLOV3 detection region are obtained and passed to the recognition and alarm module;
a recognition and alarm module, which classifies the region detected by YOLOV3 with a lightcnn model to judge whether a dangerous driver action occurs; if not, the driver is considered to be driving normally; if so, the lightcnn recognition result is obtained; if the lightcnn recognition result is the same as the YOLOV3 recognition result and this agreement persists for a first set number of consecutive frames, an alarm is raised; if the lightcnn recognition result differs from the YOLOV3 recognition result and this disagreement persists for a second set number of consecutive frames, an alarm is raised.
Further, the system also comprises an uploading module, which uploads the YOLOV3 recognition result and the lightcnn recognition result to the cloud.
Further, the recognition-region determining module is more specifically configured to: acquire an infrared image of the driver, detect the driver's face position with a face detection algorithm, fix the upper edge of the face and extend a square downward enlarged by a set multiple, crop that region, and normalize its size to a set size to obtain the region to be analyzed.
Further, the YOLOV3 algorithm is trained as follows:
collecting driver pictures from real application scenarios; the collected pictures are labeled with 5 dangerous-behavior classes: normal behavior, making a call, chatting on WeChat, drinking water, and touching the ear/face; rectangular boxes are annotated on the occurrence regions of the 4 action classes, recording the position and class of each action in the full image; normal behavior requires no box;
performing face detection on each sample-library picture to record the face position (face_x, face_y, face_w, face_h); fixing the upper edge of the face and expanding downward by a set multiple to form the dangerous-behavior analysis region (roi_x, roi_y, roi_w, roi_h), where roi_x = face_x - face_w × (2.5 - 1)/2, roi_y = face_y, roi_w = 2.5 × face_w, and roi_h = 2.5 × face_h; training-sample preprocessing crops (roi_x, roi_y, roi_w, roi_h) directly from the image and normalizes it to a set size; for label preprocessing, the labels learned by YOLOV3 are relative offsets of the box, so the true training labels of a target box are (label_x, label_y, label_w, label_h), where label_x = (box_x - roi_x)/roi_w, label_y = (box_y - roi_y)/roi_h, label_w = box_w/roi_w, and label_h = box_h/roi_h; the class label label_class is 0 for normal behavior, 1 for calling, 2 for WeChat chatting, 3 for drinking, and 4 for ear/face touching;
training the clustering selection of candidate boxes: the ratios of the 6 candidate boxes are clustered with the k-means algorithm, and the 6 boxes are divided between 2 network output scales, i.e. each output scale carries 3 candidate-box ratios; the backbone network is a heavily pruned VGG-MobileNet; the model is trained with a deep learning framework, and the training task is: given an input image, predict whether a dangerous driving behavior is present and, if so, which behavior it is (calling, WeChat chatting, drinking, or ear/face touching) and the precise position where it occurs.
Further, the lightcnn model is trained as follows:
sample collection: sample regions are cropped from the annotated boxes for calling, WeChat chatting, drinking, and ear/face touching, and background samples are randomly cropped from image regions without dangerous behavior; the sample size is normalized to 128 × 128;
model selection and training: the classification model uses the lightcnn network structure with 5 output classes, namely 0 for normal behavior, 1 for calling, 2 for WeChat chatting, 3 for drinking, and 4 for ear/face touching; the training objective is Softmax cross-entropy loss, whose implementation is provided by the caffe framework; model training is complete when the Softmax cross-entropy loss converges stably to a small value.
The invention has the following advantages:
the method has the advantages that a YOLOV3 detection architecture is introduced to detect dangerous actions of a driver, the detection range is wide, multiple dangerous actions such as calling, chatting, drinking, grabbing ears and touching faces of the driver can be detected at the same time, instead of detecting one action by one model, one model can be used for detecting multiple actions, and meanwhile, the design complexity of a system can be reduced, and the use of resources of the system can be reduced; and secondly, introducing a fine quadratic verification lightcnn classification network, intercepting and classifying the region detected by YOLOV3, namely enabling a lightcnn fine classification model to pay more specific attention to the fine features of the region of interest, so that the prediction result is more accurate and the error identification of the alarm dangerous behavior can be greatly reduced.
Drawings
The invention is further described below with reference to embodiments and the accompanying drawings.
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic view of the YOLOV3 prediction flow;
FIG. 3 is a schematic view of the lightcnn verification-classification prediction flow;
FIG. 4 is a schematic flow chart of the whole process of YOLOV3 detection combined with lightcnn secondary classification for predicting dangerous driver behavior.
Detailed Description
As shown in FIG. 1, the YOLOV3-based driver dangerous action recognition method of the invention comprises:
step 1, acquiring an infrared image of the driver, detecting the driver's face position with a face detection algorithm, fixing the upper edge of the face and extending a square downward enlarged by a set multiple, cropping that region, and normalizing its size to a set size to obtain the region to be analyzed;
step 2, using a YOLOV3 model to detect whether a dangerous driver action occurs in the region to be analyzed; if not, the driver is considered to be driving normally and the process ends; if so, obtaining the YOLOV3 recognition result and the YOLOV3 detection region (the YOLOV3 detection region is the region found by detection within the region to be analyzed and contains the image of the driver's action), and proceeding to step 3;
step 3, classifying the YOLOV3 detection region with a lightcnn model to judge whether a dangerous driver action occurs; if not, the driver is considered to be driving normally; if so, obtaining the lightcnn recognition result; if the lightcnn recognition result is the same as the YOLOV3 recognition result and this agreement persists for a first set number of consecutive frames, an alarm is raised; if the lightcnn recognition result differs from the YOLOV3 recognition result and this disagreement persists for a second set number of consecutive frames, an alarm is raised;
step 4, uploading the YOLOV3 recognition result and the lightcnn recognition result to the cloud.
The YOLOV3 algorithm is trained as follows:
collecting driver pictures from real application scenarios; the collected pictures are labeled with 5 dangerous-behavior classes: normal behavior, making a call, chatting on WeChat, drinking water, and touching the ear/face; rectangular boxes are annotated on the occurrence regions of the 4 action classes, recording the position and class of each action in the full image; normal behavior requires no box;
performing face detection on each sample-library picture to record the face position (face_x, face_y, face_w, face_h); fixing the upper edge of the face and expanding downward by a set multiple to form the dangerous-behavior analysis region (roi_x, roi_y, roi_w, roi_h), where roi_x = face_x - face_w × (2.5 - 1)/2, roi_y = face_y, roi_w = 2.5 × face_w, and roi_h = 2.5 × face_h; training-sample preprocessing crops (roi_x, roi_y, roi_w, roi_h) directly from the image and normalizes it to a set size; for label preprocessing, the labels learned by YOLOV3 are relative offsets of the box, so the true training labels of a target box are (label_x, label_y, label_w, label_h), where label_x = (box_x - roi_x)/roi_w, label_y = (box_y - roi_y)/roi_h, label_w = box_w/roi_w, and label_h = box_h/roi_h; the class label label_class is 0 for normal behavior, 1 for calling, 2 for WeChat chatting, 3 for drinking, and 4 for ear/face touching;
training the clustering selection of candidate boxes: the ratios of the 6 candidate boxes are clustered with the k-means algorithm, and the 6 boxes are divided between 2 network output scales, i.e. each output scale carries 3 candidate-box ratios; the backbone network is a heavily pruned VGG-MobileNet (a classical network structure); the model is trained with a deep learning framework, and the training task is: given an input image, predict whether a dangerous driving behavior is present and, if so, which behavior it is (calling, WeChat chatting, drinking, or ear/face touching) and the precise position where it occurs.
The lightcnn model is trained as follows:
sample collection: sample regions are cropped from the annotated boxes for calling, WeChat chatting, drinking, and ear/face touching, and background samples are randomly cropped from image regions without dangerous behavior; the sample size is normalized to 128 × 128;
model selection and training: the classification model uses the lightcnn network structure with 5 output classes, namely 0 for normal behavior, 1 for calling, 2 for WeChat chatting, 3 for drinking, and 4 for ear/face touching; the training objective is Softmax cross-entropy loss, whose implementation is provided by the caffe framework; model training is complete when the Softmax cross-entropy loss converges stably to a small value.
The YOLOV3-based driver dangerous action recognition system of the invention comprises:
a recognition-region determining module, which acquires an infrared image of the driver, detects the driver's face position with a face detection algorithm, fixes the upper edge of the face and extends a square downward enlarged by a set multiple, crops that region, and normalizes its size to a set size to obtain the region to be analyzed;
a first recognition module, which uses a YOLOV3 model to detect whether a dangerous driver action occurs in the region to be analyzed; if not, the driver is considered to be driving normally and the process ends; if so, the YOLOV3 recognition result and the YOLOV3 detection region are obtained and passed to the recognition and alarm module;
a recognition and alarm module, which classifies the YOLOV3 detection region with a lightcnn model to judge whether a dangerous driver action occurs; if not, the driver is considered to be driving normally; if so, the lightcnn recognition result is obtained; if the lightcnn recognition result is the same as the YOLOV3 recognition result and this agreement persists for a first set number of consecutive frames, an alarm is raised; if the lightcnn recognition result differs from the YOLOV3 recognition result and this disagreement persists for a second set number of consecutive frames, an alarm is raised;
an uploading module, which uploads the YOLOV3 recognition result and the lightcnn recognition result to the cloud.
The YOLOV3 algorithm is trained as follows:
collecting driver pictures from real application scenarios; the collected pictures are labeled with 5 dangerous-behavior classes: normal behavior, making a call, chatting on WeChat, drinking water, and touching the ear/face; rectangular boxes are annotated on the occurrence regions of the 4 action classes, recording the position and class of each action in the full image; normal behavior requires no box;
performing face detection on each sample-library picture to record the face position (face_x, face_y, face_w, face_h); fixing the upper edge of the face and expanding downward by a set multiple to form the dangerous-behavior analysis region (roi_x, roi_y, roi_w, roi_h), where roi_x = face_x - face_w × (2.5 - 1)/2, roi_y = face_y, roi_w = 2.5 × face_w, and roi_h = 2.5 × face_h; training-sample preprocessing crops (roi_x, roi_y, roi_w, roi_h) directly from the image and normalizes it to a set size; for label preprocessing, the labels learned by YOLOV3 are relative offsets of the box, so the true training labels of a target box are (label_x, label_y, label_w, label_h), where label_x = (box_x - roi_x)/roi_w, label_y = (box_y - roi_y)/roi_h, label_w = box_w/roi_w, and label_h = box_h/roi_h; the class label label_class is 0 for normal behavior, 1 for calling, 2 for WeChat chatting, 3 for drinking, and 4 for ear/face touching;
training the clustering selection of candidate boxes: the ratios of the 6 candidate boxes are clustered with the k-means algorithm, and the 6 boxes are divided between 2 network output scales, i.e. each output scale carries 3 candidate-box ratios; the backbone network is a heavily pruned VGG-MobileNet (a classical network structure); the model is trained with a deep learning framework, and the training task is: given an input image, predict whether a dangerous driving behavior is present and, if so, which behavior it is (calling, WeChat chatting, drinking, or ear/face touching) and the precise position where it occurs.
The lightcnn model is trained as follows:
sample collection: sample regions are cropped from the annotated boxes for calling, WeChat chatting, drinking, and ear/face touching, and background samples are randomly cropped from image regions without dangerous behavior; the sample size is normalized to 128 × 128;
model selection and training: the classification model uses the lightcnn network structure with 5 output classes, namely 0 for normal behavior, 1 for calling, 2 for WeChat chatting, 3 for drinking, and 4 for ear/face touching; the training objective is Softmax cross-entropy loss, whose implementation is provided by the caffe framework; model training is complete when the Softmax cross-entropy loss converges stably to a small value.
A specific embodiment of the invention:
The two-stage driver dangerous action recognition method based on YOLOV3 detection plus classification mainly comprises three parts: first, designing and training, on the deep learning YOLOV3 framework, a detection algorithm for multiple dangerous driver behaviors such as calling, chatting on WeChat, drinking, and ear/face touching; second, organizing the annotated calling, WeChat chatting, drinking, and ear/face-touching regions, together with falsely-detected background regions, into 5 sample classes and training a fine CNN verification-classification network; third, using the YOLOV3 multi-action detection algorithm in cooperation with the fine CNN verification-classification network to comprehensively predict the driver's dangerous behaviors. These three parts are described in detail below.
First, the flow of the detection algorithm for the four dangerous driver behaviors (calling, WeChat chatting, drinking, and ear/face touching), based on the deep learning YOLOV3 framework, mainly comprises:
(1) Collecting driver pictures from real application scenarios (different lighting, time periods, devices, drivers, actions, etc.). The collected pictures are labeled with 5 dangerous-behavior classes: normal behavior, calling, WeChat chatting, drinking, and ear/face touching; rectangular boxes are annotated on the occurrence regions of the 4 action classes, recording the position and class of each action in the full image; normal behavior requires no box.
(2) Preprocessing of samples and labels. Face detection is run on each sample-library picture to record (face_x, face_y, face_w, face_h); the upper edge of the face is fixed and the box is expanded downward to form the dangerous-behavior analysis region (roi_x, roi_y, roi_w, roi_h), where roi_x = face_x - face_w × (2.5 - 1)/2, roi_y = face_y, roi_w = 2.5 × face_w, and roi_h = 2.5 × face_h. Training-sample preprocessing crops (roi_x, roi_y, roi_w, roi_h) directly from the image and normalizes its size to 256 × 256. For label preprocessing, the labels learned by YOLOV3 are relative offsets of the box, so the true training labels of a target box are (label_x, label_y, label_w, label_h), where label_x = (box_x - roi_x)/roi_w, label_y = (box_y - roi_y)/roi_h, label_w = box_w/roi_w, and label_h = box_h/roi_h; the class label label_class is 0 for normal behavior, 1 for calling, 2 for WeChat chatting, 3 for drinking, and 4 for ear/face touching.
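A minimal Python sketch of this preprocessing, assuming boxes are plain (x, y, w, h) pixel tuples (the helper names are illustrative, not from the patent):

```python
import cv2

SCALE = 2.5  # face-to-ROI expansion multiple used by the patent

def face_to_roi(face_x, face_y, face_w, face_h):
    """Fix the upper edge of the face and expand a region downward by SCALE."""
    roi_x = face_x - face_w * (SCALE - 1) / 2
    roi_y = face_y
    roi_w = SCALE * face_w
    roi_h = SCALE * face_h
    return roi_x, roi_y, roi_w, roi_h

def box_to_label(box, roi):
    """Convert an absolute action box into YOLOV3 relative-offset labels."""
    box_x, box_y, box_w, box_h = box
    roi_x, roi_y, roi_w, roi_h = roi
    return ((box_x - roi_x) / roi_w,
            (box_y - roi_y) / roi_h,
            box_w / roi_w,
            box_h / roi_h)

def crop_and_normalize(img, roi, size=256):
    """Crop the analysis region from the image and resize it to 256 x 256."""
    x, y, w, h = (int(round(v)) for v in roi)
    x, y = max(x, 0), max(y, 0)
    return cv2.resize(img[y:y + h, x:x + w], (size, size))
```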
(3) Training of the YOLOV3 algorithm. For the clustering selection of candidate boxes, the ratios of the 6 candidate boxes are clustered with the k-means algorithm (an iteratively solved cluster-analysis algorithm); the 6 boxes are divided between 2 network output scales, i.e. each output scale carries 3 candidate-box ratios. The backbone is a heavily pruned VGG-MobileNet (a classical network structure); the pruned network's parameters total about 1 M, and the pruned structure is not disclosed. Model training uses the caffe deep learning framework; the training task is: given a 256 × 256 input image, predict whether a dangerous driving behavior is present and, if so, which behavior it is (calling, WeChat chatting, drinking, or ear/face touching) and the precise position where it occurs, as shown in FIG. 2.
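The candidate-box selection can be sketched as standard k-means over the annotated box sizes with k = 6, split across the two output scales. The patent does not specify the distance metric, so plain Euclidean k-means via scikit-learn is assumed here (YOLO implementations often use an IoU-based distance instead):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_candidate_boxes(box_whs: np.ndarray, k: int = 6):
    """box_whs: (N, 2) array of relative (label_w, label_h) pairs from annotation.
    Returns two groups of 3 anchors, one per network output scale."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(box_whs)
    anchors = km.cluster_centers_
    # Sort by area: the 3 smallest anchors go to the finer output scale,
    # the 3 largest to the coarser one.
    anchors = anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
    return anchors[:3], anchors[3:]
```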
Second, the fine secondary-verification classification model is designed as follows:
(1) Sample collection. Sample regions are cropped from the annotated boxes for calling, WeChat chatting, drinking, and ear/face touching, and background samples are randomly cropped from image regions without dangerous behavior. The sample size is normalized to 128 × 128; the background label is 0, calling is 1, WeChat chatting is 2, drinking is 3, and ear/face touching is 4.
(2) Model selection and training. The classification model uses a lightcnn network structure (an existing, typical classification network), simplified relative to the original; the network outputs 5 classes: background 0, calling 1, WeChat chatting 2, drinking 3, and ear/face touching 4. The purpose of the fine secondary-verification algorithm is to classify the dangerous-action regions predicted by YOLOV3 more accurately; balancing the loss of discriminative detail at small input sizes against prediction speed, the network input is finally fixed at 128 × 128. Because the verification network is a multi-class classifier, it learns directly with Softmax cross-entropy loss, a typical loss function for multi-class problems whose implementation is provided by the caffe framework. Model training is complete when the Softmax cross-entropy loss converges stably to a small value. The prediction flow of the fine secondary-verification classification model is shown in FIG. 3.
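As a rough illustration (the patent trains in caffe and does not disclose the simplified lightcnn layers, so this small PyTorch stand-in is an assumption), a 5-class CNN over 128 × 128 crops trained with Softmax cross-entropy could look like:

```python
import torch
import torch.nn as nn

class TinyVerifier(nn.Module):
    """Illustrative stand-in for the simplified lightcnn verifier (5 classes)."""
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 128 -> 64
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = TinyVerifier()
criterion = nn.CrossEntropyLoss()  # Softmax cross-entropy, as in the patent
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One illustrative training step on a dummy batch of single-channel infrared crops:
images = torch.randn(8, 1, 128, 128)
labels = torch.randint(0, 5, (8,))  # 0=background, 1=call, 2=chat, 3=drink, 4=ear/face
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```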
Third, as shown in FIG. 4, the YOLOV3 multi-action detection algorithm is used in cooperation with the fine CNN verification-classification network to comprehensively predict the driver's dangerous behaviors. The prediction process comprises the following 4 steps:
(1) An infrared image of the driver is acquired, and a face detection algorithm locates the driver's face. The region to be analyzed for dangerous behavior is determined from the real-time face position: the upper edge of the face is fixed and a square enlarged 2.5 times is extended downward; the region is cropped and its size normalized to 256 × 256.
(2) The trained YOLOV3 dangerous-behavior detection model is run on the normalized image block to predict whether a dangerous behavior occurs in it. If a dangerous behavior occurs, the model predicts which behavior it is (calling, WeChat chatting, drinking, or ear/face touching) and outputs the probability of the predicted behavior; otherwise the driver is considered to be driving normally.
(3) If YOLOV3 predicts calling, WeChat chatting, drinking, or ear/face touching, the program should not raise an alarm immediately, because the hand information that YOLOV3 learns as an important feature easily causes false detections of these behaviors. The regions detected by YOLOV3 are therefore verified by the finer secondary classifier, and the simplified lightcnn outputs its classification probability.
(4) The alarm type is determined by combining the YOLOV3 and lightcnn classification results, as sketched below. If YOLOV3 detects a dangerous behavior and lightcnn's secondary verification of the detected region predicts the same behavior, and the two predictions coincide for 5 consecutive frames, the dangerous behavior is considered to have occurred. If YOLOV3 detects one dangerous behavior but lightcnn's verification predicts a different dangerous behavior, and this persists for 10 consecutive frames, the alarm reports the dangerous behavior predicted by lightcnn's verification. If YOLOV3 detects a dangerous behavior but lightcnn's verification predicts no dangerous behavior, the driver is considered to be driving normally.
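A condensed sketch of this two-stage decision, assuming per-frame class predictions are integers (0 = normal/background, 1-4 = the dangerous classes) and the 5-frame and 10-frame thresholds above; the counter bookkeeping is illustrative, not from the patent:

```python
AGREE_FRAMES = 5      # first set value: YOLOV3 and lightcnn agree
DISAGREE_FRAMES = 10  # second set value: lightcnn overrides YOLOV3

class TwoStageAlarm:
    """Fuse per-frame YOLOV3 and lightcnn class predictions into an alarm."""
    def __init__(self):
        self.agree = 0
        self.disagree = 0

    def update(self, yolo_cls: int, lightcnn_cls: int):
        """Return the dangerous class to alarm on, or None."""
        if yolo_cls == 0 or lightcnn_cls == 0:
            # YOLOV3 saw nothing, or lightcnn rejected the region: normal driving.
            self.agree = self.disagree = 0
            return None
        if yolo_cls == lightcnn_cls:
            self.agree += 1
            self.disagree = 0
            if self.agree >= AGREE_FRAMES:
                return yolo_cls          # alarm on the agreed behavior
        else:
            self.disagree += 1
            self.agree = 0
            if self.disagree >= DISAGREE_FRAMES:
                return lightcnn_cls      # alarm on lightcnn's verified behavior
        return None
```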
Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims (10)

1. A YOLOV3-based driver dangerous action recognition method, characterized in that it comprises the following steps:
step 1, acquiring an infrared image of the driver, detecting the driver's face position with a face detection algorithm, and selecting the region to be analyzed for dangerous driver behavior according to the face position;
step 2, using a YOLOV3 model to detect whether a dangerous driver action occurs in the region to be analyzed; if not, the driver is considered to be driving normally and the process ends; if so, obtaining the YOLOV3 recognition result and the YOLOV3 detection region, and proceeding to step 3;
step 3, classifying the YOLOV3 detection region with a lightcnn model to judge whether a dangerous driver action occurs; if not, the driver is considered to be driving normally; if so, obtaining the lightcnn recognition result; if the lightcnn recognition result is the same as the YOLOV3 recognition result and this agreement persists for a first set number of consecutive frames, raising an alarm; if the lightcnn recognition result differs from the YOLOV3 recognition result and this disagreement persists for a second set number of consecutive frames, raising an alarm.
2. The YOLOV3-based driver dangerous action recognition method as claimed in claim 1, wherein the method further comprises step 4: uploading the YOLOV3 recognition result and the lightcnn recognition result to the cloud.
3. The YOLOV3-based driver dangerous action recognition method as claimed in claim 1, wherein step 1 is more specifically: acquiring an infrared image of the driver, detecting the driver's face position with a face detection algorithm, fixing the upper edge of the face and extending a square downward enlarged by a set multiple, cropping that region, and normalizing its size to a set size to obtain the region to be analyzed.
4. The YOLOV3-based driver dangerous action recognition method as claimed in claim 1, wherein the YOLOV3 algorithm is trained as follows:
collecting driver pictures from real application scenarios; the collected pictures are labeled with 5 dangerous-behavior classes: normal behavior, making a call, chatting on WeChat, drinking water, and touching the ear/face; rectangular boxes are annotated on the occurrence regions of the 4 action classes, recording the position and class of each action in the full image; normal behavior requires no box;
performing face detection on each sample-library picture to record the face position (face_x, face_y, face_w, face_h); fixing the upper edge of the face and expanding downward by a set multiple to form the dangerous-behavior analysis region (roi_x, roi_y, roi_w, roi_h), where roi_x = face_x - face_w × (2.5 - 1)/2, roi_y = face_y, roi_w = 2.5 × face_w, and roi_h = 2.5 × face_h; training-sample preprocessing crops (roi_x, roi_y, roi_w, roi_h) directly from the image and normalizes it to a set size; for label preprocessing, the labels learned by YOLOV3 are relative offsets of the box, so the true training labels of a target box are (label_x, label_y, label_w, label_h), where label_x = (box_x - roi_x)/roi_w, label_y = (box_y - roi_y)/roi_h, label_w = box_w/roi_w, and label_h = box_h/roi_h; the class label label_class is 0 for normal behavior, 1 for calling, 2 for WeChat chatting, 3 for drinking, and 4 for ear/face touching;
training the clustering selection of candidate boxes: the ratios of the 6 candidate boxes are clustered with the k-means algorithm, and the 6 boxes are divided between 2 network output scales, i.e. each output scale carries 3 candidate-box ratios; the backbone network is a heavily pruned VGG-MobileNet; the model is trained with a deep learning framework, and the training task is: given an input image, predict whether a dangerous driving behavior is present and, if so, which behavior it is (calling, WeChat chatting, drinking, or ear/face touching) and the precise position where it occurs.
5. The YOLOV3-based driver dangerous action recognition method as claimed in claim 1, wherein the lightcnn model is trained as follows:
sample collection: sample regions are cropped from the annotated boxes for calling, WeChat chatting, drinking, and ear/face touching, and background samples are randomly cropped from image regions without dangerous behavior; the sample size is normalized to 128 × 128;
model selection and training: the classification model uses the lightcnn network structure with 5 output classes, namely 0 for normal behavior, 1 for calling, 2 for WeChat chatting, 3 for drinking, and 4 for ear/face touching; the training objective is Softmax cross-entropy loss, whose implementation is provided by the caffe framework; model training is complete when the Softmax cross-entropy loss converges stably to a small value.
6. A YOLOV3-based driver dangerous action recognition system, characterized in that it comprises:
a recognition-region determining module, which acquires an infrared image of the driver, detects the driver's face position with a face detection algorithm, and selects the region to be analyzed for dangerous driver behavior according to the face position;
a first recognition module, which uses a YOLOV3 model to detect whether a dangerous driver action occurs in the region to be analyzed; if not, the driver is considered to be driving normally and the process ends; if so, the YOLOV3 recognition result and the YOLOV3 detection region are obtained and passed to the recognition and alarm module;
a recognition and alarm module, which classifies the region detected by YOLOV3 with a lightcnn model to judge whether a dangerous driver action occurs; if not, the driver is considered to be driving normally; if so, the lightcnn recognition result is obtained; if the lightcnn recognition result is the same as the YOLOV3 recognition result and this agreement persists for a first set number of consecutive frames, an alarm is raised; if the lightcnn recognition result differs from the YOLOV3 recognition result and this disagreement persists for a second set number of consecutive frames, an alarm is raised.
7. The YOLOV3-based driver dangerous action recognition system of claim 6, wherein the system further comprises an uploading module, which uploads the YOLOV3 recognition result and the lightcnn recognition result to the cloud.
8. The YOLOV3-based driver dangerous action recognition system of claim 6, wherein the recognition-region determining module is more specifically configured to: acquire an infrared image of the driver, detect the driver's face position with a face detection algorithm, fix the upper edge of the face and extend a square downward enlarged by a set multiple, crop that region, and normalize its size to a set size to obtain the region to be analyzed.
9. The YOLOV3-based driver dangerous action recognition system of claim 6, wherein the YOLOV3 algorithm is trained as follows:
collecting driver pictures from real application scenarios; the collected pictures are labeled with 5 dangerous-behavior classes: normal behavior, making a call, chatting on WeChat, drinking water, and touching the ear/face; rectangular boxes are annotated on the occurrence regions of the 4 action classes, recording the position and class of each action in the full image; normal behavior requires no box;
performing face detection on each sample-library picture to record the face position (face_x, face_y, face_w, face_h); fixing the upper edge of the face and expanding downward by a set multiple to form the dangerous-behavior analysis region (roi_x, roi_y, roi_w, roi_h), where roi_x = face_x - face_w × (2.5 - 1)/2, roi_y = face_y, roi_w = 2.5 × face_w, and roi_h = 2.5 × face_h; training-sample preprocessing crops (roi_x, roi_y, roi_w, roi_h) directly from the image and normalizes it to a set size; for label preprocessing, the labels learned by YOLOV3 are relative offsets of the box, so the true training labels of a target box are (label_x, label_y, label_w, label_h), where label_x = (box_x - roi_x)/roi_w, label_y = (box_y - roi_y)/roi_h, label_w = box_w/roi_w, and label_h = box_h/roi_h; the class label label_class is 0 for normal behavior, 1 for calling, 2 for WeChat chatting, 3 for drinking, and 4 for ear/face touching;
training the clustering selection of candidate boxes: the ratios of the 6 candidate boxes are clustered with the k-means algorithm, and the 6 boxes are divided between 2 network output scales, i.e. each output scale carries 3 candidate-box ratios; the backbone network is a heavily pruned VGG-MobileNet; the model is trained with a deep learning framework, and the training task is: given an input image, predict whether a dangerous driving behavior is present and, if so, which behavior it is (calling, WeChat chatting, drinking, or ear/face touching) and the precise position where it occurs.
10. The YOLOV3-based driver dangerous action recognition system of claim 6, wherein the lightcnn model is trained as follows:
sample collection: sample regions are cropped from the annotated boxes for calling, WeChat chatting, drinking, and ear/face touching, and background samples are randomly cropped from image regions without dangerous behavior; the sample size is normalized to 128 × 128;
model selection and training: the classification model uses the lightcnn network structure with 5 output classes, namely 0 for normal behavior, 1 for calling, 2 for WeChat chatting, 3 for drinking, and 4 for ear/face touching; the training objective is Softmax cross-entropy loss, whose implementation is provided by the caffe framework; model training is complete when the Softmax cross-entropy loss converges stably to a small value.
Priority Applications (1)

CN201911220885.5A (priority and filing date 2019-12-03): Driver dangerous action identification method and system based on YOLOV3; status: Active, granted as CN110969130B

Publications (2)

CN110969130A: published 2020-04-07
CN110969130B: granted, published 2023-04-18

Family ID: 70032784
Country: China (CN)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6227862B1 (en) * 1999-02-12 2001-05-08 Advanced Drivers Education Products And Training, Inc. Driver training system
CN107766876A (en) * 2017-09-19 2018-03-06 平安科技(深圳)有限公司 Driving model training method, driver's recognition methods, device, equipment and medium
CN109460699A (en) * 2018-09-03 2019-03-12 厦门瑞为信息技术有限公司 A kind of pilot harness's wearing recognition methods based on deep learning
CN109829386A (en) * 2019-01-04 2019-05-31 清华大学 Intelligent vehicle based on Multi-source Information Fusion can traffic areas detection method

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3961498A4 (en) * 2020-06-29 2023-05-24 Beijing Baidu Netcom Science And Technology Co., Ltd. Dangerous driving behavior recognition method and apparatus, and electronic device and storage medium
CN111814637A (en) * 2020-06-29 2020-10-23 北京百度网讯科技有限公司 Dangerous driving behavior recognition method and device, electronic equipment and storage medium
CN111832450A (en) * 2020-06-30 2020-10-27 成都睿沿科技有限公司 Knife holding detection method based on image recognition
CN111832450B (en) * 2020-06-30 2023-11-28 成都睿沿科技有限公司 Knife holding detection method based on image recognition
CN114093019A (en) * 2020-07-29 2022-02-25 顺丰科技有限公司 Training method and device for throwing motion detection model and computer equipment
CN112699750A (en) * 2020-12-22 2021-04-23 南方电网深圳数字电网研究院有限公司 Safety monitoring method and system for intelligent gas station based on edge calculation and AI (Artificial Intelligence)
CN113033374A (en) * 2021-03-22 2021-06-25 开放智能机器(上海)有限公司 Artificial intelligence dangerous behavior identification method and device, electronic equipment and storage medium
WO2023273060A1 (en) * 2021-06-30 2023-01-05 上海商汤临港智能科技有限公司 Dangerous action identifying method and apparatus, electronic device, and storage medium
CN113505709A (en) * 2021-07-15 2021-10-15 开放智能机器(上海)有限公司 Method and system for monitoring dangerous behaviors of human body in real time
CN114266934A (en) * 2021-12-10 2022-04-01 上海应用技术大学 Dangerous action detection method based on cloud storage data
CN114724246B (en) * 2022-04-11 2024-01-30 中国人民解放军东部战区总医院 Dangerous behavior identification method and device
CN114724246A (en) * 2022-04-11 2022-07-08 中国人民解放军东部战区总医院 Dangerous behavior identification method and device
CN115546875A (en) * 2022-11-07 2022-12-30 科大讯飞股份有限公司 Multitask-based cabin internal behavior detection method, device and equipment
TWI831524B (en) * 2022-12-15 2024-02-01 國立勤益科技大學 System and method for abnormal driving behavior detection based on spatial-temporal relationship between objects
CN117671592A (en) * 2023-12-08 2024-03-08 中化现代农业有限公司 Dangerous behavior detection method, dangerous behavior detection device, electronic equipment and storage medium
CN117671592B (en) * 2023-12-08 2024-09-06 中化现代农业有限公司 Dangerous behavior detection method, dangerous behavior detection device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110969130B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN110969130B (en) Driver dangerous action identification method and system based on YOLOV3
CN110796046B (en) Intelligent steel slag detection method and system based on convolutional neural network
CN102332096B (en) Video caption text extraction and identification method
JP5127392B2 (en) Classification boundary determination method and classification boundary determination apparatus
CN101221623B (en) Object type on-line training and recognizing method and system thereof
US11335086B2 (en) Methods and electronic devices for automated waste management
CN113642474A (en) Hazardous area personnel monitoring method based on YOLOV5
CN110135327B (en) Driver behavior identification method based on multi-region feature learning model
Saleh et al. Traffic signs recognition and distance estimation using a monocular camera
CN102360434B (en) Target classification method of vehicle and pedestrian in intelligent traffic monitoring
TW200529093A (en) Face image detection method, face image detection system, and face image detection program
CN112733815B (en) Traffic light identification method based on RGB outdoor road scene image
CN115131590B (en) Training method of target detection model, target detection method and related equipment
WO2023241102A1 (en) Label recognition method and apparatus, and electronic device and storage medium
Satti et al. R‐ICTS: Recognize the Indian cautionary traffic signs in real‐time using an optimized adaptive boosting cascade classifier and a convolutional neural network
CN116152576B (en) Image processing method, device, equipment and storage medium
CN117372956A (en) Method and device for detecting state of substation screen cabinet equipment
Ciuntu et al. Real-time traffic sign detection and classification using machine learning and optical character recognition
CN117037081A (en) Traffic monitoring method, device, equipment and medium based on machine learning
CN104573663B (en) A kind of English scene character recognition method based on distinctive stroke storehouse
CN116258908A (en) Ground disaster prediction evaluation classification method based on unmanned aerial vehicle remote sensing image data
Rao et al. Convolutional Neural Network Model for Traffic Sign Recognition
CN113920327A (en) Insulator target identification method based on improved Faster Rcnn
Sharma et al. Smart vehicle accident detection system using faster r-cnn
CN112232124A (en) Crowd situation analysis method, video processing device and device with storage function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant