CN110674701A - Driver fatigue state rapid detection method based on deep learning - Google Patents

Driver fatigue state rapid detection method based on deep learning

Info

Publication number
CN110674701A
CN110674701A (application CN201910824958.5A)
Authority
CN
China
Prior art keywords
face
network
fatigue
deep learning
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910824958.5A
Other languages
Chinese (zh)
Inventor
路小波 (Lu Xiaobo)
张晨 (Zhang Chen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201910824958.5A
Publication of CN110674701A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a deep-learning-based method for rapidly detecting a driver's fatigue state, comprising the following steps: (1) collect a color image of the driver while driving, detect the driver's face in the image with a deep learning method, and mark it with a regression box; (2) feed the face bounding regression box into a multi-task learning network, which outputs facial key points and the head pose angles; (3) build a spatio-temporal fatigue feature sequence from the facial key points and head pose angles, feed the sequence into a fatigue recognition deep learning network, and output the fatigue state recognition result. The method addresses both the real-time and accuracy requirements of driver fatigue state detection: by compressing and optimizing the deep learning network models, it designs corresponding optimizations that minimize network size and maximize algorithm speed while preserving accuracy.

Description

Driver fatigue state rapid detection method based on deep learning
Technical Field
The invention belongs to the field of pattern recognition and relates to a method for rapidly detecting a driver's fatigue state based on deep learning.
Background
Research shows that fatigue driving is one of the main causes of road traffic accidents, so research on fatigue detection algorithms is of great significance for improving road traffic safety. In recent years, as road safety has received increasing attention, driver fatigue state detection has become a hot research topic, and various enterprises and research institutes have developed many different detection schemes. Traditional fatigue detection methods suffer from poor real-time performance, the need for contact with the driver's limbs, and low robustness, and therefore have not seen wide adoption. Meanwhile, with the emergence of high-performance GPUs and the development of artificial intelligence chips, deep learning has advanced rapidly in the image field and performs very well across many domains, making it feasible to deploy deep learning methods on embedded platforms.
The invention provides a deep-learning-based algorithm for rapidly detecting a driver's fatigue state, intended mainly for driver fatigue detection scenarios. On the premise of preserving detection accuracy, it greatly reduces the size of the neural networks and accelerates detection, so the algorithm can be ported to low-compute embedded platforms. The algorithm is divided into three parts: a face detection algorithm, a facial key point and head pose angle detection algorithm, and a fatigue state detection algorithm.
Face detection is a key link in fatigue driving detection: it obtains the position and size of the face in the image and serves the subsequent fatigue recognition. Facial key point detection is likewise essential, since incorrect key point localization severely degrades fatigue detection. Fatigue recognition methods based on physiological signals are costly and require direct contact with the driver's limbs, so they have not been widely adopted; video-based methods, being contactless, low-cost, and easy to implement, have become the popular research direction in fatigue recognition.
Disclosure of Invention
The invention aims to solve the above problems and provides a deep-learning-based algorithm for rapidly detecting a driver's fatigue state, intended mainly for driver fatigue detection scenarios.
To achieve this purpose, the invention adopts the following method. The disclosed deep-learning-based driver fatigue state rapid detection algorithm comprises at least a face detection algorithm, a facial key point and head pose angle detection algorithm, and a fatigue state recognition algorithm, and proceeds according to the following steps:
Step 1: collect a color image of the driving state, detect the face in the image with a three-level cascaded deep neural network, and mark it with a regression box. The specific process is as follows:
Step 1.1: input the whole image into the first-level face candidate box generation network, which processes every 12 × 12 window in the image; the network output layer maps each window to a two-dimensional face classification vector and a four-dimensional bounding box regression offset, the offset being used to correct the face regression box.
Step 1.2: scale the face candidate box images output by the first-level network to 24 × 24 as input to the second-level face candidate box coarse-screening network; after network learning, the output layer produces a face classification vector and a bounding box regression vector.
Step 1.3: scale the face candidate box images output by the second-level network to 48 × 48 as input to the third-level face candidate box fine-screening network; after network learning, the global average pooling output layer, together with a non-maximum suppression algorithm based on localization confidence, produces the face localization vector, face classification vector, and face bounding box regression vector.
Step 2: take the face bounding box output by step 1 as input to a deep learning network based on multi-task learning, which outputs the facial key points and head pose angles.
Step 2.1: replace the traditional large convolution structures in the network with a feature-extraction minimal unit structure, which greatly shrinks the network and speeds up the algorithm at the cost of only a slight reduction in accuracy.
Step 2.2: scale the face bounding box output by step 1 to 128 × 128 and input it into the multi-task deep learning network built from the feature-extraction minimal units; the network outputs a vector of 68 facial key points and a vector of three head pose angles (pitch, yaw, and roll).
Step 3: build a spatio-temporal fatigue feature sequence from the facial key points and head pose angles obtained in step 2, input the sequence into a fatigue recognition deep learning network, and output the fatigue recognition result.
Step 3.1: input the left- and right-eye key points obtained in step 2 into an eye state recognition network, which outputs an eye state classification vector indicating whether the eyes are open or closed.
Step 3.2: correct the mouth key points obtained in step 2 for head pose inclination, compute the mouth opening degree from the corrected key points, and judge whether the mouth indicates a yawning (fatigue) state or a normal state by thresholding the opening degree.
Step 3.3: take the left- and right-eye states, the mouth opening degree, and the head pitch angle as fatigue features; extract these facial fatigue features from each video frame to obtain a length-4 fatigue description feature vector per frame, expressed as:
v_t = (x_leye(t), x_reye(t), x_mouth(t), x_pose(t))
where x_leye(t) and x_reye(t) are the left- and right-eye states, x_mouth(t) is the mouth opening degree, and x_pose(t) is the head pitch angle.
Step 3.4: combining a plurality of frames of image fatigue description feature vectors with a time window size from a video in a frame extraction selection mode to form a space-time fatigue feature sequence, wherein the sequence expression is as follows:
Fi={vt,vt+k,vt+2k,...,vt+nk}
wherein n is the length of the time window, and k is the number of the fixed interval frames of the frame extraction.
In a preferred embodiment of the invention, the output layer of the first-level network in step 1.1 is a fully convolutional network (FCN) structure, and the output layer of the second-level network in step 1.2 is a global average pooling (GAP) layer, computed as:
f_GAPOut(x) = (1 / (M × N)) · Σ_(i=1..M) Σ_(j=1..N) x_ij
where f_GAPOut(x) is the output of the global average pooling layer, M and N are the feature map dimensions, and x_ij is a pixel value of the feature map.
The output layer of the third-level network in step 1.3 is a global average pooling structure combined with a non-maximum suppression algorithm based on localization confidence. Localization confidence is defined as the overlap rate (IoU) between the candidate bounding box and the ground-truth face box:
P_loc = IoU = S(A ∩ B) / S(A ∪ B)
where A denotes the candidate bounding box in the input image, B denotes the ground-truth face bounding box, S denotes region area, and P_loc is the localization confidence.
In a preferred embodiment of the invention, the feature-extraction minimal unit structure in step 2.1 has three features: (1) a depthwise separable convolution structure splits the standard convolution into a depthwise convolution and a pointwise convolution; (2) a shortcut connection joins the input feature map to the output feature map at the end of the unit; (3) the convolutional layers use the LeakyReLU activation function in place of the traditional ReLU and Sigmoid, computed as:
f(x) = x for x ≥ 0, and f(x) = αx for x < 0, where α is a small positive constant.
In a preferred embodiment of the present invention, in step 3.4, n = 60 and k = 1.
Advantageous effects:
1. In the first-level face candidate box generation network, the invention uses a fully convolutional network to generate face candidate boxes. A fully convolutional network recovers the category of each pixel from abstract features, extending image-level classification to pixel-level classification, so only one forward pass over the whole image is required, effectively reducing the computation that a sliding window would incur.
2. In the second-level face candidate box coarse-screening network, the invention replaces the traditional fully connected layer with a global average pooling layer, eliminating the huge parameter count that fully connected layers bring.
3. In the third-level face candidate box fine-screening network, the invention proposes a feature-extraction minimal unit structure to replace the traditional large convolution structure, reducing the network model size and accelerating the algorithm. Specifically:
3.1 Depthwise separable convolution greatly improves the network's compression rate and the detection speed of the convolutional neural network, and also allows the network to be made deeper, improving performance while remaining deployable on mobile devices.
3.2 The invention uses a shortcut structure at the end of the unit to join the input feature map to the output feature map, equivalent to splicing feature maps across convolution layers; the network output thus combines the features extracted by the convolution kernels with the original features, alleviating the degradation problem in deep network models.
3.3 The invention replaces the ReLU activation function with LeakyReLU, which multiplies the negative half of the input by a very small weight so that the negative region no longer saturates and dies, avoiding the problem of neurons in the negative interval ceasing to learn.
Drawings
FIG. 1 is an overall flow chart of the method of the present invention;
FIG. 2 is a network architecture diagram of the face detection part of the algorithm;
FIG. 3 is a network architecture diagram of the feature-extraction minimal unit;
FIG. 4 is a structure diagram of the fatigue state recognition network.
Detailed Description
The detailed process of the invention is described clearly and completely below with reference to the drawings and the embodiments of the specification.
The flow of the face fatigue state detection algorithm is shown in FIGS. 1 to 4; the algorithm proceeds according to the following steps:
Step 1: collect a color image of the driving state, detect the face in the image with a three-level cascaded deep neural network, and mark it with a regression box. The specific process is as follows:
Step 1.1: input the whole image into the first-level face candidate box generation network, which processes every 12 × 12 window in the image; the network output layer maps each window to a two-dimensional face classification vector and a four-dimensional bounding box regression offset, the offset being used to correct the face regression box.
Step 1.2: scale the face candidate box images output by the first-level network to 24 × 24 as input to the second-level face candidate box coarse-screening network; after network learning, the output layer produces a face classification vector and a bounding box regression vector.
Step 1.3: scale the face candidate box images output by the second-level network to 48 × 48 as input to the third-level face candidate box fine-screening network; after network learning, the global average pooling output layer, together with a non-maximum suppression algorithm based on localization confidence, produces the face localization vector, face classification vector, and face bounding box regression vector.
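For orientation, the sketch below shows one way the three cascaded stages could be chained in code. It is a minimal illustration only: the stage callables and the crop helper are assumptions, not structures named in this specification.

```python
import numpy as np
import cv2

def crop_and_resize(image, box, size):
    """Crop a candidate box (x1, y1, x2, y2) and rescale it to a stage's input size."""
    x1, y1, x2, y2 = [int(v) for v in box]
    patch = image[max(y1, 0):y2, max(x1, 0):x2]
    return cv2.resize(patch, (size, size))

def cascade_detect(image, stage1, stage2, stage3):
    """stage1/2/3 are assumed callables wrapping the three trained networks."""
    # Stage 1: one fully convolutional pass over the whole image yields face
    # scores and box-regression offsets for every 12 x 12 window.
    boxes = stage1(image)
    # Stage 2: coarse screening of candidates rescaled to 24 x 24.
    boxes = stage2(np.stack([crop_and_resize(image, b, 24) for b in boxes]), boxes)
    # Stage 3: fine screening at 48 x 48; the third network's output step applies
    # non-maximum suppression driven by localization confidence.
    boxes = stage3(np.stack([crop_and_resize(image, b, 48) for b in boxes]), boxes)
    return boxes
```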
Step 2: take the face bounding box output by step 1 as input to a deep learning network based on multi-task learning, which outputs the facial key points and head pose angles.
Step 2.1: replace the traditional large convolution structures in the network with a feature-extraction minimal unit structure, which greatly shrinks the network and speeds up the algorithm at the cost of only a slight reduction in accuracy.
Step 2.2: scale the face bounding box output by step 1 to 128 × 128 and input it into the multi-task deep learning network built from the feature-extraction minimal units; the network outputs a vector of 68 facial key points and a vector of three head pose angles (pitch, yaw, and roll).
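A minimal sketch of the multi-task output heads is given below in PyTorch. The 68-point and 3-angle outputs follow from the text; the 256-dimensional backbone feature is an illustrative assumption.

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Illustrative output heads: 68 landmark (x, y) pairs and 3 head-pose angles."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.landmarks = nn.Linear(feat_dim, 68 * 2)   # 136-d landmark vector
        self.pose = nn.Linear(feat_dim, 3)             # pitch, yaw, roll

    def forward(self, feats):
        return self.landmarks(feats), self.pose(feats)

# Usage on a 128 x 128 face crop, after an (assumed) backbone of minimal units:
# feats = backbone(face_crop_128)          # shape (batch, 256)
# points, angles = MultiTaskHead()(feats)
```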
Step 3: build a spatio-temporal fatigue feature sequence from the facial key points and head pose angles obtained in step 2, input the sequence into a fatigue recognition deep learning network, and output the fatigue recognition result.
Step 3.1: input the left- and right-eye key points obtained in step 2 into an eye state recognition network, which outputs an eye state classification vector indicating whether the eyes are open or closed.
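A minimal sketch of such an eye-state classifier follows; the six landmarks per eye match the standard 68-point layout, while the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EyeStateNet(nn.Module):
    """Tiny illustrative eye-state classifier over eye key points."""
    def __init__(self, n_points=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_points * 2, 32), nn.LeakyReLU(0.01),
            nn.Linear(32, 2),            # open vs. closed classification vector
        )

    def forward(self, pts):              # pts: (batch, n_points, 2)
        return self.net(pts.flatten(1))
```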
Step 3.2: correct the mouth key points obtained in step 2 for head pose inclination, compute the mouth opening degree from the corrected key points, and judge whether the mouth indicates a yawning (fatigue) state or a normal state by thresholding the opening degree.
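The specification does not spell out the opening-degree formula or the threshold value. One common choice, shown below purely as an assumption, is the ratio of inner-lip vertical distance to mouth width, with an illustrative threshold.

```python
import numpy as np

def mouth_opening_degree(mouth_pts):
    """Assumed measure: vertical inner-lip distance over mouth width.
    mouth_pts holds the 20 mouth landmarks (points 48-67 of the 68-point layout)."""
    top, bottom = mouth_pts[62 - 48], mouth_pts[66 - 48]   # inner upper/lower lip
    left, right = mouth_pts[60 - 48], mouth_pts[64 - 48]   # inner mouth corners
    return np.linalg.norm(top - bottom) / (np.linalg.norm(left - right) + 1e-6)

def is_yawning(mouth_pts, threshold=0.6):
    # The threshold value is illustrative, not taken from this specification.
    return mouth_opening_degree(mouth_pts) > threshold
```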
Step 3.3: take the left- and right-eye states, the mouth opening degree, and the head pitch angle as fatigue features; extract these facial fatigue features from each video frame to obtain a length-4 fatigue description feature vector per frame, expressed as:
v_t = (x_leye(t), x_reye(t), x_mouth(t), x_pose(t))
where x_leye(t) and x_reye(t) are the left- and right-eye states, x_mouth(t) is the mouth opening degree, and x_pose(t) is the head pitch angle.
Step 3.4: combining a plurality of frames of image fatigue description feature vectors with a time window size from a video in a frame extraction selection mode to form a space-time fatigue feature sequence, wherein the sequence expression is as follows:
Fi={vt,vt+k,vt+2k,...,vt+nk}
wherein n is the length of the time window, and k is the number of the fixed interval frames of the frame extraction. The patent shows that the better scheme is n-60 and k-1 through experimental comparison.
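Under those preferred values, assembling one sequence from the per-frame feature vectors could look like the sketch below; the array layout is an assumption.

```python
import numpy as np

def build_feature_sequence(frame_features, t, n=60, k=1):
    """Assemble F_i = {v_t, v_(t+k), ..., v_(t+nk)} from per-frame length-4
    fatigue vectors; n = 60 and k = 1 are the preferred values above."""
    return np.stack([frame_features[t + i * k] for i in range(n + 1)])  # (n + 1, 4)
```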
Step 3.5: process a video segment, input the resulting spatio-temporal fatigue feature sequence into a fatigue recognition network based on long short-term memory (LSTM), and output the fatigue recognition result for that segment.
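A minimal sketch of such an LSTM-based recognizer is shown below. The length-4 input and the fatigued/normal output follow from the text; the hidden size and single-layer topology are assumptions.

```python
import torch
import torch.nn as nn

class FatigueLSTM(nn.Module):
    """Illustrative LSTM classifier over the spatio-temporal fatigue sequence."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2)   # fatigued vs. normal

    def forward(self, seq):              # seq: (batch, 61, 4) with n = 60, k = 1
        out, _ = self.lstm(seq)
        return self.fc(out[:, -1])       # classify from the last time step
```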
The output layer of the first-level network in step 1.1 is a fully convolutional network (FCN) structure, and the output layer of the second-level network in step 1.2 is a global average pooling (GAP) layer, computed as:
f_GAPOut(x) = (1 / (M × N)) · Σ_(i=1..M) Σ_(j=1..N) x_ij
where f_GAPOut(x) is the output of the global average pooling layer, M and N are the feature map dimensions, and x_ij is a pixel value of the feature map.
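As a quick illustration of the formula, global average pooling collapses each M × N feature map to a single scalar per channel:

```python
import torch

fmap = torch.randn(32, 6, 6)        # 32 feature maps, each 6 x 6 (M = N = 6)
gap_out = fmap.mean(dim=(1, 2))     # f_GAPOut per channel, shape (32,)
```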
The output layer of the third-level network in step 1.3 is a global average pooling structure combined with a non-maximum suppression algorithm based on localization confidence. Localization confidence is defined as the overlap rate (IoU) between the candidate bounding box and the ground-truth face box:
P_loc = IoU = S(A ∩ B) / S(A ∪ B)
where A denotes the candidate bounding box in the input image, B denotes the ground-truth face bounding box, S denotes region area, and P_loc is the localization confidence.
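The localization confidence reduces to the standard IoU computation; a straightforward rendering, assuming boxes in (x1, y1, x2, y2) corner format:

```python
def localization_confidence(box_a, box_b):
    """P_loc = IoU = S(A ∩ B) / S(A ∪ B) for corner-format boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)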
The feature-extraction minimal unit structure in step 2.1 has three features: (1) a depthwise separable convolution structure splits the standard convolution into a depthwise convolution and a pointwise convolution; (2) a shortcut connection joins the input feature map to the output feature map at the end of the unit; (3) the convolutional layers use the LeakyReLU activation function in place of the traditional ReLU and Sigmoid, computed as:
f(x) = x for x ≥ 0, and f(x) = αx for x < 0, where α is a small positive constant.
in the first-level face candidate frame generation network, the traditional method uses an image pyramid sliding window method to generate face candidate frames, which requires that the candidate frames obtained by each sliding window need to be subjected to forward network calculation once to obtain the face classification confidence. The invention uses the full convolution network to generate the face candidate frame, the full convolution network can recover the category of each pixel from the abstract characteristics, the classification of the image level is extended to the classification of the pixel level, only one network forward calculation is needed to be executed on the whole image, and the calculation amount brought by using the sliding window can be effectively reduced.
In the second-level face candidate box coarse-screening network, the traditional approach uses a fully connected structure, in which every node connects to all nodes of the previous layer, so the fully connected layers carry an excessive number of parameters. The invention replaces them with a global average pooling layer to eliminate this parameter overhead.
In the third-level face candidate box fine-screening network, traditional face detection algorithms merge overlapping face boxes in the candidate set with non-maximum suppression based on classification confidence. However, localization confidence correlates more closely with the ground-truth bounding box than classification confidence does, so non-maximum suppression based on localization confidence yields a more accurate localization bounding box.
In addition, to compress the face key point and head pose angle network, the invention proposes a feature-extraction minimal unit structure to replace the traditional large convolution structure, reducing the network model size and accelerating the algorithm. The structure makes the following improvements:
In a standard convolution, each kernel convolves with every channel of the input simultaneously; in a depthwise convolution, each kernel convolves with only one channel, and a pointwise convolution then recombines the resulting feature maps into new ones. Depthwise separable convolution thus greatly improves the network's compression rate and the detection speed of the convolutional neural network, and allows the network to be made deeper, improving performance while remaining deployable on mobile devices. A worked example of the savings follows.
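For a single 3 × 3 convolution with 64 input and 64 output channels, the parameter counts compare as follows:

```python
# Weight counts for one 3x3 conv layer with C_in = C_out = 64.
c_in, c_out, k = 64, 64, 3
standard = k * k * c_in * c_out                  # 36,864 weights
separable = k * k * c_in + c_in * c_out          # 576 + 4,096 = 4,672 weights
print(standard / separable)                      # about 7.9x fewer parameters
```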
In a traditional CNN, the deeper the network, the smaller the gradients of the earlier layers become as gradients propagate from back to front, which easily causes vanishing gradients and makes deep convolutional networks hard to train. The invention uses a shortcut structure at the end of the unit to join the input feature map to the output feature map, equivalent to splicing feature maps across convolution layers; the network output thus combines the features extracted by the convolution kernels with the original features, alleviating the degradation problem in deep network models.
The activation function used in a standard convolutional layer is ReLU; for negative inputs its output is always 0 and its first derivative is also zero, so the affected neuron's parameters cannot be updated, which reduces the network's fitting ability in function-approximation tasks such as key point detection. The invention replaces ReLU with LeakyReLU, which multiplies the negative half of the input by a very small weight so that the negative region no longer saturates and dies, avoiding the problem of neurons in the negative interval ceasing to learn.

Claims (4)

1. A driver fatigue state rapid detection method based on deep learning is characterized by comprising the following steps:
step 1: collect a color image of the driving state, detect the face in the image with a three-level cascaded deep neural network, and mark it with a regression box, the specific process being as follows:
step 1.1: input the whole image into the first-level face candidate box generation network, which processes every 12 × 12 window in the image; the network output layer maps each window to a two-dimensional face classification vector and a four-dimensional bounding box regression offset, the offset being used to correct the face regression box;
step 1.2: scale the face candidate box images output by the first-level network to 24 × 24 as input to the second-level face candidate box coarse-screening network; after network learning, the output layer produces a face classification vector and a bounding box regression vector;
step 1.3: scale the face candidate box images output by the second-level network to 48 × 48 as input to the third-level face candidate box fine-screening network; after network learning, the global average pooling output layer, together with a non-maximum suppression algorithm based on localization confidence, produces the face localization vector, face classification vector, and face bounding box regression vector;
step 2: take the face bounding box output by step 1 as input to a deep learning network based on multi-task learning, which outputs the facial key points and head pose angles, specifically comprising:
step 2.1: replace the traditional large convolution structures in the network with a feature-extraction minimal unit structure;
step 2.2: scale the face bounding box output by step 1 to 128 × 128 and input it into the multi-task deep learning network built from the feature-extraction minimal units; the network outputs a vector of 68 facial key points and a vector of three head pose angles (pitch, yaw, and roll);
step 3: build a spatio-temporal fatigue feature sequence from the facial key points and head pose angles obtained in step 2, input the sequence into a fatigue recognition deep learning network, and output the fatigue recognition result, specifically comprising:
step 3.1: input the left- and right-eye key points obtained in step 2 into an eye state recognition network, which outputs an eye state classification vector indicating whether the eyes are open or closed;
step 3.2: correct the mouth key points obtained in step 2 for head pose inclination, compute the mouth opening degree from the corrected key points, and judge whether the mouth indicates a yawning (fatigue) state or a normal state by thresholding the opening degree;
step 3.3: take the left- and right-eye states, the mouth opening degree, and the head pitch angle as fatigue features; extract these facial fatigue features from each video frame to obtain a length-4 fatigue description feature vector per frame, expressed as:
v_t = (x_leye(t), x_reye(t), x_mouth(t), x_pose(t))
where x_leye(t) and x_reye(t) are the left- and right-eye states, x_mouth(t) is the mouth opening degree, and x_pose(t) is the head pitch angle;
step 3.4: combining a plurality of frames of image fatigue description feature vectors with a time window size from a video in a frame extraction selection mode to form a space-time fatigue feature sequence, wherein the sequence expression is as follows:
Fi={vt,vt+k,vt+2k,...,vt+nk}
wherein n is the length of the time window, and k is the number of the fixed interval frames of the frame extraction.
step 3.5: process a video segment, input the resulting spatio-temporal fatigue feature sequence into a fatigue recognition network based on long short-term memory (LSTM), and output the fatigue recognition result for that segment.
2. The method for rapidly detecting the fatigue state of the driver based on deep learning according to claim 1, wherein the output layer of the first-level network in step 1.1 is a fully convolutional network structure, and the output layer of the second-level network in step 1.2 is a global average pooling layer, computed as:
f_GAPOut(x) = (1 / (M × N)) · Σ_(i=1..M) Σ_(j=1..N) x_ij
where f_GAPOut(x) is the output of the global average pooling layer, M and N are the feature map dimensions, and x_ij is a pixel value of the feature map;
the output layer of the third-level network in step 1.3 is a global average pooling structure combined with a non-maximum suppression algorithm based on localization confidence, the localization confidence being defined as the overlap rate (IoU) between the candidate bounding box and the ground-truth face box:
P_loc = IoU = S(A ∩ B) / S(A ∪ B)
where A denotes the candidate bounding box in the input image, B denotes the ground-truth face bounding box, S denotes region area, and P_loc is the localization confidence.
3. The method for rapidly detecting the fatigue state of the driver based on deep learning according to claim 1, wherein the feature-extraction minimal unit structure in step 2.1 has three features: (1) a depthwise separable convolution structure splits the standard convolution into a depthwise convolution and a pointwise convolution; (2) a shortcut connection joins the input feature map to the output feature map at the end of the unit; (3) the convolutional layers use the LeakyReLU activation function in place of the traditional ReLU and Sigmoid, computed as:
f(x) = x for x ≥ 0, and f(x) = αx for x < 0, where α is a small positive constant.
4. The deep learning-based rapid detection method for the fatigue state of the driver according to claim 1, wherein in step 3.4, n = 60 and k = 1.
CN201910824958.5A 2019-09-02 2019-09-02 Driver fatigue state rapid detection method based on deep learning Pending CN110674701A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910824958.5A CN110674701A (en) 2019-09-02 2019-09-02 Driver fatigue state rapid detection method based on deep learning


Publications (1)

Publication Number Publication Date
CN110674701A true CN110674701A (en) 2020-01-10

Family

ID=69075921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910824958.5A Pending CN110674701A (en) 2019-09-02 2019-09-02 Driver fatigue state rapid detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN110674701A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105769120A (en) * 2016-01-27 2016-07-20 深圳地平线机器人科技有限公司 Fatigue driving detection method and device
CN108309311A (en) * 2018-03-27 2018-07-24 北京华纵科技有限公司 A kind of real-time doze of train driver sleeps detection device and detection algorithm
CN109740477A (en) * 2018-12-26 2019-05-10 联创汽车电子有限公司 Study in Driver Fatigue State Surveillance System and its fatigue detection method
CN110119676A (en) * 2019-03-28 2019-08-13 广东工业大学 A kind of Driver Fatigue Detection neural network based

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LI, H. et al.: "A convolutional neural network cascade for face detection", IEEE Conference on Computer Vision and Pattern Recognition *
MIN LIN et al.: "Network In Network", arXiv:1312.4400v3 *
LIU Tianliang et al.: "Human action recognition fusing spatio-temporal dual-network streams and visual attention" (in Chinese), Journal of Electronics & Information Technology *
YANG Long et al.: "SAR ship target detection based on deep convolutional neural networks" (in Chinese), http://kns.cnki.net/kcms/detail/11.2422.TN.20190725.1519.002.html *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242049A (en) * 2020-01-15 2020-06-05 武汉科技大学 Student online class learning state evaluation method and system based on facial recognition
CN111242049B (en) * 2020-01-15 2023-08-04 武汉科技大学 Face recognition-based student online class learning state evaluation method and system
CN111645695A (en) * 2020-06-28 2020-09-11 北京百度网讯科技有限公司 Fatigue driving detection method and device, computer equipment and storage medium
CN111645695B (en) * 2020-06-28 2022-08-09 北京百度网讯科技有限公司 Fatigue driving detection method and device, computer equipment and storage medium
WO2022001091A1 (en) * 2020-06-29 2022-01-06 北京百度网讯科技有限公司 Dangerous driving behavior recognition method and apparatus, and electronic device and storage medium
CN111898473A (en) * 2020-07-10 2020-11-06 华南农业大学 Driver state real-time monitoring method based on deep learning
CN111898473B (en) * 2020-07-10 2023-09-01 华南农业大学 Driver state real-time monitoring method based on deep learning
CN112101103A (en) * 2020-08-07 2020-12-18 东南大学 Video driver fatigue detection method based on deep integration network
CN112101103B (en) * 2020-08-07 2022-08-09 东南大学 Video driver fatigue detection method based on deep integration network
CN111738262A (en) * 2020-08-21 2020-10-02 北京易真学思教育科技有限公司 Target detection model training method, target detection model training device, target detection model detection device, target detection equipment and storage medium
CN112304435A (en) * 2020-10-10 2021-02-02 广州中大数字家庭工程技术研究中心有限公司 Human body thermal imaging temperature measurement method combining face recognition
CN112733628A (en) * 2020-12-28 2021-04-30 杭州电子科技大学 Fatigue driving state detection method based on MobileNet-V3
CN112668480B (en) * 2020-12-29 2023-08-04 上海高德威智能交通系统有限公司 Head attitude angle detection method and device, electronic equipment and storage medium
CN112668480A (en) * 2020-12-29 2021-04-16 上海高德威智能交通系统有限公司 Head attitude angle detection method and device, electronic equipment and storage medium
CN112686187A (en) * 2021-01-05 2021-04-20 四川铁投信息技术产业投资有限公司 Road traffic abnormal state detection method and device based on deep learning video classification
CN113361452A (en) * 2021-06-24 2021-09-07 中国科学技术大学 Driver fatigue driving real-time detection method and system based on deep learning
CN113537115A (en) * 2021-07-26 2021-10-22 东软睿驰汽车技术(沈阳)有限公司 Method and device for acquiring driving state of driver and electronic equipment
CN113780158A (en) * 2021-09-08 2021-12-10 宁波书写芯忆科技有限公司 Intelligent attention force detection method
CN113780158B (en) * 2021-09-08 2023-10-31 宁波书写芯忆科技有限公司 Intelligent concentration detection method
CN114821713A (en) * 2022-04-08 2022-07-29 湖南大学 Fatigue driving detection method based on Video transducer
CN114821747A (en) * 2022-05-26 2022-07-29 深圳市科荣软件股份有限公司 Method and device for identifying abnormal state of construction site personnel
CN115565159A (en) * 2022-09-28 2023-01-03 华中科技大学 Construction method and application of fatigue driving detection model
CN115565159B (en) * 2022-09-28 2023-03-28 华中科技大学 Construction method and application of fatigue driving detection model

Similar Documents

Publication Publication Date Title
CN110674701A (en) Driver fatigue state rapid detection method based on deep learning
CN108830252B (en) Convolutional neural network human body action recognition method fusing global space-time characteristics
CN110570458B (en) Target tracking method based on internal cutting and multi-layer characteristic information fusion
Chen et al. Survey of pedestrian action recognition techniques for autonomous driving
CN106778796B (en) Human body action recognition method and system based on hybrid cooperative training
WO2021213158A1 (en) Real-time face summarization service method and system for intelligent video conference terminal
CN109948721B (en) Video scene classification method based on video description
CN110378208B (en) Behavior identification method based on deep residual error network
CN112288627B (en) Recognition-oriented low-resolution face image super-resolution method
CN109190561B (en) Face recognition method and system in video playing
CN110956082B (en) Face key point detection method and detection system based on deep learning
CN109063626B (en) Dynamic face recognition method and device
CN109543632A (en) A kind of deep layer network pedestrian detection method based on the guidance of shallow-layer Fusion Features
CN109635693B (en) Front face image detection method and device
CN105243376A (en) Living body detection method and device
KR20070016849A (en) Method and apparatus for serving prefer color conversion of skin color applying face detection and skin area detection
CN111402237A (en) Video image anomaly detection method and system based on space-time cascade self-encoder
CN113792635A (en) Gesture recognition method based on lightweight convolutional neural network
CN111768354A (en) Face image restoration system based on multi-scale face part feature dictionary
CN111401116B (en) Bimodal emotion recognition method based on enhanced convolution and space-time LSTM network
CN106778576A (en) A kind of action identification method based on SEHM feature graphic sequences
CN114155512A (en) Fatigue detection method and system based on multi-feature fusion of 3D convolutional network
CN109784215A (en) A kind of in-vivo detection method and system based on improved optical flow method
CN110503049B (en) Satellite video vehicle number estimation method based on generation countermeasure network
Wang et al. An attention self-supervised contrastive learning based three-stage model for hand shape feature representation in cued speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200110)