CN113408389A - Method for intelligently recognizing drowsiness action of driver - Google Patents

Method for intelligently recognizing drowsiness action of driver

Info

Publication number
CN113408389A
CN113408389A
Authority
CN
China
Prior art keywords
drowsiness
driver
optical flow
model
image information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110650708.1A
Other languages
Chinese (zh)
Inventor
唐明伟
李林熹
赵潇然
毛红运
曾晟珂
陈晓亮
何明星
徐杨胜
王鹏程
王刘萱
蒙科竹
陶林平
田佳鑫
蒋一铭
杨凌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xihua University
Original Assignee
Xihua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xihua University filed Critical Xihua University
Priority to CN202110650708.1A priority Critical patent/CN113408389A/en
Publication of CN113408389A publication Critical patent/CN113408389A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention provides a method for intelligently identifying drowsiness actions of a driver, which comprises the following steps: step one: acquiring a video stream during the driving process of a driver; step two: preprocessing the video stream to obtain grayscale image information and optical flow image information of the video stream; step three: taking the grayscale image information and the optical flow image information as the input of a drowsiness action recognition model, thereby obtaining the drowsiness action of the driver. The invention takes the optical flow image as an input of the drowsiness action detection model; since the optical flow image stores the motion information of the object, the detection accuracy is further improved.

Description

Method for intelligently recognizing drowsiness action of driver
Technical Field
The invention relates to the technical field of image processing, in particular to a method for intelligently identifying drowsiness actions of a driver.
Background
In recent years, deep learning has exhibited excellent performance in various applications. CNN, also known as the two-dimensional convolutional neural network (2D-CNN), is one of the most powerful deep learning algorithms in image recognition and classification. In the field of drowsiness detection, Zhao et al. proposed a CNN-based drowsiness detection method. They extract a face region from an image, classify the images according to the state of the eyes using the proposed CNN model, and then determine whether the subject is drowsy according to PERCLOS values. By using CNN for feature extraction, the drowsiness detection accuracy is significantly improved. However, since a 2D-CNN convolves only the width and height of the image and does not include temporal features, limitations remain: 2D-CNNs do not take into account the motion information contained in a continuous sequence of frames. When the driver frequently yawns or nods off, it indicates that the driver is drowsy. These actions are dynamic and cannot be reflected in a single image, whereas a 2D-CNN can only determine the static state of the driver. For example, an eye-closed image may capture a normal blink, or it may be one frame of a slow blink, where a slow blink means that the driver's blinking has slowed down due to fatigue, or the driver has simply fallen asleep with eyes closed. The duration of the eye-closed state during a slow blink is longer than during a normal blink, which cannot be reflected in a two-dimensional image. To extract temporal features from a continuous sequence of frames, the three-dimensional convolutional neural network (3D-CNN) was therefore proposed, which integrates spatio-temporal information into a single model to capture discriminative features in the spatio-temporal dimension. However, the detection accuracy of the three-dimensional convolutional neural network is still not high enough.
Disclosure of Invention
The invention aims to solve the defects in the prior art and provides a method for intelligently identifying drowsiness actions of a driver with higher identification precision.
A method for intelligently recognizing drowsiness actions of a driver comprises the following steps:
step one: acquiring a video stream during the driving process of a driver;
step two: preprocessing the video stream to obtain gray image information and optical flow image information of the video stream;
step three: and taking the gray scale image information and the optical flow image information as the input of a drowsiness action recognition model, thereby obtaining the drowsiness action of the driver.
Further, according to the method for intelligently recognizing the drowsiness action of the driver as described above, in the second step, the preprocessing includes:
step 2-1: dividing the video stream into a plurality of video segments at certain time intervals;
step 2-2: extracting image information of the video clip according to frames, and respectively converting the image information of each frame to obtain a gray level image sequence;
step 2-3: and calculating optical flow information between two frames according to the optical flow between adjacent frames, thereby obtaining an optical flow image sequence.
Further, in the method for intelligently identifying the drowsiness action of the driver as described above, the time interval in step 2-1 is such that a video clip is intercepted every 3 seconds.
Further, according to the method for intelligently recognizing the drowsiness of the driver as described above, the image size of each frame is 224 × 224.
Further, in the method for intelligently recognizing the drowsiness action of the driver as described above, the action categories recognized by the model include: normal driving, nodding, slow blinking, and yawning.
Further, in the method for intelligently recognizing drowsiness of the driver as described above, the drowsiness action recognition model comprises: convolutional layers, pooling layers, fully-connected layers and a softmax classifier, connected in sequence;
wherein there are 4 convolutional layers; the first convolutional layer comprises 8 convolution kernels, each of size 3 × 3 × 3; the second and third convolutional layers have 16 convolution kernels each, and the fourth convolutional layer has 8 convolution kernels;
the full connection layer comprises two full connection layers, namely fc1 full connection layer and fc2 full connection layer; the number of the neurons of the fc1 full connection layer is 23520, and the number of the neurons of the fc2 full connection layer is 64.
Beneficial effects:
the invention provides a drowsiness action recognition method based on 3D-CNN, which can extract spatial and temporal features from an input image sequence and is beneficial to action recognition.
The invention also uses the optical flow image as the input of the drowsiness action detection model. The optical flow image stores the motion information of the object, so that the detection accuracy is further improved.
The method provided by the invention can identify a plurality of drowsiness actions, and the experimental result shows that the classification accuracy of the model reaches 86.6%, which is competitive with the existing method.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a conceptual diagram of temporal IoU;
FIG. 3 is a diagram of a drowsiness action recognition model;
FIG. 4 is a schematic view of an optical flow image;
FIG. 5 is a graph comparing the effect of data enhancement and optical flow on model accuracy;
FIG. 6 is a graph comparing the effect of data enhancement and optical flow on model loss.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described clearly and completely below, and it is obvious that the described embodiments are some, not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention is improved on the basis of a three-dimensional convolution neural network model, and an optical flow image of object motion information is input into the model for training.
As shown in FIG. 1, the invention provides a drowsiness action recognition model based on a three-dimensional convolutional neural network for driver drowsiness detection research. The model can recognize four different actions, including a non-drowsy action (normal driving) and three drowsy actions (nodding, slow blinking, yawning); the input of the model is a 10-frame grayscale image sequence and a 9-frame optical flow image sequence. First, 10 frames are extracted from a 3-second video clip, and each frame is converted into a grayscale image. In addition, we calculate the optical flow between adjacent frames. The grayscale image sequence and the optical flow image sequence are then input into a pre-trained drowsiness action recognition model, and a 1 × 4 vector is obtained representing the probability of each class.
Data pre-processing
The acquired video stream needs to be preprocessed first. The drowsiness action recognition model is the core of the proposed drowsiness detection scheme; it can recognize four behaviors: normal driving, nodding, slow blinking, and yawning. Since the model is based on a three-dimensional convolutional neural network, its input is a sequence of frames, so successive frames must be extracted from the video. Each video in the National Tsing Hua University Driver Drowsiness Detection (NTHU-DDD) dataset lasts about 1 minute, i.e., 1800 frames, which is too large as a model input given the limited computing power of the experimental equipment. Therefore, we need to clip these videos into video segments. A clipped segment cannot be too long: a segment that is too long contains multiple drowsiness actions, and the model then fails to converge during training. Nor can it be too short, which would fail to capture the characteristics of an action, because a drowsiness action may contain multiple phases. For example, yawning includes an opening-mouth phase, a mouth-open-to-maximum-and-hold phase, and a closing-mouth phase, and an overly short segment would cause the model to recognize different phases of the same action as different actions. For this reason, we counted the duration of each drowsiness action in part of the videos and found that the duration of these actions is above 3 seconds. To ensure that all drowsiness actions are detected, we set the duration of a video segment to 3 seconds and apply this setting to the training and test sets of NTHU-DDD.
We use the concept of temporal IoU to convert the frame-level annotations in the NTHU-DDD dataset into segment-level annotations; the concept of temporal IoU is shown in FIG. 2. The frame-level annotation value that accounts for more than 50% of a segment's frames is designated as the annotation value of that segment. Furthermore, we extract 10 grayscale frames from each 3-second video segment at average time intervals, and then calculate the optical flow between adjacent frames using the LK optical flow method to capture the motion information of the object. Each frame is then resized to 224 × 224 and input into the model. The most common method of preventing model overfitting is to artificially enlarge the dataset using label-preserving transformations. In the present invention, we apply horizontal image flipping and brightness-contrast transformation to augment the dataset. The data generated by data enhancement enlarges the number of training samples and increases their diversity, which prevents overfitting and improves the performance of the model.
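The label-preserving transformations mentioned above can be sketched in NumPy as follows. The specific brightness/contrast coefficients are assumptions for illustration; the patent only states that flipping and a brightness-contrast transformation are applied:

```python
import numpy as np

def horizontal_flip(seq: np.ndarray) -> np.ndarray:
    """Flip every frame of a (frames, height, width) sequence left-right;
    the class label is unchanged, so the transform is label-preserving."""
    return seq[:, :, ::-1].copy()

def brightness_contrast(seq: np.ndarray, alpha: float = 1.2,
                        beta: float = 10.0) -> np.ndarray:
    """Apply out = alpha * in + beta per pixel, clipped to the 8-bit range.
    alpha scales contrast, beta shifts brightness (illustrative values)."""
    out = seq.astype(np.float32) * alpha + beta
    return np.clip(out, 0, 255).astype(np.uint8)

# a dummy 10-frame grayscale sequence at the 224 x 224 size used by the model
seq = np.random.randint(0, 256, size=(10, 224, 224), dtype=np.uint8)
augmented = [seq, horizontal_flip(seq), brightness_contrast(seq)]
```

Applying both transforms to every original sequence is consistent with the reported growth of the training set from 6918 to 27672 sequences (each sample plus its flipped and intensity-shifted variants).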
Drowsiness action recognition model
The input of the 3D-CNN is an image sequence rather than a single image, so the convolution kernels and the generated feature maps are three-dimensional, and the output feature maps contain spatio-temporal feature representations. Formally, the value at position (i, j, k) in a feature map of layer l+1 is denoted P^(l+1)(i, j, k):

P^(l+1)(i, j, k) = α(f(i, j, k) + b)   (1)

f(i, j, k) = Σ_n Σ_{x=0}^{w-1} Σ_{y=0}^{h-1} Σ_{z=0}^{d-1} w_n(x, y, z) · P_n^l(s·i + x, s·j + y, s·k + z)   (2)

where α denotes the activation function and n indexes the feature maps of layer l; w, h and d denote the width, height and depth of the convolution kernel; w_n(x, y, z) denotes the kernel weight at position (x, y, z); s denotes the stride; and b denotes the bias.
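Equations (1) and (2) can be rendered directly in NumPy. The sketch below computes a single output position, mirroring the summation term for term; ReLU is an illustrative choice for the activation α and is not specified by the patent:

```python
import numpy as np

def conv3d_point(feature_maps, kernels, i, j, k, stride=1, bias=0.0):
    """Compute P^(l+1)(i, j, k) per equations (1)-(2).

    feature_maps: array (N, W, H, D) -- the N feature cubes of layer l
    kernels:      array (N, w, h, d) -- one w*h*d kernel per input cube
    """
    n_maps = feature_maps.shape[0]
    _, w, h, d = kernels.shape
    f = 0.0
    for n in range(n_maps):                      # sum over feature maps
        for x in range(w):
            for y in range(h):
                for z in range(d):               # sum over the kernel volume
                    f += kernels[n, x, y, z] * feature_maps[
                        n, stride * i + x, stride * j + y, stride * k + z]
    return max(0.0, f + bias)                    # alpha = ReLU (assumption)

rng = np.random.default_rng(0)
fmaps = rng.standard_normal((2, 8, 8, 4))
kerns = rng.standard_normal((2, 3, 3, 3))
p = conv3d_point(fmaps, kerns, 0, 0, 0)
```

A real implementation would of course vectorize this, but the quadruple loop makes the correspondence with the four summations in equation (2) explicit.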
The drowsiness action recognition model provided by the invention is based on 3D-CNN and comprises two input streams: a 10-frame grayscale image sequence and a 9-frame optical flow image sequence. The grayscale image stream passes through four convolutional layers and four pooling layers; similarly, the optical flow image stream also passes through four convolutional layers and four pooling layers. The two outputs are then flattened into one-dimensional vectors and connected to the fully-connected layers, and finally the classification result of the input image sequence is obtained. The model architecture is shown in FIG. 3, where gray denotes the grayscale image input stream and flow denotes the optical flow image input stream. 224 × 224 × 10 denotes that the input images are 224 × 224 in width and height, and 10 is the depth (number of frames) of the input image sequence. C_{i-j} k@w × h × d indicates that the jth convolutional layer of the ith input stream has k convolution kernels of size w × h and depth d. m@w × h × d on the left side of a cube in the figure indicates that the current layer has m feature cubes (corresponding to feature maps in two-dimensional convolution) of size w × h × d. The first convolutional layer contains 8 convolution kernels, each of size 3 × 3 × 3; the second and third convolutional layers have 16 convolution kernels each, and the fourth convolutional layer has 8. S_{i-j} w × h × d indicates that the pooling size of the jth pooling layer of the ith input stream is w × h × d. fc denotes a fully-connected layer: the fc1 layer stretches the learned feature cubes into one dimension, with 23520 neurons in total, and is connected to the 64 neurons of the fc2 layer; fc2 is finally connected to the 4 neurons of the output layer, and the final classification result is obtained through a softmax classifier.
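Since the exact pooling kernel sizes are not spelled out in the text, the shapes flowing through the network cannot be reproduced exactly here, but the per-dimension arithmetic of a valid (unpadded) convolution or pooling step is a one-liner; this small helper, with illustrative inputs, shows how each 3 × 3 × 3 convolution shrinks the feature cube:

```python
def conv_out(size: int, kernel: int, stride: int = 1, pad: int = 0) -> int:
    """Output extent of a valid convolution/pooling along one dimension."""
    return (size + 2 * pad - kernel) // stride + 1

# one 3x3x3 convolution (stride 1, no padding) on a 224 x 224 x 10 input:
w = conv_out(224, 3)      # spatial extent after the convolution
d = conv_out(10, 3)       # temporal extent (frame depth) after the convolution
half = conv_out(224, 2, stride=2)  # e.g. a 2x2 spatial pooling halves the width
```

The same formula applied layer by layer (with the model's actual pooling sizes) would yield the 23520-element flattened vector fed to fc1.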
LK optical flow method
Because three-dimensional convolution has one more (temporal) dimension than two-dimensional convolution, the model can learn feature representations of the input images in both time and space, and previous research has shown that three-dimensional convolution applies well to the field of drowsiness detection. The invention introduces the optical flow method into the drowsiness action recognition model, and the experimental results show that optical flow images are indeed helpful for recognizing drowsiness actions.
The concept of optical flow was first proposed by Gibson in 1950 as the instantaneous velocity of pixel motion in spatially moving objects in the imaging plane. And finding the corresponding relation between the current frame and the previous frame by utilizing the change of the pixel values in the image sequence and the correlation between the adjacent frames, thereby calculating the motion information of the object between the adjacent frames. In general, optical flow is caused by movement of the foreground object itself, movement of the camera, or movement of both in the scene. The optical flow algorithm evaluates the deformation between two images, with the basic assumption that the pixel values of moving objects do not change in the image sequence. Based on this assumption, we can derive the constraint equation for the image:
I(x,y,t)=I(x+Δx,y+Δy,t+Δt) (3)
where (x, y) denotes the coordinates of a pixel in the image, I(x, y, t) denotes the brightness of the pixel at position (x, y) at time t, Δx denotes the displacement of the pixel in the horizontal direction after time Δt, and Δy denotes its displacement in the vertical direction after time Δt.
Expanding the right-hand side of equation (3) as a Taylor series gives:

I(x+Δx, y+Δy, t+Δt) = I(x, y, t) + (∂I/∂x)·Δx + (∂I/∂y)·Δy + (∂I/∂t)·Δt + H.O.T.   (4)

where ∂I/∂x, ∂I/∂y and ∂I/∂t denote the partial derivatives of I with respect to x, y and t, respectively, and H.O.T. denotes the higher-order terms of the Taylor expansion, which can be taken as 0 under the small-motion assumption of the optical flow method. Combining equations (3) and (4) gives:

(∂I/∂x)·Δx + (∂I/∂y)·Δy + (∂I/∂t)·Δt = 0   (5)

Dividing both sides of equation (5) by Δt gives:

(∂I/∂x)·(Δx/Δt) + (∂I/∂y)·(Δy/Δt) + ∂I/∂t = 0   (6)
Let:

I_x = ∂I/∂x,  I_y = ∂I/∂y,  I_t = ∂I/∂t,  V_x = Δx/Δt,  V_y = Δy/Δt   (7)
then:

I_x·V_x + I_y·V_y = -I_t   (8)

Equation (8) is a single equation in two unknowns, V_x and V_y, and therefore cannot be solved on its own. Lucas and Kanade therefore proposed an additional assumption, spatial consistency: the pixels within a neighborhood of the target pixel all move in the same direction; the resulting algorithm is named the LK optical flow method. Assuming the neighborhood is of size m × m, we have:

I_x(p_n)·V_x + I_y(p_n)·V_y = -I_t(p_n),  n = 1, 2, …, m²   (9)
where p_n is the nth pixel in the window. Equation (9) can be expressed in matrix form:
Av=-b (10)
where

A = [ I_x(p_1) I_y(p_1) ; I_x(p_2) I_y(p_2) ; … ; I_x(p_{m²}) I_y(p_{m²}) ],  v = (V_x, V_y)ᵀ,  b = ( I_t(p_1), I_t(p_2), …, I_t(p_{m²}) )ᵀ
Equation (10) is an overdetermined system of equations, because the number of equations is greater than the number of unknowns; we can solve it using the least squares method:
AᵀA v = Aᵀ(-b)   (11)
where Aᵀ is the transpose of matrix A. Multiplying both sides of equation (11) on the left by the inverse of AᵀA gives:

v = (AᵀA)⁻¹ Aᵀ(-b)   (12)

Written out in terms of the image derivatives, the solution is:

(V_x, V_y)ᵀ = [ Σ I_x(p_n)²  Σ I_x(p_n)I_y(p_n) ; Σ I_x(p_n)I_y(p_n)  Σ I_y(p_n)² ]⁻¹ · ( -Σ I_x(p_n)I_t(p_n), -Σ I_y(p_n)I_t(p_n) )ᵀ   (13)
the finally obtained vector (V)x,Vy) Namely the light stream calculated by the L-K light stream. In the present invention, a window size of 5 × 5 is preferable, and the generated optical flow image is as shown in fig. 4.
Results of the experiment
We first trained the drowsiness action recognition model on a training set containing 6918 image sequences taken from the training set of NTHU-DDD, which together contain simulated driving data from 18 subjects. The 1460 image sequences of the validation set all come from the validation set of NTHU-DDD and contain simulated driving data from 4 subjects. The data distribution of the training set is shown in Table 1.
In fact, the normal driving category is the most represented category in the entire dataset, at approximately 60%. In order to balance the proportion of drowsy and non-drowsy actions in the dataset, we discarded 20% of the data in the normal driving category, controlling the ratio of drowsy to non-drowsy action data at around 1:1. In addition, the proportions of the five scene types in the training set are close to 1:1:1:1:1. Finally, we extended the training set with the data enhancement methods mentioned in 3.2: we applied a horizontal flip transform and changed brightness and contrast, finally obtaining 27672 image sequences as the training set.
TABLE 1 training set data distribution
Effect of optical flow and data enhancement on model
The drowsiness action recognition model provided by the invention comprises two input streams: a grayscale image sequence and an optical flow image sequence. From the description of the LK optical flow method above, we know that optical flow images hold object motion information beneficial for drowsiness action recognition, so we calculate the optical flow between adjacent frames and input it into the model. Both data enhancement and optical flow influence the performance of the model and improve its classification accuracy; we quantify the influence of these two factors on accuracy through experiments and determine which is the main reason for the improvement. Using the controlled variable method, we performed four sets of comparative experiments: (1) training with only the raw data; (2) applying only data enhancement to the original dataset; (3) adding optical flow images to the original dataset as model input; (4) applying data enhancement while also adding optical flow as model input. The other parameters of the four experiments were kept consistent: the optimizer is Adam and the learning rate is set to 0.001. The effect of data enhancement and optical flow images on model accuracy is shown in FIG. 5, and the effect on the training loss is shown in FIG. 6.
Figures 5 and 6 show the variation of accuracy and loss over 20 epochs for the four controlled experiments. As can be seen, when only the original dataset is used for training, the accuracy of the model fluctuates greatly during training. Although the final accuracy is not much different from that of the model with data enhancement applied, the loss curve shows that the model trained only on the original data begins to fluctuate from the 6th epoch, with the loss rebounding after each decrease and not decreasing in the end; we can therefore conclude that the model trained only on the original data has overfitted. In addition, the accuracy of the model using both data enhancement and optical flow reaches 86.6%, while the model using only data enhancement reaches 79.1% and the model using only optical flow reaches 81.7%; the effect of optical flow on model accuracy is thus more significant than that of data enhancement.
F1 score
The F1 score is an index used in statistics to measure the accuracy of a binary classification model; it takes into account both the precision and the recall of the model. The F1 score can be viewed as the harmonic mean of precision and recall, with a maximum of 1 and a minimum of 0. It is calculated by the following formula:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
the accuracy and recall may be calculated from the confusion matrix in table 2. True examples (TP) in the table indicate that the predicted example is actually Positive, and the prediction is also Positive; the True Negative case (TN, True Negative) indicates that the actual is Negative and the prediction is also Negative; false Positive case (FP) indicates that actually negative, but predicted Positive; false Negative (FN) indicates that actually positive, but predicted Negative. The accuracy and recall calculation is as follows:
Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
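The three formulas above translate directly into a small helper that computes precision, recall and F1 from the confusion-matrix counts:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN),
    F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# illustrative counts, not the patent's experimental data
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```

With equal false positives and false negatives, precision, recall and F1 coincide; the harmonic mean only drops below the arithmetic mean when the two rates diverge.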
TABLE 2 confusion matrix

                      Predicted positive        Predicted negative
Actually positive     TP (true positive)        FN (false negative)
Actually negative     FP (false positive)       TN (true negative)
Table 3 Precision and recall of each category on the validation set
Table 3 shows the model's precision and recall on the validation set. The precision of the yawning category is the highest: the yawning action includes the processes of opening and closing the mouth, so compared with the other categories its features are more obvious, the facial changes are larger, and the model learns its features more easily. The recall and precision of slow blinking are the lowest, because slow blinking is related only to the movement of the eyes, and the eye region occupies a small proportion of the face, which hinders recognition; recognition is even harder in scenes with glasses or sunglasses. As can be seen from the table, the slow blinking category has 699 samples, of which 137 are predicted as normal driving, which means the model easily confuses the normal driving and slow blinking categories, and its ability to distinguish the two remains to be improved. Nevertheless, the classification accuracy of the model as a whole is good, and by calculation the model's F1 score is 0.861.
NTHU-DDD is the most widely used drowsiness detection dataset, and many algorithms use it as a benchmark for comparison with other algorithms. However, these algorithms process the NTHU-DDD dataset differently: some convert the videos into images and train using the frame-level annotations in the dataset, while others cut the dataset into video segments and train after converting the frame-level annotations into segment-level annotations. The present invention adopts the latter approach, training with segment-level labels, and therefore selects existing methods that use the same segment-level labeling for comparison.
Yu et al. proposed a model based on a three-dimensional convolutional neural network for drowsiness detection, which consists of three modules for representation learning, scene understanding and feature fusion, respectively. The model generates a spatio-temporal representation from multiple successive frames and analyzes scene conditions defined by head, eye and mouth movements. The analysis results of the scene condition understanding model are then used as auxiliary information for drowsiness detection. Finally, the method generates fusion features using the spatio-temporal representation, the scene conditions and the classification results; it is shown that fusing features improves the performance of drowsiness detection. The method uses two feature fusion strategies: IAA (Independent-Averaged Architecture) and FFA (Feature-Fused Architecture). A Condition-Adaptive Learning Framework (CARLF) is then proposed, which contains 4 models: spatio-temporal representation learning, scene understanding, feature fusion and drowsiness detection. The features extracted by spatio-temporal representation learning describe motion in the video. Scene condition understanding classifies various conditions of the driver and scene conditions related to the driving situation, such as whether glasses are worn, the lighting conditions, and the movement of facial elements such as the head, eyes and mouth. Feature fusion generates a condition-adaptive representation using the two kinds of features extracted by the models above, and the drowsiness detection model uses this condition-adaptive representation to identify the drowsiness state of the driver. The condition-adaptive representation learning framework can extract more discriminative features for each scene condition, so the drowsiness detection method can provide more accurate results under various driving conditions.
A comparison of the results of the proposed method with the above methods is shown in Table 4. It can be seen that the method provided by the invention is clearly superior to the existing methods; due to the introduction of optical flow, the model learns more features beneficial to motion recognition.
Table 4 compares the results with the prior art method
Temporal complexity of model
Due to the introduction of the three-dimensional convolutional neural network, the calculation amount of the model is greatly increased compared with that of the two-dimensional convolutional neural network. Theoretically, the time complexity of the drowsiness action detection model based on the three-dimensional convolution neural network is
T = O( Σ_{i=1}^{d} W_i · H_i · D_i · n_i · m_i · k_i )

where d represents the number of convolutional layers in the model; W_i, H_i and D_i represent the width, height and depth (number of frames in the image sequence) of the input feature map of the ith convolutional layer; and n_i, m_i and k_i represent the width, height and depth of its 3D convolution kernel. The model consumes a large amount of computing resources during training, but at prediction time only 0.2 seconds are needed to predict a 3 s video segment; adding the time spent computing optical flow, the frame rate of the whole scheme is 25.3 fps, which basically meets the standard of real-time detection.
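The per-layer product in the complexity expression can be tallied in a few lines of Python; the layer dimensions below are illustrative placeholders, not the patent's exact configuration:

```python
def conv3d_cost(layers):
    """Sum W_i*H_i*D_i*n_i*m_i*k_i over layers, i.e. the kernel-volume
    multiply count per output position in the expression above."""
    return sum(W * H * D * n * m * k for (W, H, D, n, m, k) in layers)

# (W, H, D) input feature-map size, (n, m, k) kernel size -- illustrative
layers = [(224, 224, 10, 3, 3, 3), (112, 112, 5, 3, 3, 3)]
total = conv3d_cost(layers)
```

Note how the extra depth factors D_i and k_i, absent from the 2D case, account for the increased computation of the three-dimensional model discussed above.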
In conclusion, optical flow images are used as an input stream and fed into a drowsiness action recognition model based on a three-dimensional convolutional neural network for feature extraction, and the final experimental results show that the model learns the motion information contained in the optical flow images well. The scheme first extracts 10 frames from a 3-second video segment at average intervals, converts them into grayscale images, and calculates the optical flow between adjacent frames using the LK optical flow method. The 10-frame grayscale image sequence and the 9-frame optical flow image sequence are then input into the pre-trained drowsiness action recognition model. Because the model is based on a three-dimensional convolutional neural network, it has one more (temporal) dimension than a two-dimensional convolutional neural network, so it can better extract the temporal feature representations in the image sequence, which benefits action recognition. The model can identify four classes of actions, including one non-drowsiness action (normal driving) and three drowsiness actions (yawning, slow blinking and nodding); finally, through the model's computation, the input image sequence is converted into a 1 × 4 vector representing the probability of each class, from which it is determined whether the input contains a drowsiness action.
Experimental results show that the final classification accuracy of the proposed method is 86.6%, and it can effectively detect driver drowsiness. Compared with existing methods based on vehicle parameters, the proposed method has higher detection accuracy. Compared with methods based on physiological parameters, the proposed method is more convenient: no electrodes need to be attached to the driver's body, which would interfere with normal driving, and the equipment cost is lower. Existing computer-vision detection methods based on handcrafted features commonly use the PERCLOS and FOM metrics for drowsiness detection, but generally one such algorithm can detect only one drowsiness action, so the detection efficiency is not high. The proposed method only needs the drowsiness action recognition model to be trained in advance; one model can detect multiple drowsiness actions, and the detection efficiency is higher.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (6)

1. A method for intelligently recognizing a drowsiness action of a driver, characterized by comprising the following steps:
step one: acquiring a video stream of the driver during driving;
step two: preprocessing the video stream to obtain grayscale image information and optical flow image information of the video stream;
step three: taking the grayscale image information and the optical flow image information as the input of a drowsiness action recognition model, thereby obtaining the drowsiness action of the driver.
2. The method for intelligently recognizing the drowsiness action of the driver according to claim 1, wherein the preprocessing in step two comprises:
step 2-1: dividing the video stream into a plurality of video segments at a certain time interval;
step 2-2: extracting the image information of each video segment frame by frame, and converting each frame of image information to obtain a grayscale image sequence;
step 2-3: calculating the optical flow between each pair of adjacent frames, thereby obtaining an optical flow image sequence.
3. The method for intelligently recognizing the drowsiness action of the driver according to claim 1, wherein the time interval in step 2-1 is 3 seconds, i.e., a video segment is intercepted every 3 seconds.
4. The method for intelligently recognizing the drowsiness action of the driver according to claim 1, wherein each frame of image has a size of 224 × 224.
5. The method for intelligently recognizing the drowsiness action of the driver according to claim 1, wherein the recognized driver actions comprise: normal driving, nodding, slow blinking and yawning.
6. The method for intelligently recognizing the drowsiness action of the driver according to claim 1, wherein the drowsiness action recognition model comprises: convolutional layers, pooling layers, fully connected layers and a softmax classifier connected in sequence;
wherein there are 4 convolutional layers: the first convolutional layer comprises 8 convolution kernels, each of size 3 × 3; the second and third convolutional layers each comprise 16 convolution kernels; and the fourth convolutional layer comprises 8 convolution kernels;
the fully connected layers comprise two layers, the fc1 fully connected layer and the fc2 fully connected layer; the fc1 layer has 23520 neurons and the fc2 layer has 64 neurons.
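The parameter budget of the model in claim 6 can be sketched with simple bookkeeping. Several assumptions are made here: the kernels are taken as 3 × 3 × 3 (the claim writes 3 × 3, but the description specifies a three-dimensional convolutional network), the first layer is assumed to take a single grayscale input channel, and only the fc2 and classifier weights are counted because the claim does not state the input width of fc1.

```python
def conv3d_params(in_ch, out_ch, k=3):
    """Weights (out * in * k^3) plus one bias per output kernel."""
    return out_ch * (in_ch * k ** 3 + 1)

# Channel plan from claim 6: 8 -> 16 -> 16 -> 8 kernels per layer.
# The input channel count (1, grayscale) is an assumption, not in the claim.
conv_layers = [(1, 8), (8, 16), (16, 16), (16, 8)]
conv_params = sum(conv3d_params(i, o) for i, o in conv_layers)

fc2_params = 23520 * 64 + 64   # fc1 (23520 neurons) -> fc2 (64 neurons)
head_params = 64 * 4 + 4       # fc2 -> softmax over the 4 action classes
```

Under these assumptions the four convolutional layers hold about 14 thousand parameters while the 23520 → 64 fully connected stage holds about 1.5 million, so nearly all of the model's capacity sits in that one layer.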
CN202110650708.1A 2021-06-10 2021-06-10 Method for intelligently recognizing drowsiness action of driver Pending CN113408389A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110650708.1A CN113408389A (en) 2021-06-10 2021-06-10 Method for intelligently recognizing drowsiness action of driver

Publications (1)

Publication Number Publication Date
CN113408389A true CN113408389A (en) 2021-09-17

Family

ID=77683634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110650708.1A Pending CN113408389A (en) 2021-06-10 2021-06-10 Method for intelligently recognizing drowsiness action of driver

Country Status (1)

Country Link
CN (1) CN113408389A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188637A (en) * 2019-05-17 2019-08-30 西安电子科技大学 A kind of Activity recognition technical method based on deep learning
CN110543848A (en) * 2019-08-29 2019-12-06 交控科技股份有限公司 Driver action recognition method and device based on three-dimensional convolutional neural network
CN112699802A (en) * 2020-12-31 2021-04-23 青岛海山慧谷科技有限公司 Driver micro-expression detection device and method
CN112766145A (en) * 2021-01-15 2021-05-07 深圳信息职业技术学院 Method and device for identifying dynamic facial expressions of artificial neural network
CN112800988A (en) * 2021-02-02 2021-05-14 安徽工业大学 C3D behavior identification method based on feature fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Mao Hongyun: "Research on Driver Drowsiness Detection Method Based on Convolutional Neural Network", China Master's Theses Full-text Database, Engineering Science and Technology II, no. 2, pages 035-198 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311539A (en) * 2023-05-19 2023-06-23 亿慧云智能科技(深圳)股份有限公司 Sleep motion capturing method, device, equipment and storage medium based on millimeter waves
CN116311539B (en) * 2023-05-19 2023-07-28 亿慧云智能科技(深圳)股份有限公司 Sleep motion capturing method, device, equipment and storage medium based on millimeter waves


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination