CN110502995B - Driver yawning detection method based on fine facial action recognition - Google Patents

Driver yawning detection method based on fine facial action recognition

Info

Publication number
CN110502995B
CN110502995B
Authority
CN
China
Prior art keywords
frames
driver
yawning
key frame
fine
Prior art date
Legal status
Active
Application number
CN201910658690.2A
Other languages
Chinese (zh)
Other versions
CN110502995A (en)
Inventor
闵卫东
杨浩
韩清
熊辛
张愚
汪琦
Current Assignee
Nanchang University
Original Assignee
Nanchang University
Priority date
Filing date
Publication date
Application filed by Nanchang University filed Critical Nanchang University
Priority to CN201910658690.2A priority Critical patent/CN110502995B/en
Publication of CN110502995A publication Critical patent/CN110502995A/en
Application granted granted Critical
Publication of CN110502995B publication Critical patent/CN110502995B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a driver yawning detection method based on fine facial action recognition, which comprises the following steps: step 1, preprocessing the driving video captured by a vehicle-mounted camera, detecting and segmenting the driver's face, and normalizing the image size and denoising; step 2, extracting key frames from the fine-action sequence with a key frame extraction algorithm that combines picture-histogram similarity threshold screening with elimination of outlier-similarity pictures; step 3, establishing, on the selected key frames, a 3D deep learning network with a low temporal sampling rate (3D-LTS) to detect various yawning behaviors: key frames of fine actions are extracted by the key frame extraction algorithm, and the 3D-LTS network then extracts spatio-temporal features and detects various fine facial actions. The method is superior to existing methods in recognition rate and overall performance, can effectively distinguish yawning from other subtle facial actions, and effectively reduces the false detection rate of driver yawning actions.

Description

Driver yawning detection method based on fine facial action recognition
Technical Field
The invention relates to the technical field of computer vision, in particular to a yawning detection method for a driver based on fine facial action recognition.
Background
Intelligent driving, including providing early warning signals, monitoring and assisting vehicle control, has been a popular research topic in recent years for improving road safety. Every year, thousands of people die or are severely injured because drivers fall asleep at the wheel; driver fatigue severely threatens road safety. A survey by the National Highway Traffic Safety Administration showed that more than one third of respondents acknowledged experiencing fatigue while driving, and 10% acknowledged having had a fatigue-related accident in the past month or year. Researchers have found that driver fatigue causes 22% of traffic accidents, and that, without any warning, fatigued driving makes a collision or near-collision six times as likely as normal driving. It is therefore very important to study methods of recognizing driver fatigue to improve road safety. Over the past decades, many driver fatigue detection methods have been proposed to help drivers drive safely and to improve traffic safety. The behavioral characteristics of a fatigued driver include blinking, nodding, closing the eyes and yawning. Among these behaviors, yawning is one of the main manifestations of fatigue, so researchers have studied yawning detection extensively. Compared with conventional whole-body action recognition, facial actions can be regarded as subtle movements.
Although many researchers have proposed different methods to detect yawning, significant challenges remain. Because of the complex facial actions and expressions of drivers in real driving environments, existing methods struggle to detect yawning accurately and robustly; in particular, false detections occur easily when the mouth deformation of some facial actions and expressions resembles yawning. Facing the new characteristics and challenges of the driving environment, how to detect driver yawning behavior quickly and accurately is therefore a topic that deserves study.
Disclosure of Invention
The invention aims to solve the problem that existing detection algorithms for driver yawning actions cannot effectively distinguish some special yawning actions and yawning-like actions, such as yawning while singing, or yawning-like actions such as shouting, and provides a driver yawning detection method based on fine facial action recognition.
In order to achieve this purpose, the invention provides the following technical scheme: the driver yawning detection method based on fine facial action recognition comprises the following steps:
step 1, preprocessing a driver driving video captured by a vehicle-mounted camera, detecting and segmenting a human face, and normalizing and denoising the size of an image;
step 2, a key frame extraction algorithm is provided, and key frames in the fine action sequence are extracted through a method combining picture histogram similarity threshold screening and outlier similarity picture elimination;
step 3, establishing a 3D deep learning network with a low temporal sampling rate (3D-LTS) according to the selected key frames to detect various yawning behaviors.
Further, preprocessing the driver driving video captured by the vehicle-mounted camera includes: detecting the driver's face area with the Viola-Jones face detection algorithm, segmenting the driver's face area, and denoising with a fast median filtering method.
Further, the key frame extraction algorithm extracts a series of key frames K = {K_i | i = 1, …, M} from a series of original video frames F = {F_j | j = 1, …, N}, where M denotes the number of key frames selected from the original frames and N denotes the number of original frames. The key frame extraction algorithm comprises two selection stages:
In the first selection stage, the RGB color histogram of each video frame is calculated; then the Euclidean distance is used to calculate the similarity between the color histograms Γ_j and Γ_{j+1} of two consecutive frames:
S_j = sqrt( Σ_{k=1}^{n} (Γ_j(k) - Γ_{j+1}(k))^2 )    (1)
where 1 ≤ j ≤ N-1 and n is the dimension of the image color histogram.
The similarity threshold T_s is calculated by formula (2):
T_s = μ_s    (2)
where μ_s is Mean(S) and S is the set of all S_j. When S_j > T_s, F_j and F_{j+1} are considered less similar, and F_j is added to the candidate key frame queue.
In the second selection stage, candidate key frame pictures with outlier features are removed from the candidate key frames to obtain the final key frames. Two image similarity metrics are used: the Euclidean Distance (ED) and the Root Mean Square Error (RMSE), and the median absolute deviation (MAD) is used to detect frames with outliers. The MAD is calculated according to equation (3):
MAD = median(|X_i - median(X)|)    (3)
Two consecutive candidate key frames are denoted K_i and K_{i+1}. For each pair K_{i,i+1}, the RMSE and ED values are calculated; over all RMSE(K_{i,i+1}) and ED(K_{i,i+1}) values, the MAD values α = MAD(RMSE) and β = MAD(ED) are computed. RMSE(K_{i,i+1}) is given by formula (4) and ED(K_{i,i+1}) by formula (5). When RMSE(K_{i,i+1}) is less than α and ED(K_{i,i+1}) is less than β, K_i is considered a candidate key frame with outlier features and is removed from the candidate key frames.
RMSE(K_{i,i+1}) = sqrt( (1/n) Σ_{k=1}^{n} (K_i(k) - K_{i+1}(k))^2 )    (4)
where n represents the size of K_i.
ED(K_{i,i+1}) = sqrt( Σ_{k=1}^{m} (Γ_i(k) - Γ_{i+1}(k))^2 )    (5)
where m represents the picture color histogram dimension of the candidate key frame.
Further, the 3D-LTS network is used for spatio-temporal feature extraction and fine action recognition. The 3D-LTS network uses 8 non-overlapping frames as input and four 3D convolutional layers to extract spatio-temporal features from the consecutive frames. All convolution filters are 3 × 3 × 3 with stride 1 × 1 × 1, and all pooling layers use max pooling. The kernel size of the first and second pooling layers is 1 × 2 × 2, the numbers of filters of the four convolutional layers are 32, 64, 128 and 256 respectively, and the kernel size of the third pooling layer is 2 × 4 × 4. The convolutional layers are followed by a fully connected layer for mapping features, and a fully connected layer with 1024 outputs is used to integrate the distribution of features.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a fine facial action recognition-based yawning detection method for a driver, which comprises the following steps of firstly, providing a two-stage key frame extraction algorithm, wherein the algorithm has the advantages of high calculation speed and capability of effectively extracting a fine action key frame from an original frame sequence; secondly, the invention also provides a fine motion recognition network based on the three-dimensional convolution network, which is used for extracting space-time characteristics and detecting various facial fine motions; the method provided by the invention is superior to the existing method in the aspects of identification rate and overall performance, can effectively distinguish yawning from other facial fine actions, and effectively reduces the false detection rate of yawning actions of a driver.
Drawings
FIG. 1 is a block diagram of the driver yawning detection method based on fine facial action recognition of the present invention;
FIG. 2 is a representation of key frame extraction results in accordance with the present invention;
FIG. 3 is a comparison of two-dimensional convolution and three-dimensional convolution;
FIG. 4 is a structural diagram of the 3D-LTS network proposed by the present invention;
FIG. 5 is a sample of some frames from the YawDDR dataset;
FIG. 6 is a sequence of images of two actions in the YawDDR dataset (a) speaking (b) yawning;
FIG. 7 is a high definition camera and driver position map of the present invention;
FIG. 8 is an image sequence of three facial movements in the MFAY dataset of the present invention: (a) shouting, (b) singing, (c) yawning while singing;
FIG. 9 is a graph of video sequences in the MFAY data set of the present invention;
FIG. 10 is a graph of the results of the method of the present invention and four advanced methods on the MFAY dataset.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. The embodiments described herein are only for explaining the technical solution of the present invention and are not intended to limit the present invention.
The invention provides the following technical scheme: the overall framework of the driver yawning detection method based on fine facial action recognition is shown in FIG. 1, and the method comprises the following steps:
step 1, preprocessing a driver driving video captured by a vehicle-mounted camera, detecting and segmenting a human face, and normalizing and denoising the size of an image;
step 2, a key frame extraction algorithm is provided, and key frames in the fine action sequence are extracted through a method combining picture histogram similarity threshold screening and outlier similarity picture elimination;
step 3, establishing a 3D deep learning network with a low temporal sampling rate (3D-LTS) according to the selected key frames to detect various yawning behaviors.
Video preprocessing is an important step in this work. Because driver yawning detection relates to fatigue driving, real-time performance is required, so fast and effective video processing techniques must be employed to process the video recorded by the vehicle-mounted camera. We first split the video into frames. The resulting pictures contain a lot of redundant information, such as background, which is not useful for the subsequent classification and can cause considerable interference. Our goal is to classify the actions of the driver's face, so our region of interest is the driver's face region. We use the Viola-Jones face detection algorithm for driver face region detection. The Viola-Jones algorithm is fast, stable and accurate, and is the most widely used face detection algorithm, although it does not detect faces beyond a certain rotation angle well. In the invention, the camera faces the driver directly, so a 100% face detection rate can be ensured. After detecting the face region, we resize the consecutive frames uniformly to 200 × 200.
In a real driving environment, the movement of the automobile causes vibration, which introduces noise and interference into the video captured by the vehicle-mounted camera. To reduce this interference as much as possible, a fast median filtering method is adopted for denoising. The fast median filtering method is a GPU-accelerated version of median filtering. Median filtering can effectively remove scattered-point noise and salt-and-pepper noise, and most of the noise generated by vibration belongs to these two types. Therefore, fast median filtering achieves a good denoising effect and minimizes the interference caused by noise.
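For illustration only, the sketch below shows how such a preprocessing stage could look in Python with OpenCV; it is not the patented implementation. The Haar-cascade file, the 3 × 3 median kernel and the handling of frames with no detected face are assumptions made for the example (OpenCV's median filter here stands in for the GPU-accelerated fast median filter mentioned above).

```python
# Illustrative preprocessing sketch: face detection, cropping, resizing, denoising.
import cv2

# Bundled Haar cascade as a stand-in for the Viola-Jones detector.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess_frame(frame):
    """Detect the driver's face, crop it, resize to 200x200 and denoise."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                              # no face found in this frame (assumed behavior)
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest detection
    face = frame[y:y + h, x:x + w]
    face = cv2.resize(face, (200, 200))          # normalize size as described in the text
    face = cv2.medianBlur(face, 3)               # median filtering removes salt-and-pepper noise
    return face
```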
In the pre-processing stage, the video is divided into a frame sequence at 30 frames per second. Since the difference in information between adjacent frames of the original sequence is small, a large number of redundant frames exist, and these redundant frames degrade the accuracy of action classification, especially for fine actions with small motion amplitude. To solve this problem, the invention proposes an effective real-time key frame extraction algorithm based on image similarity threshold screening and outlier similarity elimination. FIG. 2 shows a sequence of key frames selected by the algorithm.
The key frame extraction algorithm proposed by the present invention extracts a series of key frames K = {K_i | i = 1, …, M} from a series of original video frames F = {F_j | j = 1, …, N}, where M denotes the number of key frames selected from the original frames and N denotes the number of original frames. The invention combines threshold-based histogram similarity filtering with outlier detection. The picture histogram has the advantage of low computational cost, and compared with local features, global features such as image distance and histogram can effectively reduce false alarms in classification.
The key frame extraction algorithm provided by the invention comprises two selection stages:
In the first selection stage, the RGB color histogram of each video frame is calculated; then the Euclidean distance is used to calculate the similarity between the color histograms Γ_j and Γ_{j+1} of two consecutive frames:
S_j = sqrt( Σ_{k=1}^{n} (Γ_j(k) - Γ_{j+1}(k))^2 )    (1)
where 1 ≤ j ≤ N-1 and n is the dimension of the image color histogram. Through this calculation we obtain a set S of similarities containing the values S_j for consecutive frames F_j and F_{j+1}. We need to determine a similarity threshold T_s to select key frames; this threshold should represent the average level of similarity between frames. We considered two threshold calculation methods: half of the sum of the maximum and minimum similarity, and the average similarity. We used these two thresholds to select key frames from our self-collected dataset and from the processed YawDD reference dataset; our network was trained on the self-collected dataset and tested on the processed YawDD dataset. The results are shown in Table 1, where S represents the similarity set and YT is an abbreviation for yawning while talking. From the results we can see that using the average similarity as the threshold allows our yawning detection method to achieve the best overall result. Half of the sum of the maximum and minimum frame similarities fuses two extreme similarities, and such a threshold cannot represent the average similarity of these facial movements. The average similarity as the threshold selects the most representative key frames.
TABLE 1 Experimental results for the two thresholds (unit: %)
The similarity threshold T_s is calculated by formula (2):
T_s = μ_s    (2)
where μ_s is Mean(S) and S is the set of all S_j. When S_j > T_s, F_j and F_{j+1} are considered less similar, and F_j is added to the candidate key frame queue.
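As a rough illustration of this first selection stage, the following Python sketch computes RGB histograms, the consecutive-frame distances S_j of equation (1) and the mean threshold T_s of equation (2). The number of histogram bins and the histogram normalization are assumptions, since the text does not fix them.

```python
# Illustrative sketch of stage one: candidate key frame selection by histogram similarity.
import cv2
import numpy as np

def rgb_histogram(frame, bins=8):
    """Flattened RGB color histogram of one frame, normalized to unit sum (assumed)."""
    hist = cv2.calcHist([frame], [0, 1, 2], None, [bins, bins, bins], [0, 256] * 3)
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-9)

def select_candidates(frames):
    """Keep frame F_j whenever S_j exceeds the mean similarity threshold T_s."""
    hists = [rgb_histogram(f) for f in frames]
    # Euclidean distance between consecutive histograms, Eq. (1)
    s = np.array([np.linalg.norm(hists[j] - hists[j + 1])
                  for j in range(len(frames) - 1)])
    t_s = s.mean()                                   # similarity threshold T_s, Eq. (2)
    candidates = [frames[j] for j in range(len(s)) if s[j] > t_s]
    return candidates
```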
In the second selection stage, candidate key frame pictures with outlier features are removed from the candidate key frames to obtain the final key frames. Two image similarity metrics are used: the Euclidean Distance (ED) and the Root Mean Square Error (RMSE), and the median absolute deviation (MAD) is used to detect frames with outliers. The MAD is calculated according to equation (3):
MAD = median(|X_i - median(X)|)    (3)
Two consecutive candidate key frames are denoted K_i and K_{i+1}. For each pair K_{i,i+1}, the RMSE and ED values are calculated; over all RMSE(K_{i,i+1}) and ED(K_{i,i+1}) values, the MAD values α = MAD(RMSE) and β = MAD(ED) are computed. RMSE(K_{i,i+1}) is given by formula (4) and ED(K_{i,i+1}) by formula (5). When RMSE(K_{i,i+1}) is less than α and ED(K_{i,i+1}) is less than β, K_i is considered a candidate key frame with outlier features and is removed from the candidate key frames.
RMSE(K_{i,i+1}) = sqrt( (1/n) Σ_{k=1}^{n} (K_i(k) - K_{i+1}(k))^2 )    (4)
where n represents the size of K_i.
ED(K_{i,i+1}) = sqrt( Σ_{k=1}^{m} (Γ_i(k) - Γ_{i+1}(k))^2 )    (5)
where m represents the picture color histogram dimension of the candidate key frame.
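The second selection stage could be sketched as follows; this is again an illustrative assumption rather than the patented code. Pixel arrays are compared directly for the RMSE of equation (4), the stage-one histograms are reused for the ED of equation (5), and keeping the final candidate (which has no successor) is an assumption of the example.

```python
# Illustrative sketch of stage two: MAD-based removal of outlier candidate key frames.
import numpy as np

def mad(values):
    """Median absolute deviation, Eq. (3)."""
    values = np.asarray(values, dtype=np.float64)
    return np.median(np.abs(values - np.median(values)))

def remove_outlier_candidates(candidates, histograms):
    """Drop candidate K_i when both RMSE(K_i, K_i+1) < alpha and ED(K_i, K_i+1) < beta."""
    rmse, ed = [], []
    for i in range(len(candidates) - 1):
        diff = candidates[i].astype(np.float64) - candidates[i + 1].astype(np.float64)
        rmse.append(np.sqrt(np.mean(diff ** 2)))                        # Eq. (4)
        ed.append(np.linalg.norm(histograms[i] - histograms[i + 1]))    # Eq. (5)
    alpha, beta = mad(rmse), mad(ed)
    keep = [candidates[i] for i in range(len(candidates) - 1)
            if not (rmse[i] < alpha and ed[i] < beta)]
    keep.append(candidates[-1])   # assumption: the last candidate, having no successor, is kept
    return keep
```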
Another important contribution of the present invention is introducing an action recognition mechanism into yawning detection. In recent years, action recognition has improved greatly in both accuracy and speed, and researchers have proposed various networks to recognize actions. Widely used frameworks for action recognition are two-stream fusion networks and 3D convolutional networks.
3D convolutional networks have attracted much attention in action recognition, scene and object recognition, and action similarity analysis. Compared with other spatio-temporal feature extraction methods based on two-stream networks, the three-dimensional convolutional network has the advantages of fast computation and high accuracy. Some researchers have attempted to stack the consecutive feature maps of 2D convolutions to classify video actions, but temporal information is lost during the 2D convolution process. In contrast, 3D convolutional networks use a number of consecutive video frames as input, as shown in FIG. 3, and model temporal information better through 3D convolution and 3D pooling operations. Experiments show that 3D convolution kernels of size 3 × 3 × 3 extract the most representative spatio-temporal features.
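To make this difference concrete, the short snippet below (PyTorch is an illustrative assumption; the frame count and resolution are arbitrary) shows that a 2D convolution has to fold the frame sequence into its channel axis and so loses the temporal dimension, while a 3D convolution keeps a temporal axis in its output.

```python
# Shape comparison of 2D versus 3D convolution on a clip of 8 frames.
import torch
import torch.nn as nn

frames = torch.randn(1, 3, 8, 112, 112)          # batch, channels, 8 frames, height, width

# 2D convolution: frames are merged into the channel axis, so the temporal
# dimension disappears from the output feature map.
conv2d = nn.Conv2d(3 * 8, 64, kernel_size=3, padding=1)
out2d = conv2d(frames.reshape(1, 3 * 8, 112, 112))
print(out2d.shape)                               # torch.Size([1, 64, 112, 112])

# 3D convolution: the temporal dimension is preserved, so deeper layers can
# still model motion across frames.
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)
out3d = conv3d(frames)
print(out3d.shape)                               # torch.Size([1, 64, 8, 112, 112])
```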
Based on three-dimensional convolution, the invention proposes a 3D-LTS network with a low temporal sampling rate for extracting spatio-temporal features and recognizing fine actions. The 3D-LTS network uses 3D convolution to extract spatio-temporal features and a softmax layer for classification. After data pre-processing and key frame selection, it is important to determine how many frames to use as input to the 3D-LTS network to obtain the best recognition performance. We compared the results of the 3D-LTS network with different numbers of input frames; our network was trained on the self-collected dataset and tested on the processed YawDD dataset. The results are shown in Table 2. The overall recognition results show that our 3D-LTS network is not very sensitive to the number of input frames, and that the network performs better using 8 non-overlapping frames as input. 3D-LTS uses four 3D convolutional layers to extract spatio-temporal features from consecutive frames; its structure is shown in FIG. 4. As the structure shows, all convolution filters are 3 × 3 × 3 with stride 1 × 1 × 1, and all pooling layers use max pooling. If we slow the pooling rate of the shallow pooling layers in the time dimension, the deep pooling layers can extract more representative temporal features from the shallow layers, which is important for recognizing subtle behaviors. Based on this analysis, the kernel size of the first and second pooling layers in our 3D-LTS is 1 × 2 × 2. The four convolutional layers have 32, 64, 128 and 256 filters, respectively, and the kernel size of the third pooling layer is 2 × 4 × 4. The convolutional layers are followed by a fully connected layer for mapping features; we use a fully connected layer with 1024 outputs to integrate the distribution of features. We found that our 3D-LTS achieves the best recognition performance when it is followed by one fully connected layer.
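A minimal PyTorch sketch of an architecture matching this description is given below. The padding of 1 on each 3D convolution, the ReLU activations and the six-class output layer are assumptions not stated in the text; the kernel sizes, filter counts and pooling kernels follow the description above.

```python
# Sketch of a 3D-LTS-style network under the stated assumptions.
import torch
import torch.nn as nn

class LTS3D(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),        # first pooling layer: 1x2x2
            nn.Conv3d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),        # second pooling layer: 1x2x2
            nn.Conv3d(64, 128, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(2, 4, 4)),        # third pooling layer: 2x4x4
            nn.Conv3d(128, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1024), nn.ReLU(),             # fully connected layer with 1024 outputs
            nn.Linear(1024, num_classes),               # softmax is applied at inference / by the loss
        )

    def forward(self, x):                               # x: (batch, 3, 8 frames, H, W)
        return self.classifier(self.features(x))

# Example: 8 selected key frames of a 200x200 face crop.
logits = LTS3D()(torch.randn(1, 3, 8, 200, 200))
probs = torch.softmax(logits, dim=1)
```

With 8 key frames of a 200 × 200 face crop, the first two pooling layers keep all 8 frames while halving the spatial size twice, and the third pooling layer reduces the feature map to 4 × 12 × 12 before the fully connected layers.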
TABLE 2 Experimental results for different numbers of input frames (unit: %)
In the experiments of the invention, the system is first tested on the standard YawDD dataset, a public dataset that can be used to verify algorithms such as face detection, facial feature extraction and yawning detection. The dataset collects a series of action videos from volunteers of different genders, ages, countries and races and contains 351 videos; three to four videos are recorded for each driver, covering different mouth conditions such as speaking and yawning.
Since most video segments in the YawDD dataset last more than one minute and contain multiple facial actions, we divided them into segments containing only a single action. In this way we constructed a YawDDR dataset based on the YawDD dataset. The video length in the YawDDR dataset is about 8 seconds. There are three actions in this dataset: talking (T), yawning (Y) and yawning while talking (YT). 486 image sequences were collected in YawDDR. Some examples from the dataset (before and after face segmentation) are shown in FIG. 5 and FIG. 6. We use this dataset to verify the validity of our method.
Many facial datasets are used for identity recognition, facial expression recognition and face detection, but no public driver yawning detection dataset includes various facial actions. The purpose of collecting our dataset is to verify the efficiency of our method for driver yawning detection under various facial actions. Therefore, our MFAY dataset was constructed with an HD camera in an actual driving environment. We classified the facial actions that may occur during driving into six categories: talking (T), yawning while talking (YT), yawning (Y), singing (S), yawning while singing (YS) and shouting (ST). In view of the risk of fatigued driving, the collection sites were chosen on wide roads with few pedestrians. Without affecting driving, a mini high-definition camera was installed in front of each driver to capture facial actions. During the experiment, the drivers drove under different lighting and road conditions. Sitting in the front passenger seat, an investigator continuously monitored the changes in facial actions of each subject to annotate the ground truth for each facial action. The positions of the high-definition camera and the driver are shown in FIG. 7.
Facial videos of 20 subjects (aged 20 to 46) were obtained under different conditions while the car was in motion. A sample image from the MFAY dataset is shown in FIG. 8. All videos were converted to audio-video interleaved format at a video rate of 30 fps. Finally, as shown in FIG. 9, 347 image sequences (53652 images) were extracted from the recorded videos; each image sequence is approximately 5 seconds (150 frames) long.
The invention carries out the following three experiments based on the YawDDR data set and the MFAY data set:
experiment one: to demonstrate that our key frame extraction algorithm can effectively select key frames in a sequence of driving video frames, we performed the following experiments on the YawDDR and MFAY datasets. First, the picture histogram is used to remove the very small difference frames and select candidate key frames, which we record as stage one. In order to verify that the algorithm can effectively improve the recognition rate of various facial actions, the algorithm also provides a recognition result without using any key frame extraction algorithm. We refer to this case as "unused". The results are shown in Table 3. After stage one, accuracy is improved. On stage one basis, we use the MAD to cull candidate key-frames with outlier features. After this process we will get the required key frame. As can be seen from table three, our two-stage key frame extraction algorithm achieves the best recognition performance compared to stage one and no key frame selection. The effectiveness of our key frame extraction algorithm is verified.
TABLE 3 Experimental results for different key frame selection stages (unit: %)
Experiment two: in this experiment we focused on comparisons between our proposed method and other existing image-based methods. We compared our method with the kernelized fuzzy rough set-based method proposed by Du Y et al., the bi-fold agent expert system algorithm proposed by Anitha C et al., and two convolutional neural network-based approaches. To validate the effectiveness of our method, we adopted the following model training and testing strategy: the training set includes random video clips extracted from the MFAY dataset and the YawDDR dataset according to the class to which they belong, and the remaining video clips are used to test the models. All video clips are processed by our proposed key frame extraction algorithm, and the selected key frames are used to train and test the network models. Since image-based methods cannot effectively detect actions such as YT, the experimental results for those cases are not recorded in our tables and figures. As shown in Table 4 and FIG. 10, the recognition rate of the proposed yawning detection method based on fine facial actions and video key frames is significantly improved compared with the other methods. Our method outperforms the existing methods in recognizing various facial actions and effectively reduces false detections. The video-based method can extract sufficient spatio-temporal action features and realize dynamic yawning detection, which further verifies the robustness of our proposed method.
TABLE 4 Test results of the method of the present invention and four advanced methods on the YawDDR dataset (unit: %)
Experiment three: in this experiment we compared image-based and video-based approaches. Our method uses consecutive frames as input and is therefore video-based. For the image-based approach, the frame images in the YawDDR dataset and the MFAY dataset are used for training and testing: we extract frames evenly from both datasets and assign them labels according to the class to which they belong. The data processing steps and validation algorithms of these experiments are identical. The experimental results are shown in Table 5. The results show that the video-based approach performs better than the image-based approach, because yawning is a continuous action rather than a static one. The video-based approach can detect yawning in various facial situations. If only one frame is used for recognition, important temporal motion information between frames is lost, and features that indicate yawning may be confused with features of singing or shouting. In contrast, video-based methods provide sufficient spatio-temporal motion information and can classify actions through a sequence of action frames. Treating yawning as a motion rather than a static state significantly alleviates the large number of false detections of methods based on static image detection.
TABLE 5 Experimental results of the picture-based and video-based detection methods (unit: %)
Experiments show that the method provided by the invention is superior to existing methods in terms of recognition rate and overall performance, can effectively distinguish yawning from other fine facial actions, and effectively reduces the false detection rate of driver yawning actions.
The foregoing merely represents preferred embodiments of the invention, which are described in some detail, but should not be construed as limiting the scope of the invention. It should be noted that various changes, modifications and substitutions may be made by those skilled in the art without departing from the spirit of the invention, and all of them are intended to fall within the scope of the invention. Therefore, the protection scope of this patent should be subject to the appended claims.

Claims (2)

1. A driver yawning detection method based on fine facial action recognition, characterized by comprising the following steps:
step 1, preprocessing a driver driving video captured by a vehicle-mounted camera, detecting and segmenting a human face, and normalizing and denoising the size of an image;
step 2, a key frame extraction algorithm is provided, and key frames in the fine action sequence are extracted through a method combining picture histogram similarity threshold screening and outlier similarity picture elimination;
the key frame extraction algorithm extracts a series of key frames K = {K_i | i = 1, …, M} from a series of original video frames F = {F_j | j = 1, …, N}, where M denotes the number of key frames selected from the original frames and N denotes the number of original frames; the key frame extraction algorithm comprises two selection stages:
in the first selection stage, the RGB color histogram of each video frame is calculated; then the Euclidean distance is used to calculate the similarity between the color histograms Γ_j and Γ_{j+1} of two consecutive frames:
S_j = sqrt( Σ_{k=1}^{n} (Γ_j(k) - Γ_{j+1}(k))^2 )    (1)
wherein 1 ≤ j ≤ N-1 and n is the dimension of the image color histogram;
the similarity threshold T_s is calculated by formula (2):
T_s = μ_s    (2)
wherein μ_s is Mean(S) and S is the set of all S_j; when S_j > T_s, F_j and F_{j+1} are considered less similar, and F_j is added to the candidate key frame queue;
in the second selection stage, candidate key frame pictures with outlier features are removed from the candidate key frames to obtain the final key frames; two image similarity metrics are used: the Euclidean Distance (ED) and the Root Mean Square Error (RMSE), and the median absolute deviation (MAD) is used to detect frames with outliers, the MAD being calculated according to equation (3):
MAD = median(|X_i - median(X)|)    (3)
two consecutive candidate key frames are denoted K_i and K_{i+1}; for each pair K_{i,i+1}, the RMSE and ED values are calculated, and over all RMSE(K_{i,i+1}) and ED(K_{i,i+1}) values the MAD values α = MAD(RMSE) and β = MAD(ED) are computed; RMSE(K_{i,i+1}) is given by formula (4) and ED(K_{i,i+1}) by formula (5); when RMSE(K_{i,i+1}) is less than α and ED(K_{i,i+1}) is less than β, K_i is considered a candidate key frame with outlier features and is removed from the candidate key frames;
RMSE(K_{i,i+1}) = sqrt( (1/n) Σ_{k=1}^{n} (K_i(k) - K_{i+1}(k))^2 )    (4)
wherein n represents the size of K_i;
ED(K_{i,i+1}) = sqrt( Σ_{k=1}^{m} (Γ_i(k) - Γ_{i+1}(k))^2 )    (5)
wherein m represents the picture color histogram dimension of the candidate key frame;
step 3, establishing a 3D deep learning network, namely the 3D-LTS network, according to the selected key frames to detect various yawning behaviors;
the 3D-LTS network is used for spatio-temporal feature extraction and fine action recognition; the 3D-LTS network uses 8 non-overlapping frames as input and four 3D convolutional layers to extract spatio-temporal features from consecutive frames; all convolution filters are 3 × 3 × 3 with stride 1 × 1 × 1, and all pooling layers use max pooling; the kernel size of the first and second pooling layers is 1 × 2 × 2, the numbers of filters of the four convolutional layers are 32, 64, 128 and 256 respectively, and the kernel size of the third pooling layer is 2 × 4 × 4; the convolutional layers are followed by a fully connected layer for mapping features, and a fully connected layer with 1024 outputs is used to integrate the distribution of features.
2. The driver yawning detection method based on fine facial action recognition as claimed in claim 1, wherein the preprocessing of the driver driving video captured by the vehicle-mounted camera comprises: detecting the driver's face area by adopting a Viola-Jones face detection algorithm, segmenting the driver's face area, and denoising by adopting a fast median filtering method.
CN201910658690.2A 2019-07-19 2019-07-19 Driver yawning detection method based on fine facial action recognition Active CN110502995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910658690.2A CN110502995B (en) 2019-07-19 2019-07-19 Driver yawning detection method based on fine facial action recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910658690.2A CN110502995B (en) 2019-07-19 2019-07-19 Driver yawning detection method based on fine facial action recognition

Publications (2)

Publication Number Publication Date
CN110502995A CN110502995A (en) 2019-11-26
CN110502995B true CN110502995B (en) 2023-03-14

Family

ID=68586658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910658690.2A Active CN110502995B (en) 2019-07-19 2019-07-19 Driver yawning detection method based on fine facial action recognition

Country Status (1)

Country Link
CN (1) CN110502995B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445548B (en) * 2020-03-21 2022-08-09 南昌大学 Multi-view face image generation method based on non-paired images
CN113724211B (en) * 2021-08-13 2023-01-31 扬州美德莱医疗用品股份有限公司 Fault automatic identification method and system based on state induction

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203283A (en) * 2016-06-30 2016-12-07 重庆理工大学 Based on Three dimensional convolution deep neural network and the action identification method of deep video
CN107087211A (en) * 2017-03-30 2017-08-22 北京奇艺世纪科技有限公司 A kind of anchor shots detection method and device
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN108830157A (en) * 2018-05-15 2018-11-16 华北电力大学(保定) Human bodys' response method based on attention mechanism and 3D convolutional neural networks
CN108875610A (en) * 2018-06-05 2018-11-23 北京大学深圳研究生院 A method of positioning for actuation time axis in video based on border searching
CN109145823A (en) * 2018-08-22 2019-01-04 佛山铮荣科技有限公司 A kind of market monitoring device
CN109697434A (en) * 2019-01-07 2019-04-30 腾讯科技(深圳)有限公司 A kind of Activity recognition method, apparatus and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10595788B2 (en) * 2016-05-31 2020-03-24 Stmicroelectronics S.R.L. Method for the detecting electrocardiogram anomalies and corresponding system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203283A (en) * 2016-06-30 2016-12-07 重庆理工大学 Based on Three dimensional convolution deep neural network and the action identification method of deep video
CN107087211A (en) * 2017-03-30 2017-08-22 北京奇艺世纪科技有限公司 A kind of anchor shots detection method and device
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN108830157A (en) * 2018-05-15 2018-11-16 华北电力大学(保定) Human bodys' response method based on attention mechanism and 3D convolutional neural networks
CN108875610A (en) * 2018-06-05 2018-11-23 北京大学深圳研究生院 A method of positioning for actuation time axis in video based on border searching
CN109145823A (en) * 2018-08-22 2019-01-04 佛山铮荣科技有限公司 A kind of market monitoring device
CN109697434A (en) * 2019-01-07 2019-04-30 腾讯科技(深圳)有限公司 A kind of Activity recognition method, apparatus and storage medium

Also Published As

Publication number Publication date
CN110502995A (en) 2019-11-26

Similar Documents

Publication Publication Date Title
Yang et al. Driver yawning detection based on subtle facial action recognition
CN106354816B (en) video image processing method and device
CN107622258B (en) Rapid pedestrian detection method combining static underlying characteristics and motion information
Kopuklu et al. Driver anomaly detection: A dataset and contrastive learning approach
CN111860274B (en) Traffic police command gesture recognition method based on head orientation and upper half skeleton characteristics
CN110263712B (en) Coarse and fine pedestrian detection method based on region candidates
Wei et al. Unsupervised anomaly detection for traffic surveillance based on background modeling
Yao et al. When, where, and what? A new dataset for anomaly detection in driving videos
CN108416780B (en) Object detection and matching method based on twin-region-of-interest pooling model
Murugesan et al. Bayesian Feed Forward Neural Network-Based Efficient Anomaly Detection from Surveillance Videos.
Balasundaram et al. Abnormality identification in video surveillance system using DCT
Luo et al. Traffic analytics with low-frame-rate videos
CN110502995B (en) Driver yawning detection method based on fine facial action recognition
CN111738218A (en) Human body abnormal behavior recognition system and method
Henrio et al. Anomaly detection in videos recorded by drones in a surveillance context
Chato et al. Image processing and artificial neural network for counting people inside public transport
CN112801037A (en) Face tampering detection method based on continuous inter-frame difference
CN115861981A (en) Driver fatigue behavior detection method and system based on video attitude invariance
Alhothali et al. Anomalous event detection and localization in dense crowd scenes
Ma et al. Convolutional three-stream network fusion for driver fatigue detection from infrared videos
Xue et al. Real-time anomaly detection and feature analysis based on time series for surveillance video
Yang et al. Video anomaly detection for surveillance based on effective frame area
Boufares et al. Moving object detection system based on the modified temporal difference and otsu algorithm
KR102546598B1 (en) Apparatus And Method For Detecting Anomalous Event
CN116416565A (en) Method and system for detecting pedestrian trailing and crossing in specific area

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant