CN110717389A - Driver fatigue detection method based on generation of countermeasure and long-short term memory network - Google Patents

Driver fatigue detection method based on generation of countermeasure and long-short term memory network

Info

Publication number
CN110717389A
Authority
CN
China
Prior art keywords
network
fatigue
real
face sequence
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910824620.XA
Other languages
Chinese (zh)
Other versions
CN110717389B (en)
Inventor
路小波
胡耀聪
陆明琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201910824620.XA priority Critical patent/CN110717389B/en
Publication of CN110717389A publication Critical patent/CN110717389A/en
Application granted granted Critical
Publication of CN110717389B publication Critical patent/CN110717389B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/59 - Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 - Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a driver fatigue detection method based on generative adversarial and long short-term memory networks. The network architecture consists of a 3D conditional generative adversarial network and a bidirectional long short-term memory network. The 3D conditional generative adversarial network extracts fatigue-related features from short-term video clips: the 3D generation network uses an encoder-decoder U-NET as its backbone and generates video clips conditioned on fatigue-related labels, while the 3D discrimination network takes real and synthetic clips as input and extracts short-term spatio-temporal feature representations carrying fatigue-related information. The bidirectional long short-term memory network performs long-term spatio-temporal feature fusion, captures context information between frames, and finally outputs a fatigue detection result for each frame. Compared with existing driver fatigue detection methods, the method generalizes better and achieves higher recognition accuracy, and can be used for real-time driver fatigue recognition from surveillance video. The invention has important application value in the field of traffic safety.

Description

Driver fatigue detection method based on generation of countermeasure and long-short term memory network
Technical Field
The invention belongs to the field of image processing and pattern recognition, and relates to a driver fatigue detection method based on generative adversarial and long short-term memory networks.
Background
Fatigue driving refers to driving while sleep-deprived or drowsy, and typically manifests as yawning, eye closure, and head drooping. According to a survey by the Ministry of Transport of China, about 9,000 people die of fatigue driving in China every year, accounting for 6 percent of all traffic accident deaths. Fatigue driving severely impairs the driver's attention and is a constant threat to road safety, so monitoring in real time whether the driver is fatigued is of great significance for road safety and intelligent transportation.
Early fatigue detection systems typically relied on sensors, for example biosensors that monitor the driver's heart rate, retinal signals, or brain waves. However, such sensors must be attached to the body and may themselves distract the driver. In recent years, automatic driver fatigue recognition based on computer vision has therefore become a research focus. These methods rely on an in-vehicle camera that captures the driver's face in real time and on automatic analysis of the fatigue level through facial feature extraction. However, the recognition accuracy of current algorithms is still limited, mainly because of the following difficulties:
(1) Fatigue is an abstract state, and different drivers show large intra-class variance in how fatigue manifests; common hand-crafted features struggle to characterize this state.
(2) The judgment of fatigue driving is easily affected by scene conditions, such as illumination changes and eye occlusion caused by wearing glasses.
(3) Recognizing fatigue driving relies on long-term spatio-temporal characterization; short-term spatio-temporal features alone are insufficient to judge the current fatigue state and easily cause a high false-alarm rate.
Disclosure of Invention
The invention aims to solve the above problems and provides a driver fatigue detection method based on generative adversarial and long short-term memory networks. First, a face sequence is obtained with a face detection and tracking algorithm; a 3D generative adversarial network is designed to extract short-term spatio-temporal features; a bidirectional long short-term memory network is designed to fuse the spatio-temporal features; and finally the fatigue level of the driver in each frame is output.
To achieve this purpose, the invention adopts the following method. A driver fatigue detection method based on generative adversarial and long short-term memory networks comprises the following steps:
Step 1: Acquire a driver fatigue detection dataset. The invention uses the publicly available NTHU-DDD fatigue detection dataset, which contains 360 training videos (722,223 frames) and 20 test videos (173,259 frames), as shown in FIG. 1. All videos were recorded with an infrared camera in an indoor simulated-driving environment, and the participants were recorded in both normal driving and fatigued driving under different conditions. The scene conditions include: daytime without glasses, daytime with glasses, daytime with sunglasses, night without glasses, and night with glasses. The recorded videos have a resolution of 640 × 480 and a frame rate of 30 fps. Each video in the dataset has four annotation files recording the fatigue state of every frame, including the eye state (normal, closed), the mouth state (normal, yawning, talking), and the head state (normal, not looking ahead, head drooping).
In the invention, all 360 training videos of the dataset are used to train the 3D conditional generative adversarial network and the bidirectional long short-term memory network, and the remaining 20 videos are used for model testing.
Step 2: Design a face detection and tracking algorithm. In a fatigue detection system, the fatigue state is determined entirely from the face region, while the background region is redundant for fatigue detection. The invention combines detection and tracking to obtain the face region of each frame in the video: in the initial frame, the face is detected with the open-source MTCNN algorithm, and in subsequent frames the face region is tracked with a kernelized correlation filter (KCF). A minimal sketch of this detect-then-track loop is given below.
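The following sketch illustrates the detect-then-track scheme of step 2, assuming OpenCV's KCF tracker (from opencv-contrib-python) for the tracking stage; detect_face() is a hypothetical wrapper around an MTCNN detector and is not part of the patent text. Depending on the OpenCV build, the tracker constructor may live under cv2.legacy.TrackerKCF_create instead.

```python
import cv2

def detect_face(frame):
    """Hypothetical wrapper around an MTCNN face detector.
    Should return a bounding box (x, y, w, h) or None."""
    raise NotImplementedError

def extract_face_sequence(video_path, crop_size=64):
    """Detect the face in the first usable frame, then track it with KCF."""
    cap = cv2.VideoCapture(video_path)
    tracker, faces = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if tracker is None:                       # initial frame: detect
            box = detect_face(frame)
            if box is None:
                continue
            tracker = cv2.TrackerKCF_create()
            tracker.init(frame, box)
        else:                                     # subsequent frames: track
            ok, box = tracker.update(frame)
            if not ok:                            # tracking lost: fall back to detection
                tracker = None
                continue
        x, y, w, h = [int(v) for v in box]
        crop = frame[y:y + h, x:x + w]
        faces.append(cv2.resize(crop, (crop_size, crop_size)))
    cap.release()
    return faces                                  # list of per-frame face crops
```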
Step 3: Train the 3D conditional generative adversarial network. The network model consists of a 3D encoder-decoder generation network and a 3D discrimination network, as shown in FIG. 2.
Step 301: The 3D encoder-decoder generation network uses a U-NET as its backbone. The network input is a three-channel face sequence of T consecutive adjacent frames with size 3 × T × 64 × 64; after 3D encoding and decoding, the output is a synthetic face sequence of the same size as the input real face sequence. In the encoding sub-network, convolution kernels of size 3 × 3 × 3 are applied in multiple 3D convolutional layers to learn a global spatio-temporal feature representation, and a global average pooling layer maps the 3D convolutional feature maps into a 512-dimensional feature vector. The operation of the encoding sub-network can be expressed as:
$X = G_{en}(I_{real} \mid \theta_{en})$  (1)
where I_real denotes the input real face sequence, θ_en denotes the parameters of the encoding sub-network, and X denotes the encoded feature vector output by the encoding sub-network.
Step 302: The label information is embedded as a condition into the feature vector output by the encoding sub-network. Specifically, a noise code n and a class label code l are concatenated with the output X of the encoding network as the input of the decoding sub-network. The noise code n is a 100-dimensional random noise vector. The label code l is the concatenation of fatigue-related class information, specifically the fatigue state label l_drow, the illumination condition label l_ill, the glasses-wearing label l_gla, the eye state label l_eye, the mouth state label l_mou, and the head state label l_head; the specific label encoding scheme is shown in Table 1.
Step 303: The decoding (generation) sub-network consists of several 3D deconvolution layers with deconvolution kernels of size 3 × 3 × 3. It up-samples the concatenated feature, noise, and label codes and finally generates a synthetic face sequence, which is fed into the 3D discrimination network. In the 3D generation network, the encoding and decoding sub-networks are connected by residual (skip) connections so that the synthetic face clip preserves detailed spatio-temporal information. The operation of the decoding sub-network can be expressed as:
$I_{fake} = G_{de}(X, n, l \mid \theta_{gen})$  (2)
where I_fake denotes the face clip synthesized by the decoding sub-network, and θ_gen = {θ_en, θ_de} denotes the parameters of the 3D conditional generation network, comprising the encoding network parameters θ_en and the decoding network parameters θ_de.
The whole 3D encoder-decoder generation network can thus be regarded as a mapping from the input real face sequence to the output synthetic face sequence, which can be expressed as:
$I_{fake} = G(I_{real}, n, l \mid \theta_{gen})$  (3)
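A minimal PyTorch sketch of such a 3D conditional generation network is given below. Only the 512-dimensional feature vector, the 100-dimensional noise code and the use of 3D (de)convolutions come from the text above; the layer counts, channel widths, clip length, label-code width and the Tanh output range are illustrative assumptions, and the U-NET skip connections described in step 303 are omitted for brevity.

```python
import torch
import torch.nn as nn

class Encoder3D(nn.Module):
    """3D convolutional encoder: face clip (B, 3, T, 64, 64) -> 512-d feature X."""
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 512]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv3d(cin, cout, kernel_size=3, stride=(1, 2, 2), padding=1),
                       nn.BatchNorm3d(cout), nn.ReLU(inplace=True)]
        self.conv = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool3d(1)            # global average pooling

    def forward(self, clip):
        return self.pool(self.conv(clip)).flatten(1)   # (B, 512)

class Decoder3D(nn.Module):
    """3D deconvolutional decoder: [X | noise n | label l] -> synthetic clip I_fake."""
    def __init__(self, n_frames=16, noise_dim=100, label_dim=16):
        super().__init__()
        self.n_frames = n_frames
        self.fc = nn.Linear(512 + noise_dim + label_dim, 512 * n_frames * 4 * 4)
        chans = [512, 256, 128, 64]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.ConvTranspose3d(cin, cout, kernel_size=(3, 4, 4),
                                          stride=(1, 2, 2), padding=1),
                       nn.BatchNorm3d(cout), nn.ReLU(inplace=True)]
        layers += [nn.ConvTranspose3d(64, 3, kernel_size=(3, 4, 4),
                                      stride=(1, 2, 2), padding=1),
                   nn.Tanh()]                          # clips assumed normalized to [-1, 1]
        self.deconv = nn.Sequential(*layers)

    def forward(self, feat, noise, label):
        x = torch.cat([feat, noise, label], dim=1)     # conditioning by concatenation
        x = self.fc(x).view(-1, 512, self.n_frames, 4, 4)
        return self.deconv(x)                          # (B, 3, T, 64, 64)
```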
Step 304: The 3D discrimination network takes as input both the real face sequence I_real and the synthetic face sequence I_fake. Its structure is similar to the encoding sub-network of the 3D conditional generation network: it consists of multiple 3D convolutional layers followed by a max-mean pooling layer, a fully connected layer, and a softmax layer. The 3D discrimination network adopts a multi-task learning strategy: it extracts a short-term spatio-temporal feature representation, discriminates whether a face sequence is real or synthetic, and classifies the fatigue-related state information. The operation of the 3D discrimination network can be expressed as:
$F = D(I \mid \theta_{dis})$  (4)
$score = \mathrm{softmax}(F \mid \theta_{cls})$  (5)
where I = {I_real, I_fake} is the input of the 3D discrimination network, comprising the real face sequence I_real and the synthetic face sequence I_fake; θ_dis are the parameters of the 3D discrimination network; F is the 512-dimensional spatio-temporal feature representation; softmax(·|·) denotes the softmax classification operation; θ_cls are the parameters of the softmax classifier; and score denotes the classification scores, comprising the realness judgment score of the face sequence and the fatigue-related state classification results.
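A possible PyTorch sketch of this multi-task 3D discrimination network is shown below; the shared backbone, the 512-dimensional feature and the two heads (realness and fatigue-related states) follow the description above, while the layer configuration and the number of state classes n_states are assumptions.

```python
import torch
import torch.nn as nn

class Discriminator3D(nn.Module):
    """Multi-task 3D discriminator: one shared 3D-conv backbone, two output heads."""
    def __init__(self, n_states=10):
        super().__init__()
        chans = [3, 64, 128, 256, 512]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv3d(cin, cout, kernel_size=3, stride=(1, 2, 2), padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
        self.backbone = nn.Sequential(*layers, nn.AdaptiveMaxPool3d(1), nn.Flatten())
        self.realness = nn.Linear(512, 1)         # real vs. synthetic score
        self.fatigue = nn.Linear(512, n_states)   # fatigue-related state logits

    def forward(self, clip):                      # clip: (B, 3, T, 64, 64)
        feat = self.backbone(clip)                # 512-d spatio-temporal feature F
        return self.realness(feat), self.fatigue(feat), feat
```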
Step 305: The 3D generation network learns short-term spatio-temporal information and generates a synthetic face sequence for a given input face sequence and its corresponding state labels l = {l_drow, l_ill, l_gla, l_eye, l_mou, l_head}. The 3D generation network involves the following training tasks:
(1) The 3D generation network synthesizes face sequences I_fake whose authenticity the 3D discrimination network cannot determine, which can be expressed as:
$\mathcal{L}_{adv}^{G} = -\log D_{realness}\big(G(I_{real}, n, l)\big)$  (6)
where G(·) denotes the face sequence synthesized by the generation network, D_realness(·) denotes the realness score, and $\mathcal{L}_{adv}^{G}$ denotes the adversarial loss of the generation network.
(2) Through a regression loss, the 3D generation network pulls the output I_fake close to the input I_real, similar to an auto-encoding neural network. The loss can be expressed as:
$\mathcal{L}_{reg}^{G} = \left\| I_{real} - I_{fake} \right\|_{2}$  (7)
where ||·||_2 denotes the two-norm distance between the real face sequence and the synthetic face sequence. The regression loss $\mathcal{L}_{reg}^{G}$ improves the realism of the synthetic face sequence and thereby enhances the performance of the 3D discrimination network.
(3) For the face sequence I_fake synthesized by the 3D generation network, the 3D discrimination network should accurately classify the short-term fatigue state information; the softmax classifier is optimized with a cross-entropy loss, which can be expressed as:
$\mathcal{L}_{cls}^{G} = -\sum_{j} \alpha_{j}\, l_{j} \log\big(score_{j}^{fake}\big)$  (8)
where $score_{j}^{fake}$ denotes the classification score of the j-th fatigue-related state and α_j is a weight parameter for fatigue states of different attributes.
The training loss of the 3D generation network is a weighted combination of the losses of the different learning tasks, and the final loss function can be expressed as:
$\mathcal{L}^{G} = \lambda_{adv}\mathcal{L}_{adv}^{G} + \lambda_{reg}\mathcal{L}_{reg}^{G} + \lambda_{cls}\mathcal{L}_{cls}^{G}$  (9)
where λ_adv, λ_reg and λ_cls are the weight parameters of the different losses in the 3D generation network.
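The three generator terms can be combined in PyTorch roughly as follows; this is only a sketch under the assumption that the adversarial term uses a binary cross-entropy formulation and that the fatigue-related states are collapsed into a single categorical target, and the loss weights lam_* are placeholders rather than values from the patent.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake_realness_logit, i_real, i_fake, fake_state_logits, state_target,
                   class_weights=None, lam_adv=1.0, lam_reg=1.0, lam_cls=1.0):
    """Weighted combination of the generator losses (adversarial, regression, classification)."""
    # adversarial term: the synthetic clip should be scored as real by the discriminator
    l_adv = F.binary_cross_entropy_with_logits(
        fake_realness_logit, torch.ones_like(fake_realness_logit))
    # regression term: two-norm distance between the real and the synthetic clip
    l_reg = torch.norm(i_real - i_fake, p=2)
    # classification term: weighted cross-entropy on the fatigue-related states
    l_cls = F.cross_entropy(fake_state_logits, state_target, weight=class_weights)
    return lam_adv * l_adv + lam_reg * l_reg + lam_cls * l_cls
```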
Step 306: The 3D discrimination network can be regarded as a multi-task 3D convolutional neural network and mainly involves the following two tasks:
(1) The 3D discrimination network should correctly distinguish the real face sequence I_real from the synthetic face sequence I_fake; the loss can be expressed as:
$\mathcal{L}_{adv}^{D} = -\log D_{realness}(I_{real}) - \log\big(1 - D_{realness}(I_{fake})\big)$  (10)
The adversarial loss of the 3D discrimination network thus consists of two cost terms: the classification loss on real face sequences and the classification loss on synthetic face sequences.
(2) The 3D discrimination network should correctly classify the short-term fatigue state information of the real face sequence samples I_real; the cross-entropy loss of the softmax classifier can be expressed as:
$\mathcal{L}_{cls}^{D} = -\sum_{j} \alpha_{j}\, l_{j} \log\big(score_{j}^{real}\big)$  (11)
where $score_{j}^{real}$ denotes the classification score of the j-th fatigue-related state and α_j is a weight parameter for fatigue states of different attributes.
The training loss of the 3D discrimination network is a weighted combination of the losses of the different learning tasks, and the final loss function can be expressed as:
$\mathcal{L}^{D} = \mu_{adv}\mathcal{L}_{adv}^{D} + \mu_{cls}\mathcal{L}_{cls}^{D}$  (12)
where μ_adv and μ_cls are the weight parameters of the different losses in the 3D discrimination network.
Step 307: Train the 3D conditional generative adversarial network. The network model is built with the PyTorch open-source framework, and the whole training process runs on an Intel Core i7 server with an NVIDIA TITAN X GPU under the Ubuntu 18.04 operating system. Network parameters are optimized with the Adam algorithm. During the initial K training epochs, the 3D generative adversarial network is used only to generate face sequences and to discriminate their authenticity, i.e. the classification loss weights λ_cls and μ_cls are set to 0; the weight parameters are then adjusted so that the network also extracts spatio-temporal features and classifies the short-term fatigue state information.
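A sketch of the discriminator-side losses and of this two-phase weighting is given below; the binary cross-entropy form of the adversarial term, the single categorical fatigue target and the values of K, mu_adv and mu_cls are assumptions, not values taken from the patent. In training, both networks would then be optimized with torch.optim.Adam, multiplying the classification terms by cls_weight(epoch).

```python
import torch
import torch.nn.functional as F

def discriminator_loss(real_realness_logit, fake_realness_logit, real_state_logits,
                       state_target, class_weights=None, mu_adv=1.0, mu_cls=1.0):
    """Discriminator losses: adversarial term over real and synthetic clips plus
    weighted cross-entropy on the fatigue-related states of the real clips."""
    l_adv = (F.binary_cross_entropy_with_logits(
                 real_realness_logit, torch.ones_like(real_realness_logit)) +
             F.binary_cross_entropy_with_logits(
                 fake_realness_logit, torch.zeros_like(fake_realness_logit)))
    l_cls = F.cross_entropy(real_state_logits, state_target, weight=class_weights)
    return mu_adv * l_adv + mu_cls * l_cls

def cls_weight(epoch, K=5):
    """Two-phase schedule: classification terms are switched off for the first K epochs."""
    return 0.0 if epoch < K else 1.0
```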
Step 4: Train a bidirectional long short-term memory network to obtain long-term spatio-temporal correlation information and produce the final fatigue classification.
Step 401: The long short-term memory (LSTM) cell is the basic unit of the recurrent neural network structure, as shown in fig. 3. The LSTM unit comprises a memory cell and three control gates: an input gate, a forget gate, and an output gate. The input gate i(t) modulates the input z(t) of the LSTM unit, the memory cell c(t) records the current memory state, and the output h(t) of the LSTM unit is determined jointly by the forget gate f(t) and the output gate o(t). For N consecutive frames in the video, the short-term spatio-temporal feature representations X(t), t = 1, ..., N, are obtained through steps 2 and 3.
The bidirectional long short-term memory network takes the consecutive short-term spatio-temporal features extracted in step 3 as input and outputs a fatigue score for each frame. The operation of a unidirectional LSTM can be expressed as:
$i(t) = \sigma\big(W_{i} X(t) + R_{i} h(t-1) + b_{i}\big)$  (13)
$f(t) = \sigma\big(W_{f} X(t) + R_{f} h(t-1) + b_{f}\big)$  (14)
$o(t) = \sigma\big(W_{o} X(t) + R_{o} h(t-1) + b_{o}\big)$  (15)
$z(t) = \tanh\big(W_{z} X(t) + R_{z} h(t-1) + b_{z}\big)$  (16)
$c(t) = f(t) \odot c(t-1) + i(t) \odot z(t)$  (17)
$h(t) = o(t) \odot \tanh\big(c(t)\big)$  (18)
where W denotes the weight matrix of the current input, R denotes the weight matrix of the previous output, and b denotes the bias (threshold) term; σ is the sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes the element-wise product. The output of the LSTM unit depends on both the current state and the previous state, i.e. spatio-temporal fusion across the sequence is achieved.
Step 402: The bidirectional long short-term memory network comprises a forward LSTM unit and a backward LSTM unit, whose outputs are $\overrightarrow{h}(t)$ and $\overleftarrow{h}(t)$, respectively. The final fatigue score, i.e. the fusion of the forward and backward LSTM outputs, can be expressed as:
$Y(t) = \overrightarrow{h}(t) \oplus \overleftarrow{h}(t)$  (19)
where ⊕ denotes element-wise addition of the matrix elements and Y(t) denotes the finally output fatigue score. The overall structure of the bidirectional long short-term memory network is shown in fig. 4.
Step 403: Train the bidirectional long short-term memory network. The network model is built with the PyTorch open-source framework, and the training process runs on an Intel Core i7 server with an NVIDIA TITAN X GPU under the Ubuntu 18.04 operating system. The short-term spatio-temporal features output in step 3 are fed into the bidirectional long short-term memory network, which outputs the final fatigue scores.
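A compact PyTorch sketch of this bidirectional fusion is shown below; the hidden size and the number of output classes are assumptions, and the forward and backward halves of nn.LSTM's output are added element-wise to mirror equation (19).

```python
import torch
import torch.nn as nn

class BiLSTMFatigue(nn.Module):
    """Bidirectional LSTM head: per-clip 512-d features in, per-frame fatigue scores out."""
    def __init__(self, feat_dim=512, hidden=256, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, feats):                       # feats: (B, N, 512)
        out, _ = self.lstm(feats)                   # (B, N, 2*hidden): [forward | backward]
        h = self.lstm.hidden_size
        fused = out[..., :h] + out[..., h:]         # element-wise fusion of both directions
        return self.head(fused)                     # (B, N, n_classes)
```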
Step 5: Test the proposed fatigue detection method based on generative adversarial and long short-term memory networks; the overall framework is shown in FIG. 5. Given a test video, the face sequence is obtained through step 2, and the 3D conditional generative adversarial network model trained in step 3 is used to obtain the short-term spatio-temporal feature representations. The bidirectional long short-term memory network trained in step 4 then performs long-term spatio-temporal feature fusion and finally outputs the fatigue recognition result for each frame of the video.
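Tying the pieces together, the inference pipeline can be sketched as follows, reusing the helpers sketched above (extract_face_sequence, Discriminator3D as the short-term feature extractor, and BiLSTMFatigue); the clip length is an assumption, and for brevity the sketch produces one decision per clip rather than per frame.

```python
import torch

@torch.no_grad()
def detect_fatigue(video_path, discriminator, bilstm, clip_len=16):
    """Run the full pipeline on one video and return a fatigue decision per clip."""
    faces = extract_face_sequence(video_path)          # step 2: face crops (H, W, 3) per frame
    feats = []
    for t in range(0, len(faces) - clip_len + 1, clip_len):
        clip = torch.stack([torch.from_numpy(f).permute(2, 0, 1).float() / 255.0
                            for f in faces[t:t + clip_len]], dim=1)  # (3, T, 64, 64)
        _, _, feat = discriminator(clip.unsqueeze(0))   # step 3: 512-d short-term feature
        feats.append(feat.squeeze(0))
    scores = bilstm(torch.stack(feats).unsqueeze(0))    # step 4: long-term fusion, (1, N, C)
    return scores.argmax(dim=-1).squeeze(0)             # fatigue class per clip
```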
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a sample graph of the fatigue driving dataset in the present invention;
FIG. 3 is a schematic diagram of the 3D conditional generative adversarial network of the present invention;
FIG. 4 is a schematic diagram of an LSTM unit of the present invention;
FIG. 5 is a schematic diagram of the bidirectional long short-term memory network structure of the present invention;
FIG. 6 is a schematic diagram of the fatigue detection algorithm framework in the present invention.
Detailed Description
The present invention will be further described with reference to the following detailed description and the accompanying drawings, it being understood that the preferred embodiments described herein are merely illustrative and explanatory of the invention and are not restrictive thereof.
The method comprises the following specific implementation steps:
Step 1: Acquire a driver fatigue detection dataset. The invention uses the publicly available NTHU-DDD fatigue detection dataset, which contains 360 training videos (722,223 frames) and 20 test videos (173,259 frames), as shown in FIG. 1. All videos were recorded with an infrared camera in an indoor simulated-driving environment, and the participants were recorded in both normal driving and fatigued driving under different conditions. The scene conditions include: daytime without glasses, daytime with glasses, daytime with sunglasses, night without glasses, and night with glasses. The recorded videos have a resolution of 640 × 480 and a frame rate of 30 fps. Each video in the dataset has four annotation files recording the fatigue state of every frame, including the eye state (normal, closed), the mouth state (normal, yawning, talking), and the head state (normal, not looking ahead, head drooping).
In the invention, all 360 training videos of the dataset are used to train the 3D conditional generative adversarial network and the bidirectional long short-term memory network, and the remaining 20 videos are used for model testing.
Step 2: Design a face detection and tracking algorithm. In a fatigue detection system, the fatigue state is determined entirely from the face region, while the background region is redundant for fatigue detection. The invention combines detection and tracking to obtain the face region of each frame in the video: in the initial frame, the face is detected with the open-source MTCNN algorithm, and in subsequent frames the face region is tracked with a kernelized correlation filter (KCF).
Step 3: Train the 3D conditional generative adversarial network. The network model consists of a 3D encoder-decoder generation network and a 3D discrimination network, as shown in FIG. 2.
Step 301: The 3D encoder-decoder generation network uses a U-NET as its backbone. The network input is a three-channel face sequence of T consecutive adjacent frames with size 3 × T × 64 × 64; after 3D encoding and decoding, the output is a synthetic face sequence of the same size as the input real face sequence. In the encoding sub-network, convolution kernels of size 3 × 3 × 3 are applied in multiple 3D convolutional layers to learn a global spatio-temporal feature representation, and a global average pooling layer maps the 3D convolutional feature maps into a 512-dimensional feature vector. The operation of the encoding sub-network can be expressed as:
$X = G_{en}(I_{real} \mid \theta_{en})$  (1)
where I_real denotes the input real face sequence, θ_en denotes the parameters of the encoding sub-network, and X denotes the encoded feature vector output by the encoding sub-network.
Step 302: The label information is embedded as a condition into the feature vector output by the encoding sub-network. Specifically, a noise code n and a class label code l are concatenated with the output X of the encoding network as the input of the decoding sub-network. The noise code n is a 100-dimensional random noise vector. The label code l is the concatenation of fatigue-related class information, specifically the fatigue state label l_drow, the illumination condition label l_ill, the glasses-wearing label l_gla, the eye state label l_eye, the mouth state label l_mou, and the head state label l_head; the specific label encoding scheme is shown in Table 1.
Table 1. Fatigue state label encoding scheme
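The encoding values of Table 1 are reproduced only as an image in the original publication, so the sketch below simply assumes that each fatigue-related attribute is one-hot encoded and that the six codes are concatenated into the label code l; the per-attribute state counts are illustrative guesses based on the dataset description in step 1.

```python
import torch

STATES = {
    "drowsiness": 2,    # l_drow: normal / drowsy
    "illumination": 2,  # l_ill:  day / night
    "glasses": 3,       # l_gla:  none / glasses / sunglasses
    "eye": 2,           # l_eye:  normal / closed
    "mouth": 3,         # l_mou:  normal / yawning / talking
    "head": 3,          # l_head: normal / looking aside / drooping
}

def encode_labels(values):
    """values: dict mapping attribute name to an integer state index."""
    parts = []
    for name, n in STATES.items():
        onehot = torch.zeros(n)
        onehot[values[name]] = 1.0
        parts.append(onehot)
    return torch.cat(parts)      # label code l, later concatenated with the 100-d noise n

# example: drowsy driver at night with sunglasses, eyes closed, yawning, head drooping
l = encode_labels({"drowsiness": 1, "illumination": 1, "glasses": 2,
                   "eye": 1, "mouth": 2, "head": 2})
```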
Step 303: The decoding (generation) sub-network consists of several 3D deconvolution layers with deconvolution kernels of size 3 × 3 × 3. It up-samples the concatenated feature, noise, and label codes and finally generates a synthetic face sequence, which is fed into the 3D discrimination network. In the 3D generation network, the encoding and decoding sub-networks are connected by residual (skip) connections so that the synthetic face clip preserves detailed spatio-temporal information. The operation of the decoding sub-network can be expressed as:
$I_{fake} = G_{de}(X, n, l \mid \theta_{gen})$  (2)
where I_fake denotes the face clip synthesized by the decoding sub-network, and θ_gen = {θ_en, θ_de} denotes the parameters of the 3D conditional generation network, comprising the encoding network parameters θ_en and the decoding network parameters θ_de.
The whole 3D encoder-decoder generation network can thus be regarded as a mapping from the input real face sequence to the output synthetic face sequence, which can be expressed as:
$I_{fake} = G(I_{real}, n, l \mid \theta_{gen})$  (3)
Step 304: The 3D discrimination network takes as input both the real face sequence I_real and the synthetic face sequence I_fake. Its structure is similar to the encoding sub-network of the 3D conditional generation network: it consists of multiple 3D convolutional layers followed by a max-mean pooling layer, a fully connected layer, and a softmax layer. The 3D discrimination network adopts a multi-task learning strategy: it extracts a short-term spatio-temporal feature representation, discriminates whether a face sequence is real or synthetic, and classifies the fatigue-related state information. The operation of the 3D discrimination network can be expressed as:
$F = D(I \mid \theta_{dis})$  (4)
$score = \mathrm{softmax}(F \mid \theta_{cls})$  (5)
where I = {I_real, I_fake} is the input of the 3D discrimination network, comprising the real face sequence I_real and the synthetic face sequence I_fake; θ_dis are the parameters of the 3D discrimination network; F is the 512-dimensional spatio-temporal feature representation; softmax(·|·) denotes the softmax classification operation; θ_cls are the parameters of the softmax classifier; and score denotes the classification scores, comprising the realness judgment score of the face sequence and the fatigue-related state classification results.
Step 305: The 3D generation network learns short-term spatio-temporal information and generates a synthetic face sequence for a given input face sequence and its corresponding state labels l = {l_drow, l_ill, l_gla, l_eye, l_mou, l_head}. The 3D generation network involves the following training tasks:
(1) The 3D generation network synthesizes face sequences I_fake whose authenticity the 3D discrimination network cannot determine, which can be expressed as:
$\mathcal{L}_{adv}^{G} = -\log D_{realness}\big(G(I_{real}, n, l)\big)$  (6)
where G(·) denotes the face sequence synthesized by the generation network, D_realness(·) denotes the realness score, and $\mathcal{L}_{adv}^{G}$ denotes the adversarial loss of the generation network.
(2) Through a regression loss, the 3D generation network pulls the output I_fake close to the input I_real, similar to an auto-encoding neural network. The loss can be expressed as:
$\mathcal{L}_{reg}^{G} = \left\| I_{real} - I_{fake} \right\|_{2}$  (7)
where ||·||_2 denotes the two-norm distance between the real face sequence and the synthetic face sequence. The regression loss $\mathcal{L}_{reg}^{G}$ improves the realism of the synthetic face sequence and thereby enhances the performance of the 3D discrimination network.
(3) For the face sequence I_fake synthesized by the 3D generation network, the 3D discrimination network should accurately classify the short-term fatigue state information; the softmax classifier is optimized with a cross-entropy loss, which can be expressed as:
$\mathcal{L}_{cls}^{G} = -\sum_{j} \alpha_{j}\, l_{j} \log\big(score_{j}^{fake}\big)$  (8)
where $score_{j}^{fake}$ denotes the classification score of the j-th fatigue-related state and α_j is a weight parameter for fatigue states of different attributes.
The training loss of the 3D generation network is a weighted combination of the losses of the different learning tasks, and the final loss function can be expressed as:
$\mathcal{L}^{G} = \lambda_{adv}\mathcal{L}_{adv}^{G} + \lambda_{reg}\mathcal{L}_{reg}^{G} + \lambda_{cls}\mathcal{L}_{cls}^{G}$  (9)
where λ_adv, λ_reg and λ_cls are the weight parameters of the different losses in the 3D generation network.
Step 306: The 3D discrimination network can be regarded as a multi-task 3D convolutional neural network and mainly involves the following two tasks:
(1) The 3D discrimination network should correctly distinguish the real face sequence I_real from the synthetic face sequence I_fake; the loss can be expressed as:
$\mathcal{L}_{adv}^{D} = -\log D_{realness}(I_{real}) - \log\big(1 - D_{realness}(I_{fake})\big)$  (10)
The adversarial loss of the 3D discrimination network thus consists of two cost terms: the classification loss on real face sequences and the classification loss on synthetic face sequences.
(2) The 3D discrimination network should correctly classify the short-term fatigue state information of the real face sequence samples I_real; the cross-entropy loss of the softmax classifier can be expressed as:
$\mathcal{L}_{cls}^{D} = -\sum_{j} \alpha_{j}\, l_{j} \log\big(score_{j}^{real}\big)$  (11)
where $score_{j}^{real}$ denotes the classification score of the j-th fatigue-related state and α_j is a weight parameter for fatigue states of different attributes.
The training loss of the 3D discrimination network is a weighted combination of the losses of the different learning tasks, and the final loss function can be expressed as:
$\mathcal{L}^{D} = \mu_{adv}\mathcal{L}_{adv}^{D} + \mu_{cls}\mathcal{L}_{cls}^{D}$  (12)
where μ_adv and μ_cls are the weight parameters of the different losses in the 3D discrimination network.
Step 307: Train the 3D conditional generative adversarial network. The network model is built with the PyTorch open-source framework, and the whole training process runs on an Intel Core i7 server with an NVIDIA TITAN X GPU under the Ubuntu 18.04 operating system. Network parameters are optimized with the Adam algorithm. During the initial K training epochs, the 3D generative adversarial network is used only to generate face sequences and to discriminate their authenticity, i.e. the classification loss weights λ_cls and μ_cls are set to 0; the weight parameters are then adjusted so that the network also extracts spatio-temporal features and classifies the short-term fatigue state information.
Step 4: Train a bidirectional long short-term memory network to obtain long-term spatio-temporal correlation information and produce the final fatigue classification.
Step 401: The long short-term memory (LSTM) cell is the basic unit of the recurrent neural network structure, as shown in fig. 3. The LSTM unit comprises a memory cell and three control gates: an input gate, a forget gate, and an output gate. The input gate i(t) modulates the input z(t) of the LSTM unit, the memory cell c(t) records the current memory state, and the output h(t) of the LSTM unit is determined jointly by the forget gate f(t) and the output gate o(t). For N consecutive frames in the video, the short-term spatio-temporal feature representations X(t), t = 1, ..., N, are obtained through steps 2 and 3.
The bidirectional long short-term memory network takes the consecutive short-term spatio-temporal features extracted in step 3 as input and outputs a fatigue score for each frame. The operation of a unidirectional LSTM can be expressed as:
$i(t) = \sigma\big(W_{i} X(t) + R_{i} h(t-1) + b_{i}\big)$  (13)
$f(t) = \sigma\big(W_{f} X(t) + R_{f} h(t-1) + b_{f}\big)$  (14)
$o(t) = \sigma\big(W_{o} X(t) + R_{o} h(t-1) + b_{o}\big)$  (15)
$z(t) = \tanh\big(W_{z} X(t) + R_{z} h(t-1) + b_{z}\big)$  (16)
$c(t) = f(t) \odot c(t-1) + i(t) \odot z(t)$  (17)
$h(t) = o(t) \odot \tanh\big(c(t)\big)$  (18)
where W denotes the weight matrix of the current input, R denotes the weight matrix of the previous output, and b denotes the bias (threshold) term; σ is the sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes the element-wise product. The output of the LSTM unit depends on both the current state and the previous state, i.e. spatio-temporal fusion across the sequence is achieved.
Step 402: The bidirectional long short-term memory network comprises a forward LSTM unit and a backward LSTM unit, whose outputs are $\overrightarrow{h}(t)$ and $\overleftarrow{h}(t)$, respectively. The final fatigue score, i.e. the fusion of the forward and backward LSTM outputs, can be expressed as:
$Y(t) = \overrightarrow{h}(t) \oplus \overleftarrow{h}(t)$  (19)
where ⊕ denotes element-wise addition of the matrix elements and Y(t) denotes the finally output fatigue score. The overall structure of the bidirectional long short-term memory network is shown in fig. 4.
Step 403: Train the bidirectional long short-term memory network. The network model is built with the PyTorch open-source framework, and the training process runs on an Intel Core i7 server with an NVIDIA TITAN X GPU under the Ubuntu 18.04 operating system. The short-term spatio-temporal features output in step 3 are fed into the bidirectional long short-term memory network, which outputs the final fatigue scores.
Step 5: Test the proposed fatigue detection method based on generative adversarial and long short-term memory networks; the overall framework is shown in FIG. 5. Given a test video, the face sequence is obtained through step 2, and the 3D conditional generative adversarial network model trained in step 3 is used to obtain the short-term spatio-temporal feature representations. The bidirectional long short-term memory network trained in step 4 then performs long-term spatio-temporal feature fusion and finally outputs the fatigue recognition result for each frame of the video.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (1)

1. A fatigue detection method based on generative adversarial and long short-term memory networks, characterized by comprising the following steps:
step 1: acquiring a driver fatigue detection dataset: the publicly available NTHU-DDD fatigue detection dataset is used, which comprises 360 training videos and 20 test videos; all videos are recorded by an infrared camera in an indoor simulated driving environment, and participants are recorded in both normal driving and fatigue driving under different conditions, the scene conditions comprising: daytime without glasses, daytime with glasses, daytime with sunglasses, night without glasses, and night with glasses; the recorded videos have a resolution of 640 × 480 and a frame rate of 30 fps; each video in the dataset has four annotation files recording the fatigue state of each frame, including the eye state: normal, closed; the mouth state: normal, yawning, talking; and the head state: normal, not looking ahead, head drooping; all 360 training videos of the dataset are used to train the 3D conditional generative adversarial network and the bidirectional long short-term memory network, and the remaining 20 videos are used for model testing;
step 2: designing a face detection and tracking algorithm: the face region of each frame in the video is obtained by a combination of detection and tracking; in the initial frame of the video the face is detected with the open-source MTCNN algorithm, and in subsequent frames the face region is tracked with a kernelized correlation filter;
step 3: training the 3D conditional generative adversarial network, whose network model consists of a 3D encoder-decoder generation network and a 3D discrimination network, with the following specific steps:
step 301: the 3D encoder-decoder generation network uses a U-NET as its backbone; the network input is a three-channel face sequence of T consecutive adjacent frames with size 3 × T × 64 × 64, and after 3D encoding and decoding the output is a synthetic face sequence with the same size as the input real face sequence; in the encoding sub-network, convolution kernels of size 3 × 3 × 3 are applied in multiple 3D convolutional layers to learn a global spatio-temporal feature representation, and a global average pooling layer maps the 3D convolutional feature maps into a 512-dimensional feature vector; the operation of the encoding sub-network can be expressed as:
$X = G_{en}(I_{real} \mid \theta_{en})$  (1)
where I_real denotes the input real face sequence, θ_en denotes the parameters of the encoding sub-network, and X denotes the encoded feature vector output by the encoding sub-network;
step 302: the label information is embedded as a condition into the feature vector output by the encoding sub-network; specifically, a noise code n and a class label code l are concatenated with the output X of the encoding network as the input of the decoding sub-network, wherein the noise code n is a 100-dimensional random noise vector and the label code l is the concatenation of fatigue-related class information, specifically the fatigue state label l_drow, the illumination condition label l_ill, the glasses-wearing label l_gla, the eye state label l_eye, the mouth state label l_mou, and the head state label l_head;
step 303: the decoding (generation) sub-network consists of several 3D deconvolution layers with deconvolution kernels of size 3 × 3 × 3; it up-samples the concatenated feature, noise, and label codes and finally generates a synthetic face sequence, which is fed into the 3D discrimination network; in the 3D generation network, the encoding and decoding sub-networks are connected by residual connections so that the synthetic face clip preserves detailed spatio-temporal information; the operation of the decoding sub-network can be expressed as:
$I_{fake} = G_{de}(X, n, l \mid \theta_{gen})$  (2)
where I_fake denotes the face clip synthesized by the decoding sub-network, and θ_gen = {θ_en, θ_de} denotes the parameters of the 3D conditional generation network, comprising the encoding network parameters θ_en and the decoding network parameters θ_de;
the whole 3D encoder-decoder generation network can thus be regarded as a mapping from the input real face sequence to the output synthetic face sequence, which can be expressed as:
$I_{fake} = G(I_{real}, n, l \mid \theta_{gen})$  (3)
step 304: the 3D discrimination network takes as input both the real face sequence I_real and the synthetic face sequence I_fake; the operation of the 3D discrimination network can be expressed as:
$F = D(I \mid \theta_{dis})$  (4)
$score = \mathrm{softmax}(F \mid \theta_{cls})$  (5)
where I = {I_real, I_fake} is the input of the 3D discrimination network, comprising the real face sequence I_real and the synthetic face sequence I_fake, θ_dis are the parameters of the 3D discrimination network, F is the 512-dimensional spatio-temporal feature representation, softmax(·|·) denotes the softmax classification operation, θ_cls are the parameters of the softmax classifier, and score denotes the classification scores, comprising the realness judgment score of the face sequence and the fatigue-related states;
step 305: the 3D generation network learns short-term spatio-temporal information and generates a synthetic face sequence for a given input face sequence and its corresponding state labels l = {l_drow, l_ill, l_gla, l_eye, l_mou, l_head}; the 3D generation network involves the following training tasks:
(1) the 3D generation network synthesizes face sequences I_fake whose authenticity the 3D discrimination network cannot determine, which can be expressed as:
$\mathcal{L}_{adv}^{G} = -\log D_{realness}\big(G(I_{real}, n, l)\big)$  (6)
where G(·) denotes the face sequence synthesized by the generation network, D_realness(·) denotes the realness score, and $\mathcal{L}_{adv}^{G}$ denotes the adversarial loss of the generation network;
(2) through a regression loss, the 3D generation network pulls the output I_fake close to the input I_real, similar to an auto-encoding neural network; the loss can be expressed as:
$\mathcal{L}_{reg}^{G} = \left\| I_{real} - I_{fake} \right\|_{2}$  (7)
where ||·||_2 denotes the two-norm distance between the real face sequence and the synthetic face sequence;
(3) for the face sequence I_fake synthesized by the 3D generation network, the 3D discrimination network should accurately classify the short-term fatigue state information; the softmax classifier is optimized with a cross-entropy loss, which can be expressed as:
$\mathcal{L}_{cls}^{G} = -\sum_{j} \alpha_{j}\, l_{j} \log\big(score_{j}^{fake}\big)$  (8)
where $score_{j}^{fake}$ denotes the classification score of the j-th fatigue-related state and α_j is a weight parameter for fatigue states of different attributes;
the training loss of the 3D generation network is a weighted combination of the losses of the different learning tasks, and the final loss function can be expressed as:
$\mathcal{L}^{G} = \lambda_{adv}\mathcal{L}_{adv}^{G} + \lambda_{reg}\mathcal{L}_{reg}^{G} + \lambda_{cls}\mathcal{L}_{cls}^{G}$  (9)
where λ_adv, λ_reg and λ_cls are the weight parameters of the different losses in the 3D generation network;
step 306: the 3D discrimination network can be regarded as a multi-task 3D convolutional neural network and mainly involves the following two tasks:
(1) the 3D discrimination network should correctly distinguish the real face sequence I_real from the synthetic face sequence I_fake; the loss can be expressed as:
$\mathcal{L}_{adv}^{D} = -\log D_{realness}(I_{real}) - \log\big(1 - D_{realness}(I_{fake})\big)$  (10)
the adversarial loss of the 3D discrimination network thus consists of two cost terms: the classification loss on real face sequences and the classification loss on synthetic face sequences;
(2) the 3D discrimination network should correctly classify the short-term fatigue state information of the real face sequence samples I_real; the cross-entropy loss of the softmax classifier can be expressed as:
$\mathcal{L}_{cls}^{D} = -\sum_{j} \alpha_{j}\, l_{j} \log\big(score_{j}^{real}\big)$  (11)
where $score_{j}^{real}$ denotes the classification score of the j-th fatigue-related state and α_j is a weight parameter for fatigue states of different attributes;
the training loss of the 3D discrimination network is a weighted combination of the losses of the different learning tasks, and the final loss function can be expressed as:
$\mathcal{L}^{D} = \mu_{adv}\mathcal{L}_{adv}^{D} + \mu_{cls}\mathcal{L}_{cls}^{D}$  (12)
where μ_adv and μ_cls are the weight parameters of the different losses in the 3D discrimination network;
step 307: training the 3D conditional generative adversarial network: a network model is built with the PyTorch open-source framework, the whole training process runs on an Intel Core i7 server with an NVIDIA TITAN X GPU under the Ubuntu 18.04 operating system, and network parameters are optimized with the Adam algorithm; during the initial K training epochs the 3D generative adversarial network is used only to generate face sequences and to discriminate their authenticity, i.e. the classification loss weights λ_cls and μ_cls are set to 0, and the weight parameters are then adjusted to extract spatio-temporal features and classify the short-term fatigue state information;
step 4: training a bidirectional long short-term memory network to obtain long-term spatio-temporal correlation information and produce the final fatigue classification, with the following specific steps:
step 401: the long short-term memory (LSTM) cell is the basic unit of the recurrent neural network structure; the LSTM unit comprises a memory cell and three control gates: an input gate, a forget gate, and an output gate; the input gate i(t) modulates the input z(t) of the LSTM unit, the memory cell c(t) records the current memory state, and the output h(t) of the LSTM unit is determined jointly by the forget gate f(t) and the output gate o(t); for N consecutive frames in the video, the short-term spatio-temporal feature representations X(t), t = 1, ..., N, are obtained through steps 2 and 3; the bidirectional long short-term memory network takes the consecutive short-term spatio-temporal features extracted in step 3 as input and outputs a fatigue score for each frame; the operation of a unidirectional LSTM can be expressed as:
$i(t) = \sigma\big(W_{i} X(t) + R_{i} h(t-1) + b_{i}\big)$  (13)
$f(t) = \sigma\big(W_{f} X(t) + R_{f} h(t-1) + b_{f}\big)$  (14)
$o(t) = \sigma\big(W_{o} X(t) + R_{o} h(t-1) + b_{o}\big)$  (15)
$z(t) = \tanh\big(W_{z} X(t) + R_{z} h(t-1) + b_{z}\big)$  (16)
$c(t) = f(t) \odot c(t-1) + i(t) \odot z(t)$  (17)
$h(t) = o(t) \odot \tanh\big(c(t)\big)$  (18)
where W denotes the weight matrix of the current input, R denotes the weight matrix of the previous output, and b denotes the bias term; σ is the sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes the element-wise product; the output of the LSTM unit depends on both the current state and the previous state, i.e. spatio-temporal fusion across the sequence is achieved;
step 402: the bidirectional long short-term memory network comprises a forward LSTM unit and a backward LSTM unit, whose outputs are $\overrightarrow{h}(t)$ and $\overleftarrow{h}(t)$, respectively; the final fatigue score, i.e. the fusion of the forward and backward LSTM outputs, can be expressed as:
$Y(t) = \overrightarrow{h}(t) \oplus \overleftarrow{h}(t)$  (19)
where ⊕ denotes element-wise addition of the matrix elements and Y(t) denotes the finally output fatigue score;
step 403: training the bidirectional long short-term memory network: a network model is built with the PyTorch open-source framework, the training process runs on an Intel Core i7 server with an NVIDIA TITAN X GPU under the Ubuntu 18.04 operating system, the input of the bidirectional long short-term memory network is the short-term spatio-temporal features output in step 3, and the output is the final fatigue score;
step 5: testing the fatigue detection method based on generative adversarial and long short-term memory networks: given a test video, the face sequence is obtained through step 2, the 3D conditional generative adversarial network model trained in step 3 is used to obtain the short-term spatio-temporal feature representations, the bidirectional long short-term memory network trained in step 4 performs long-term spatio-temporal feature fusion, and finally the fatigue recognition result of each frame of the video is output.
CN201910824620.XA 2019-09-02 2019-09-02 Driver fatigue detection method based on generation countermeasure and long-short term memory network Active CN110717389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910824620.XA CN110717389B (en) 2019-09-02 2019-09-02 Driver fatigue detection method based on generation countermeasure and long-short term memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910824620.XA CN110717389B (en) 2019-09-02 2019-09-02 Driver fatigue detection method based on generation countermeasure and long-short term memory network

Publications (2)

Publication Number Publication Date
CN110717389A true CN110717389A (en) 2020-01-21
CN110717389B CN110717389B (en) 2022-05-13

Family

ID=69210231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910824620.XA Active CN110717389B (en) 2019-09-02 2019-09-02 Driver fatigue detection method based on generation countermeasure and long-short term memory network

Country Status (1)

Country Link
CN (1) CN110717389B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695435A (en) * 2020-05-19 2020-09-22 东南大学 Driver behavior identification method based on deep hybrid coding and decoding neural network
CN111882825A (en) * 2020-06-18 2020-11-03 闽江学院 Fatigue prediction method and device based on electroencephalogram-like wave data
CN112016459A (en) * 2020-08-28 2020-12-01 上海大学 Driver action recognition method based on self-attention mechanism
CN112101103A (en) * 2020-08-07 2020-12-18 东南大学 Video driver fatigue detection method based on deep integration network
CN112329689A (en) * 2020-11-16 2021-02-05 北京科技大学 Abnormal driving behavior identification method based on graph convolution neural network under vehicle-mounted environment
CN112418409A (en) * 2020-12-14 2021-02-26 南京信息工程大学 Method for predicting time-space sequence of convolution long-short term memory network improved by using attention mechanism
WO2021197135A1 (en) * 2020-04-03 2021-10-07 The University Of Hong Kong Da-bd-lstm-dense-unet for liver lesion segmentation
CN114403878A (en) * 2022-01-20 2022-04-29 南通理工学院 Voice fatigue detection method based on deep learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334832A (en) * 2018-01-26 2018-07-27 深圳市唯特视科技有限公司 A kind of gaze estimation method based on generation confrontation network
CN108596149A (en) * 2018-05-10 2018-09-28 上海交通大学 The motion sequence generation method for generating network is fought based on condition
CN109124625A (en) * 2018-09-04 2019-01-04 大连理工大学 A kind of driver fatigue state horizontal mipmap method
CN109770925A (en) * 2019-02-03 2019-05-21 闽江学院 A kind of fatigue detection method based on depth time-space network
CN109820525A (en) * 2019-01-23 2019-05-31 五邑大学 A kind of driving fatigue recognition methods based on CNN-LSTM deep learning model
CN109886241A (en) * 2019-03-05 2019-06-14 天津工业大学 Driver fatigue detection based on shot and long term memory network
CN110135305A (en) * 2019-04-30 2019-08-16 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and medium for fatigue strength detection

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334832A (en) * 2018-01-26 2018-07-27 深圳市唯特视科技有限公司 A kind of gaze estimation method based on generation confrontation network
CN108596149A (en) * 2018-05-10 2018-09-28 上海交通大学 The motion sequence generation method for generating network is fought based on condition
CN109124625A (en) * 2018-09-04 2019-01-04 大连理工大学 A kind of driver fatigue state horizontal mipmap method
CN109820525A (en) * 2019-01-23 2019-05-31 五邑大学 A kind of driving fatigue recognition methods based on CNN-LSTM deep learning model
CN109770925A (en) * 2019-02-03 2019-05-21 闽江学院 A kind of fatigue detection method based on depth time-space network
CN109886241A (en) * 2019-03-05 2019-06-14 天津工业大学 Driver fatigue detection based on shot and long term memory network
CN110135305A (en) * 2019-04-30 2019-08-16 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and medium for fatigue strength detection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JING-MING GUO et al.: "Driver drowsiness detection using hybrid convolutional neural network and long short-term memory", SpringerLink *
GENG Lei et al.: "Real-time driver fatigue detection based on multi-modal infrared features and deep learning", Infrared and Laser Engineering *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021197135A1 (en) * 2020-04-03 2021-10-07 The University Of Hong Kong Da-bd-lstm-dense-unet for liver lesion segmentation
CN111695435B (en) * 2020-05-19 2022-04-29 东南大学 Driver behavior identification method based on deep hybrid coding and decoding neural network
CN111695435A (en) * 2020-05-19 2020-09-22 东南大学 Driver behavior identification method based on deep hybrid coding and decoding neural network
CN111882825B (en) * 2020-06-18 2021-05-28 闽江学院 Fatigue prediction method and device based on electroencephalogram-like wave data
CN111882825A (en) * 2020-06-18 2020-11-03 闽江学院 Fatigue prediction method and device based on electroencephalogram-like wave data
CN112101103A (en) * 2020-08-07 2020-12-18 东南大学 Video driver fatigue detection method based on deep integration network
CN112101103B (en) * 2020-08-07 2022-08-09 东南大学 Video driver fatigue detection method based on deep integration network
CN112016459A (en) * 2020-08-28 2020-12-01 上海大学 Driver action recognition method based on self-attention mechanism
CN112016459B (en) * 2020-08-28 2024-01-16 上海大学 Driver action recognition method based on self-attention mechanism
CN112329689A (en) * 2020-11-16 2021-02-05 北京科技大学 Abnormal driving behavior identification method based on graph convolution neural network under vehicle-mounted environment
CN112418409A (en) * 2020-12-14 2021-02-26 南京信息工程大学 Method for predicting time-space sequence of convolution long-short term memory network improved by using attention mechanism
CN112418409B (en) * 2020-12-14 2023-08-22 南京信息工程大学 Improved convolution long-short-term memory network space-time sequence prediction method by using attention mechanism
CN114403878A (en) * 2022-01-20 2022-04-29 南通理工学院 Voice fatigue detection method based on deep learning

Also Published As

Publication number Publication date
CN110717389B (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN110717389B (en) Driver fatigue detection method based on generation countermeasure and long-short term memory network
US11783601B2 (en) Driver fatigue detection method and system based on combining a pseudo-3D convolutional neural network and an attention mechanism
Lyu et al. Long-term multi-granularity deep framework for driver drowsiness detection
CN108294759A (en) A kind of Driver Fatigue Detection based on CNN Eye state recognitions
CN106295568A (en) The mankind's naturalness emotion identification method combined based on expression and behavior bimodal
CN112101103B (en) Video driver fatigue detection method based on deep integration network
CN109543526A (en) True and false facial paralysis identifying system based on depth difference opposite sex feature
CN108345894B (en) A kind of traffic incidents detection method based on deep learning and entropy model
CN112131981B (en) Driver fatigue detection method based on skeleton data behavior recognition
KR102132407B1 (en) Method and apparatus for estimating human emotion based on adaptive image recognition using incremental deep learning
CN111860274A (en) Traffic police command gesture recognition method based on head orientation and upper half body skeleton characteristics
Ezzouhri et al. Robust deep learning-based driver distraction detection and classification
CN113065515B (en) Abnormal behavior intelligent detection method and system based on similarity graph neural network
CN111274886B (en) Deep learning-based pedestrian red light running illegal behavior analysis method and system
CN113378649A (en) Identity, position and action recognition method, system, electronic equipment and storage medium
Yan et al. Recognizing driver inattention by convolutional neural networks
CN113688761A (en) Pedestrian behavior category detection method based on image sequence
CN114022726A (en) Personnel and vehicle monitoring method and system based on capsule network
Li et al. Monitoring and alerting of crane operator fatigue using hybrid deep neural networks in the prefabricated products assembly process
CN109886102A (en) A kind of tumble behavior Spatio-temporal domain detection method based on depth image
CN114663807A (en) Smoking behavior detection method based on video analysis
CN114220158A (en) Fatigue driving detection method based on deep learning
CN113408389A (en) Method for intelligently recognizing drowsiness action of driver
Al-Shakarchy et al. Detecting abnormal movement of driver's head based on spatial-temporal features of video using deep neural network DNN
CN115308768A (en) Intelligent monitoring system under privacy environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant