CN110717389A - Driver fatigue detection method based on generation of countermeasure and long-short term memory network - Google Patents

Driver fatigue detection method based on generation of countermeasure and long-short term memory network

Info

Publication number
CN110717389A
Authority
CN
China
Prior art keywords
network
fatigue
real
face sequence
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910824620.XA
Other languages
Chinese (zh)
Other versions
CN110717389B (en)
Inventor
路小波
胡耀聪
陆明琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201910824620.XA priority Critical patent/CN110717389B/en
Publication of CN110717389A publication Critical patent/CN110717389A/en
Application granted granted Critical
Publication of CN110717389B publication Critical patent/CN110717389B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/59 - Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 - Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a driver fatigue detection method based on generative adversarial and long short-term memory networks. The network architecture consists of a 3D conditional generative adversarial network and a bidirectional long short-term memory network. The 3D conditional generative adversarial network extracts fatigue-related features from short-term video clips: the 3D generation network uses an encoder-decoder U-NET as its backbone and generates video clips conditioned on fatigue-related labels, while the 3D discrimination network takes real and synthetic clips as input and extracts short-term spatio-temporal feature representations carrying fatigue-related information. The bidirectional long short-term memory network performs long-term spatio-temporal feature fusion, captures context information between frames, and finally outputs a fatigue detection result for each frame. Compared with existing driver fatigue detection methods, the method generalizes better and achieves higher recognition accuracy, and can be used for real-time driver fatigue recognition from surveillance video. The invention has important application value in the field of traffic safety.

Description

Driver fatigue detection method based on generation of countermeasure and long-short term memory network
Technical Field
The invention belongs to the field of image processing and pattern recognition, and relates to a driver fatigue detection method based on generative adversarial and long short-term memory networks.
Background
Fatigue driving refers to driving while sleep-deprived or drowsy, and typically manifests as yawning, eye closure, and head drooping. According to a survey by the Ministry of Transport of China, about 9,000 people die of fatigue driving in China every year, accounting for 6 percent of all traffic accident deaths. Fatigue driving severely impairs the driver's attention and is a constant threat to road safety, so monitoring in real time whether the driver is fatigued is of great significance for road safety and intelligent transportation.
Early fatigue detection systems typically relied on sensors, for example biosensors that monitor the driver's heart rate, retinal signals, or brain waves. However, such sensors must be attached to the body and may themselves distract the driver. In recent years, automatic driver fatigue recognition based on computer vision has therefore become a research focus. These methods rely on an in-vehicle camera that captures the driver's face in real time and on automatic analysis of the fatigue level through facial feature extraction. However, the recognition accuracy of current algorithms is still limited, mainly because of the following difficulties:
(1) Fatigue is an abstract state, and different drivers show large intra-class variance in how fatigue manifests; common hand-crafted features struggle to characterize this state.
(2) The judgment of fatigue driving is easily affected by scene conditions, such as illumination changes and eye occlusion caused by wearing glasses.
(3) Recognizing fatigue driving relies on long-term spatio-temporal characterization; short-term spatio-temporal features alone are insufficient to judge the current fatigue state and easily cause a high false-alarm rate.
Disclosure of Invention
The invention aims to solve the above problems and provides a driver fatigue detection method based on generative adversarial and long short-term memory networks. First, a face sequence is obtained with a face detection and tracking algorithm; a 3D generative adversarial network is designed to extract short-term spatio-temporal features; a bidirectional long short-term memory network is designed to fuse the spatio-temporal features; and finally the fatigue level of the driver in each frame is output.
To achieve this purpose, the invention adopts the following method. A driver fatigue detection method based on generative adversarial and long short-term memory networks comprises the following steps:
Step 1: Acquire a driver fatigue detection dataset. The invention uses the publicly available NTHU-DDD fatigue detection dataset, which contains 360 training videos (722,223 frames) and 20 test videos (173,259 frames), as shown in FIG. 1. All videos were recorded with an infrared camera in an indoor simulated-driving environment, and the participants were recorded in both normal driving and fatigued driving under different conditions. The scene conditions include: daytime without glasses, daytime with glasses, daytime with sunglasses, night without glasses, and night with glasses. The recorded videos have a resolution of 640 × 480 and a frame rate of 30 fps. Each video in the dataset has four annotation files recording the fatigue state of every frame, including the eye state (normal, closed), the mouth state (normal, yawning, talking), and the head state (normal, not looking ahead, head drooping).
In the invention, all 360 training videos of the dataset are used to train the 3D conditional generative adversarial network and the bidirectional long short-term memory network, and the remaining 20 videos are used for model testing.
Step 2: Design a face detection and tracking algorithm. In a fatigue detection system, the fatigue state is determined entirely from the face region, while the background region is redundant for fatigue detection. The invention combines detection and tracking to obtain the face region of each frame in the video: in the initial frame, the face is detected with the open-source MTCNN algorithm, and in subsequent frames the face region is tracked with a kernelized correlation filter (KCF). A minimal sketch of this detect-then-track loop is given below.
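The following sketch illustrates the detect-then-track scheme of step 2, assuming OpenCV's KCF tracker (from opencv-contrib-python) for the tracking stage; detect_face() is a hypothetical wrapper around an MTCNN detector and is not part of the patent text. Depending on the OpenCV build, the tracker constructor may live under cv2.legacy.TrackerKCF_create instead.

```python
import cv2

def detect_face(frame):
    """Hypothetical wrapper around an MTCNN face detector.
    Should return a bounding box (x, y, w, h) or None."""
    raise NotImplementedError

def extract_face_sequence(video_path, crop_size=64):
    """Detect the face in the first usable frame, then track it with KCF."""
    cap = cv2.VideoCapture(video_path)
    tracker, faces = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if tracker is None:                       # initial frame: detect
            box = detect_face(frame)
            if box is None:
                continue
            tracker = cv2.TrackerKCF_create()
            tracker.init(frame, box)
        else:                                     # subsequent frames: track
            ok, box = tracker.update(frame)
            if not ok:                            # tracking lost: fall back to detection
                tracker = None
                continue
        x, y, w, h = [int(v) for v in box]
        crop = frame[y:y + h, x:x + w]
        faces.append(cv2.resize(crop, (crop_size, crop_size)))
    cap.release()
    return faces                                  # list of per-frame face crops
```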
Step 3: Train the 3D conditional generative adversarial network. The network model consists of a 3D encoder-decoder generation network and a 3D discrimination network, as shown in FIG. 2.
Step 301: The 3D encoder-decoder generation network uses a U-NET as its backbone. The network input is a three-channel face sequence of T consecutive adjacent frames with size 3 × T × 64 × 64; after 3D encoding and decoding, the output is a synthetic face sequence of the same size as the input real face sequence. In the encoding sub-network, convolution kernels of size 3 × 3 × 3 are applied in multiple 3D convolutional layers to learn a global spatio-temporal feature representation, and a global average pooling layer maps the 3D convolutional feature maps into a 512-dimensional feature vector. The operation of the encoding sub-network can be expressed as:
$X = G_{en}(I_{real} \mid \theta_{en})$  (1)
where I_real denotes the input real face sequence, θ_en denotes the parameters of the encoding sub-network, and X denotes the encoded feature vector output by the encoding sub-network.
Step 302: The label information is embedded as a condition into the feature vector output by the encoding sub-network. Specifically, a noise code n and a class label code l are concatenated with the output X of the encoding network as the input of the decoding sub-network. The noise code n is a 100-dimensional random noise vector. The label code l is the concatenation of fatigue-related class information, specifically the fatigue state label l_drow, the illumination condition label l_ill, the glasses-wearing label l_gla, the eye state label l_eye, the mouth state label l_mou, and the head state label l_head; the specific label encoding scheme is shown in Table 1.
Step 303: The decoding (generation) sub-network consists of several 3D deconvolution layers with deconvolution kernels of size 3 × 3 × 3. It up-samples the concatenated feature, noise, and label codes and finally generates a synthetic face sequence, which is fed into the 3D discrimination network. In the 3D generation network, the encoding and decoding sub-networks are connected by residual (skip) connections so that the synthetic face clip preserves detailed spatio-temporal information. The operation of the decoding sub-network can be expressed as:
$I_{fake} = G_{de}(X, n, l \mid \theta_{gen})$  (2)
where I_fake denotes the face clip synthesized by the decoding sub-network, and θ_gen = {θ_en, θ_de} denotes the parameters of the 3D conditional generation network, comprising the encoding network parameters θ_en and the decoding network parameters θ_de.
The whole 3D encoder-decoder generation network can thus be regarded as a mapping from the input real face sequence to the output synthetic face sequence, which can be expressed as:
$I_{fake} = G(I_{real}, n, l \mid \theta_{gen})$  (3)
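A minimal PyTorch sketch of such a 3D conditional generation network is given below. Only the 512-dimensional feature vector, the 100-dimensional noise code and the use of 3D (de)convolutions come from the text above; the layer counts, channel widths, clip length, label-code width and the Tanh output range are illustrative assumptions, and the U-NET skip connections described in step 303 are omitted for brevity.

```python
import torch
import torch.nn as nn

class Encoder3D(nn.Module):
    """3D convolutional encoder: face clip (B, 3, T, 64, 64) -> 512-d feature X."""
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 512]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv3d(cin, cout, kernel_size=3, stride=(1, 2, 2), padding=1),
                       nn.BatchNorm3d(cout), nn.ReLU(inplace=True)]
        self.conv = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool3d(1)            # global average pooling

    def forward(self, clip):
        return self.pool(self.conv(clip)).flatten(1)   # (B, 512)

class Decoder3D(nn.Module):
    """3D deconvolutional decoder: [X | noise n | label l] -> synthetic clip I_fake."""
    def __init__(self, n_frames=16, noise_dim=100, label_dim=16):
        super().__init__()
        self.n_frames = n_frames
        self.fc = nn.Linear(512 + noise_dim + label_dim, 512 * n_frames * 4 * 4)
        chans = [512, 256, 128, 64]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.ConvTranspose3d(cin, cout, kernel_size=(3, 4, 4),
                                          stride=(1, 2, 2), padding=1),
                       nn.BatchNorm3d(cout), nn.ReLU(inplace=True)]
        layers += [nn.ConvTranspose3d(64, 3, kernel_size=(3, 4, 4),
                                      stride=(1, 2, 2), padding=1),
                   nn.Tanh()]                          # clips assumed normalized to [-1, 1]
        self.deconv = nn.Sequential(*layers)

    def forward(self, feat, noise, label):
        x = torch.cat([feat, noise, label], dim=1)     # conditioning by concatenation
        x = self.fc(x).view(-1, 512, self.n_frames, 4, 4)
        return self.deconv(x)                          # (B, 3, T, 64, 64)
```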
Step 304: The 3D discrimination network takes as input both the real face sequence I_real and the synthetic face sequence I_fake. Its structure is similar to the encoding sub-network of the 3D conditional generation network: it consists of multiple 3D convolutional layers followed by a max-mean pooling layer, a fully connected layer, and a softmax layer. The 3D discrimination network adopts a multi-task learning strategy: it extracts a short-term spatio-temporal feature representation, discriminates whether a face sequence is real or synthetic, and classifies the fatigue-related state information. The operation of the 3D discrimination network can be expressed as:
$F = D(I \mid \theta_{dis})$  (4)
$score = \mathrm{softmax}(F \mid \theta_{cls})$  (5)
where I = {I_real, I_fake} is the input of the 3D discrimination network, comprising the real face sequence I_real and the synthetic face sequence I_fake; θ_dis are the parameters of the 3D discrimination network; F is the 512-dimensional spatio-temporal feature representation; softmax(·|·) denotes the softmax classification operation; θ_cls are the parameters of the softmax classifier; and score denotes the classification scores, comprising the realness judgment score of the face sequence and the fatigue-related state classification results.
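A possible PyTorch sketch of this multi-task 3D discrimination network is shown below; the shared backbone, the 512-dimensional feature and the two heads (realness and fatigue-related states) follow the description above, while the layer configuration and the number of state classes n_states are assumptions.

```python
import torch
import torch.nn as nn

class Discriminator3D(nn.Module):
    """Multi-task 3D discriminator: one shared 3D-conv backbone, two output heads."""
    def __init__(self, n_states=10):
        super().__init__()
        chans = [3, 64, 128, 256, 512]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv3d(cin, cout, kernel_size=3, stride=(1, 2, 2), padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
        self.backbone = nn.Sequential(*layers, nn.AdaptiveMaxPool3d(1), nn.Flatten())
        self.realness = nn.Linear(512, 1)         # real vs. synthetic score
        self.fatigue = nn.Linear(512, n_states)   # fatigue-related state logits

    def forward(self, clip):                      # clip: (B, 3, T, 64, 64)
        feat = self.backbone(clip)                # 512-d spatio-temporal feature F
        return self.realness(feat), self.fatigue(feat), feat
```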
Step 305: The 3D generation network learns short-term spatio-temporal information and generates a synthetic face sequence for a given input face sequence and its corresponding state labels l = {l_drow, l_ill, l_gla, l_eye, l_mou, l_head}. The 3D generation network involves the following training tasks:
(1) The 3D generation network synthesizes face sequences I_fake whose authenticity the 3D discrimination network cannot determine, which can be expressed as:
$\mathcal{L}_{adv}^{G} = -\log D_{realness}\big(G(I_{real}, n, l)\big)$  (6)
where G(·) denotes the face sequence synthesized by the generation network, D_realness(·) denotes the realness score, and $\mathcal{L}_{adv}^{G}$ denotes the adversarial loss of the generation network.
(2) Through a regression loss, the 3D generation network pulls the output I_fake close to the input I_real, similar to an auto-encoding neural network. The loss can be expressed as:
$\mathcal{L}_{reg}^{G} = \left\| I_{real} - I_{fake} \right\|_{2}$  (7)
where ||·||_2 denotes the two-norm distance between the real face sequence and the synthetic face sequence. The regression loss $\mathcal{L}_{reg}^{G}$ improves the realism of the synthetic face sequence and thereby enhances the performance of the 3D discrimination network.
(3) For the face sequence I_fake synthesized by the 3D generation network, the 3D discrimination network should accurately classify the short-term fatigue state information; the softmax classifier is optimized with a cross-entropy loss, which can be expressed as:
$\mathcal{L}_{cls}^{G} = -\sum_{j} \alpha_{j}\, l_{j} \log\big(score_{j}^{fake}\big)$  (8)
where $score_{j}^{fake}$ denotes the classification score of the j-th fatigue-related state and α_j is a weight parameter for fatigue states of different attributes.
The training loss of the 3D generation network is a weighted combination of the losses of the different learning tasks, and the final loss function can be expressed as:
$\mathcal{L}^{G} = \lambda_{adv}\mathcal{L}_{adv}^{G} + \lambda_{reg}\mathcal{L}_{reg}^{G} + \lambda_{cls}\mathcal{L}_{cls}^{G}$  (9)
where λ_adv, λ_reg and λ_cls are the weight parameters of the different losses in the 3D generation network.
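The three generator terms can be combined in PyTorch roughly as follows; this is only a sketch under the assumption that the adversarial term uses a binary cross-entropy formulation and that the fatigue-related states are collapsed into a single categorical target, and the loss weights lam_* are placeholders rather than values from the patent.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake_realness_logit, i_real, i_fake, fake_state_logits, state_target,
                   class_weights=None, lam_adv=1.0, lam_reg=1.0, lam_cls=1.0):
    """Weighted combination of the generator losses (adversarial, regression, classification)."""
    # adversarial term: the synthetic clip should be scored as real by the discriminator
    l_adv = F.binary_cross_entropy_with_logits(
        fake_realness_logit, torch.ones_like(fake_realness_logit))
    # regression term: two-norm distance between the real and the synthetic clip
    l_reg = torch.norm(i_real - i_fake, p=2)
    # classification term: weighted cross-entropy on the fatigue-related states
    l_cls = F.cross_entropy(fake_state_logits, state_target, weight=class_weights)
    return lam_adv * l_adv + lam_reg * l_reg + lam_cls * l_cls
```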
Step 306: The 3D discrimination network can be regarded as a multi-task 3D convolutional neural network and mainly involves the following two tasks:
(1) The 3D discrimination network should correctly distinguish the real face sequence I_real from the synthetic face sequence I_fake; the loss can be expressed as:
$\mathcal{L}_{adv}^{D} = -\log D_{realness}(I_{real}) - \log\big(1 - D_{realness}(I_{fake})\big)$  (10)
The adversarial loss of the 3D discrimination network thus consists of two cost terms: the classification loss on real face sequences and the classification loss on synthetic face sequences.
(2) The 3D discrimination network should correctly classify the short-term fatigue state information of the real face sequence samples I_real; the cross-entropy loss of the softmax classifier can be expressed as:
$\mathcal{L}_{cls}^{D} = -\sum_{j} \alpha_{j}\, l_{j} \log\big(score_{j}^{real}\big)$  (11)
where $score_{j}^{real}$ denotes the classification score of the j-th fatigue-related state and α_j is a weight parameter for fatigue states of different attributes.
The training loss of the 3D discrimination network is a weighted combination of the losses of the different learning tasks, and the final loss function can be expressed as:
$\mathcal{L}^{D} = \mu_{adv}\mathcal{L}_{adv}^{D} + \mu_{cls}\mathcal{L}_{cls}^{D}$  (12)
where μ_adv and μ_cls are the weight parameters of the different losses in the 3D discrimination network.
Step 307: Train the 3D conditional generative adversarial network. The network model is built with the PyTorch open-source framework, and the whole training process runs on an Intel Core i7 server with an NVIDIA TITAN X GPU under the Ubuntu 18.04 operating system. Network parameters are optimized with the Adam algorithm. During the initial K training epochs, the 3D generative adversarial network is used only to generate face sequences and to discriminate their authenticity, i.e. the classification loss weights λ_cls and μ_cls are set to 0; the weight parameters are then adjusted so that the network also extracts spatio-temporal features and classifies the short-term fatigue state information.
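A sketch of the discriminator-side losses and of this two-phase weighting is given below; the binary cross-entropy form of the adversarial term, the single categorical fatigue target and the values of K, mu_adv and mu_cls are assumptions, not values taken from the patent. In training, both networks would then be optimized with torch.optim.Adam, multiplying the classification terms by cls_weight(epoch).

```python
import torch
import torch.nn.functional as F

def discriminator_loss(real_realness_logit, fake_realness_logit, real_state_logits,
                       state_target, class_weights=None, mu_adv=1.0, mu_cls=1.0):
    """Discriminator losses: adversarial term over real and synthetic clips plus
    weighted cross-entropy on the fatigue-related states of the real clips."""
    l_adv = (F.binary_cross_entropy_with_logits(
                 real_realness_logit, torch.ones_like(real_realness_logit)) +
             F.binary_cross_entropy_with_logits(
                 fake_realness_logit, torch.zeros_like(fake_realness_logit)))
    l_cls = F.cross_entropy(real_state_logits, state_target, weight=class_weights)
    return mu_adv * l_adv + mu_cls * l_cls

def cls_weight(epoch, K=5):
    """Two-phase schedule: classification terms are switched off for the first K epochs."""
    return 0.0 if epoch < K else 1.0
```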
Step 4: Train a bidirectional long short-term memory network to obtain long-term spatio-temporal correlation information and produce the final fatigue classification.
Step 401: The long short-term memory (LSTM) cell is the basic unit of the recurrent neural network structure, as shown in fig. 3. The LSTM unit comprises a memory cell and three control gates: an input gate, a forget gate, and an output gate. The input gate i(t) modulates the input z(t) of the LSTM unit, the memory cell c(t) records the current memory state, and the output h(t) of the LSTM unit is determined jointly by the forget gate f(t) and the output gate o(t). For N consecutive frames in the video, the short-term spatio-temporal feature representations X(t), t = 1, ..., N, are obtained through steps 2 and 3.
The bidirectional long short-term memory network takes the consecutive short-term spatio-temporal features extracted in step 3 as input and outputs a fatigue score for each frame. The operation of a unidirectional LSTM can be expressed as:
$i(t) = \sigma\big(W_{i} X(t) + R_{i} h(t-1) + b_{i}\big)$  (13)
$f(t) = \sigma\big(W_{f} X(t) + R_{f} h(t-1) + b_{f}\big)$  (14)
$o(t) = \sigma\big(W_{o} X(t) + R_{o} h(t-1) + b_{o}\big)$  (15)
$z(t) = \tanh\big(W_{z} X(t) + R_{z} h(t-1) + b_{z}\big)$  (16)
$c(t) = f(t) \odot c(t-1) + i(t) \odot z(t)$  (17)
$h(t) = o(t) \odot \tanh\big(c(t)\big)$  (18)
where W denotes the weight matrix of the current input, R denotes the weight matrix of the previous output, and b denotes the bias (threshold) term; σ is the sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes the element-wise product. The output of the LSTM unit depends on both the current state and the previous state, i.e. spatio-temporal fusion across the sequence is achieved.
Step 402: The bidirectional long short-term memory network comprises a forward LSTM unit and a backward LSTM unit, whose outputs are $\overrightarrow{h}(t)$ and $\overleftarrow{h}(t)$, respectively. The final fatigue score, i.e. the fusion of the forward and backward LSTM outputs, can be expressed as:
$Y(t) = \overrightarrow{h}(t) \oplus \overleftarrow{h}(t)$  (19)
where ⊕ denotes element-wise addition of the matrix elements and Y(t) denotes the finally output fatigue score. The overall structure of the bidirectional long short-term memory network is shown in fig. 4.
Step 403: Train the bidirectional long short-term memory network. The network model is built with the PyTorch open-source framework, and the training process runs on an Intel Core i7 server with an NVIDIA TITAN X GPU under the Ubuntu 18.04 operating system. The short-term spatio-temporal features output in step 3 are fed into the bidirectional long short-term memory network, which outputs the final fatigue scores.
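A compact PyTorch sketch of this bidirectional fusion is shown below; the hidden size and the number of output classes are assumptions, and the forward and backward halves of nn.LSTM's output are added element-wise to mirror equation (19).

```python
import torch
import torch.nn as nn

class BiLSTMFatigue(nn.Module):
    """Bidirectional LSTM head: per-clip 512-d features in, per-frame fatigue scores out."""
    def __init__(self, feat_dim=512, hidden=256, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, feats):                       # feats: (B, N, 512)
        out, _ = self.lstm(feats)                   # (B, N, 2*hidden): [forward | backward]
        h = self.lstm.hidden_size
        fused = out[..., :h] + out[..., h:]         # element-wise fusion of both directions
        return self.head(fused)                     # (B, N, n_classes)
```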
Step 5: Test the proposed fatigue detection method based on generative adversarial and long short-term memory networks; the overall framework is shown in FIG. 5. Given a test video, the face sequence is obtained through step 2, and the 3D conditional generative adversarial network model trained in step 3 is used to obtain the short-term spatio-temporal feature representations. The bidirectional long short-term memory network trained in step 4 then performs long-term spatio-temporal feature fusion and finally outputs the fatigue recognition result for each frame of the video.
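Tying the pieces together, the inference pipeline can be sketched as follows, reusing the helpers sketched above (extract_face_sequence, Discriminator3D as the short-term feature extractor, and BiLSTMFatigue); the clip length is an assumption, and for brevity the sketch produces one decision per clip rather than per frame.

```python
import torch

@torch.no_grad()
def detect_fatigue(video_path, discriminator, bilstm, clip_len=16):
    """Run the full pipeline on one video and return a fatigue decision per clip."""
    faces = extract_face_sequence(video_path)          # step 2: face crops (H, W, 3) per frame
    feats = []
    for t in range(0, len(faces) - clip_len + 1, clip_len):
        clip = torch.stack([torch.from_numpy(f).permute(2, 0, 1).float() / 255.0
                            for f in faces[t:t + clip_len]], dim=1)  # (3, T, 64, 64)
        _, _, feat = discriminator(clip.unsqueeze(0))   # step 3: 512-d short-term feature
        feats.append(feat.squeeze(0))
    scores = bilstm(torch.stack(feats).unsqueeze(0))    # step 4: long-term fusion, (1, N, C)
    return scores.argmax(dim=-1).squeeze(0)             # fatigue class per clip
```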
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a sample graph of the fatigue driving dataset in the present invention;
FIG. 3 is a schematic diagram of the 3D conditional generative adversarial network of the present invention;
FIG. 4 is a schematic diagram of an LSTM unit of the present invention;
FIG. 5 is a schematic diagram of the bidirectional long short-term memory network structure of the present invention;
FIG. 6 is a schematic diagram of the fatigue detection algorithm framework in the present invention.
Detailed Description
The present invention will be further described with reference to the following detailed description and the accompanying drawings, it being understood that the preferred embodiments described herein are merely illustrative and explanatory of the invention and are not restrictive thereof.
The method comprises the following specific implementation steps:
Step 1: Acquire a driver fatigue detection dataset. The invention uses the publicly available NTHU-DDD fatigue detection dataset, which contains 360 training videos (722,223 frames) and 20 test videos (173,259 frames), as shown in FIG. 1. All videos were recorded with an infrared camera in an indoor simulated-driving environment, and the participants were recorded in both normal driving and fatigued driving under different conditions. The scene conditions include: daytime without glasses, daytime with glasses, daytime with sunglasses, night without glasses, and night with glasses. The recorded videos have a resolution of 640 × 480 and a frame rate of 30 fps. Each video in the dataset has four annotation files recording the fatigue state of every frame, including the eye state (normal, closed), the mouth state (normal, yawning, talking), and the head state (normal, not looking ahead, head drooping).
In the invention, all 360 training videos of the dataset are used to train the 3D conditional generative adversarial network and the bidirectional long short-term memory network, and the remaining 20 videos are used for model testing.
Step 2: Design a face detection and tracking algorithm. In a fatigue detection system, the fatigue state is determined entirely from the face region, while the background region is redundant for fatigue detection. The invention combines detection and tracking to obtain the face region of each frame in the video: in the initial frame, the face is detected with the open-source MTCNN algorithm, and in subsequent frames the face region is tracked with a kernelized correlation filter (KCF).
Step 3: Train the 3D conditional generative adversarial network. The network model consists of a 3D encoder-decoder generation network and a 3D discrimination network, as shown in FIG. 2.
Step 301: The 3D encoder-decoder generation network uses a U-NET as its backbone. The network input is a three-channel face sequence of T consecutive adjacent frames with size 3 × T × 64 × 64; after 3D encoding and decoding, the output is a synthetic face sequence of the same size as the input real face sequence. In the encoding sub-network, convolution kernels of size 3 × 3 × 3 are applied in multiple 3D convolutional layers to learn a global spatio-temporal feature representation, and a global average pooling layer maps the 3D convolutional feature maps into a 512-dimensional feature vector. The operation of the encoding sub-network can be expressed as:
$X = G_{en}(I_{real} \mid \theta_{en})$  (1)
where I_real denotes the input real face sequence, θ_en denotes the parameters of the encoding sub-network, and X denotes the encoded feature vector output by the encoding sub-network.
Step 302: The label information is embedded as a condition into the feature vector output by the encoding sub-network. Specifically, a noise code n and a class label code l are concatenated with the output X of the encoding network as the input of the decoding sub-network. The noise code n is a 100-dimensional random noise vector. The label code l is the concatenation of fatigue-related class information, specifically the fatigue state label l_drow, the illumination condition label l_ill, the glasses-wearing label l_gla, the eye state label l_eye, the mouth state label l_mou, and the head state label l_head; the specific label encoding scheme is shown in Table 1.
Table 1. Fatigue state label encoding scheme
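The encoding values of Table 1 are reproduced only as an image in the original publication, so the sketch below simply assumes that each fatigue-related attribute is one-hot encoded and that the six codes are concatenated into the label code l; the per-attribute state counts are illustrative guesses based on the dataset description in step 1.

```python
import torch

STATES = {
    "drowsiness": 2,    # l_drow: normal / drowsy
    "illumination": 2,  # l_ill:  day / night
    "glasses": 3,       # l_gla:  none / glasses / sunglasses
    "eye": 2,           # l_eye:  normal / closed
    "mouth": 3,         # l_mou:  normal / yawning / talking
    "head": 3,          # l_head: normal / looking aside / drooping
}

def encode_labels(values):
    """values: dict mapping attribute name to an integer state index."""
    parts = []
    for name, n in STATES.items():
        onehot = torch.zeros(n)
        onehot[values[name]] = 1.0
        parts.append(onehot)
    return torch.cat(parts)      # label code l, later concatenated with the 100-d noise n

# example: drowsy driver at night with sunglasses, eyes closed, yawning, head drooping
l = encode_labels({"drowsiness": 1, "illumination": 1, "glasses": 2,
                   "eye": 1, "mouth": 2, "head": 2})
```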
Step 303: The decoding (generation) sub-network consists of several 3D deconvolution layers with deconvolution kernels of size 3 × 3 × 3. It up-samples the concatenated feature, noise, and label codes and finally generates a synthetic face sequence, which is fed into the 3D discrimination network. In the 3D generation network, the encoding and decoding sub-networks are connected by residual (skip) connections so that the synthetic face clip preserves detailed spatio-temporal information. The operation of the decoding sub-network can be expressed as:
$I_{fake} = G_{de}(X, n, l \mid \theta_{gen})$  (2)
where I_fake denotes the face clip synthesized by the decoding sub-network, and θ_gen = {θ_en, θ_de} denotes the parameters of the 3D conditional generation network, comprising the encoding network parameters θ_en and the decoding network parameters θ_de.
The whole 3D encoder-decoder generation network can thus be regarded as a mapping from the input real face sequence to the output synthetic face sequence, which can be expressed as:
$I_{fake} = G(I_{real}, n, l \mid \theta_{gen})$  (3)
Step 304: The 3D discrimination network takes as input both the real face sequence I_real and the synthetic face sequence I_fake. Its structure is similar to the encoding sub-network of the 3D conditional generation network: it consists of multiple 3D convolutional layers followed by a max-mean pooling layer, a fully connected layer, and a softmax layer. The 3D discrimination network adopts a multi-task learning strategy: it extracts a short-term spatio-temporal feature representation, discriminates whether a face sequence is real or synthetic, and classifies the fatigue-related state information. The operation of the 3D discrimination network can be expressed as:
$F = D(I \mid \theta_{dis})$  (4)
$score = \mathrm{softmax}(F \mid \theta_{cls})$  (5)
where I = {I_real, I_fake} is the input of the 3D discrimination network, comprising the real face sequence I_real and the synthetic face sequence I_fake; θ_dis are the parameters of the 3D discrimination network; F is the 512-dimensional spatio-temporal feature representation; softmax(·|·) denotes the softmax classification operation; θ_cls are the parameters of the softmax classifier; and score denotes the classification scores, comprising the realness judgment score of the face sequence and the fatigue-related state classification results.
Step 305: The 3D generation network learns short-term spatio-temporal information and generates a synthetic face sequence for a given input face sequence and its corresponding state labels l = {l_drow, l_ill, l_gla, l_eye, l_mou, l_head}. The 3D generation network involves the following training tasks:
(1) The 3D generation network synthesizes face sequences I_fake whose authenticity the 3D discrimination network cannot determine, which can be expressed as:
$\mathcal{L}_{adv}^{G} = -\log D_{realness}\big(G(I_{real}, n, l)\big)$  (6)
where G(·) denotes the face sequence synthesized by the generation network, D_realness(·) denotes the realness score, and $\mathcal{L}_{adv}^{G}$ denotes the adversarial loss of the generation network.
(2) Through a regression loss, the 3D generation network pulls the output I_fake close to the input I_real, similar to an auto-encoding neural network. The loss can be expressed as:
$\mathcal{L}_{reg}^{G} = \left\| I_{real} - I_{fake} \right\|_{2}$  (7)
where ||·||_2 denotes the two-norm distance between the real face sequence and the synthetic face sequence. The regression loss $\mathcal{L}_{reg}^{G}$ improves the realism of the synthetic face sequence and thereby enhances the performance of the 3D discrimination network.
(3) For the face sequence I_fake synthesized by the 3D generation network, the 3D discrimination network should accurately classify the short-term fatigue state information; the softmax classifier is optimized with a cross-entropy loss, which can be expressed as:
$\mathcal{L}_{cls}^{G} = -\sum_{j} \alpha_{j}\, l_{j} \log\big(score_{j}^{fake}\big)$  (8)
where $score_{j}^{fake}$ denotes the classification score of the j-th fatigue-related state and α_j is a weight parameter for fatigue states of different attributes.
The training loss of the 3D generation network is a weighted combination of the losses of the different learning tasks, and the final loss function can be expressed as:
$\mathcal{L}^{G} = \lambda_{adv}\mathcal{L}_{adv}^{G} + \lambda_{reg}\mathcal{L}_{reg}^{G} + \lambda_{cls}\mathcal{L}_{cls}^{G}$  (9)
where λ_adv, λ_reg and λ_cls are the weight parameters of the different losses in the 3D generation network.
Step 306: The 3D discrimination network can be regarded as a multi-task 3D convolutional neural network and mainly involves the following two tasks:
(1) The 3D discrimination network should correctly distinguish the real face sequence I_real from the synthetic face sequence I_fake; the loss can be expressed as:
$\mathcal{L}_{adv}^{D} = -\log D_{realness}(I_{real}) - \log\big(1 - D_{realness}(I_{fake})\big)$  (10)
The adversarial loss of the 3D discrimination network thus consists of two cost terms: the classification loss on real face sequences and the classification loss on synthetic face sequences.
(2) The 3D discrimination network should correctly classify the short-term fatigue state information of the real face sequence samples I_real; the cross-entropy loss of the softmax classifier can be expressed as:
$\mathcal{L}_{cls}^{D} = -\sum_{j} \alpha_{j}\, l_{j} \log\big(score_{j}^{real}\big)$  (11)
where $score_{j}^{real}$ denotes the classification score of the j-th fatigue-related state and α_j is a weight parameter for fatigue states of different attributes.
The training loss of the 3D discrimination network is a weighted combination of the losses of the different learning tasks, and the final loss function can be expressed as:
$\mathcal{L}^{D} = \mu_{adv}\mathcal{L}_{adv}^{D} + \mu_{cls}\mathcal{L}_{cls}^{D}$  (12)
where μ_adv and μ_cls are the weight parameters of the different losses in the 3D discrimination network.
Step 307: Train the 3D conditional generative adversarial network. The network model is built with the PyTorch open-source framework, and the whole training process runs on an Intel Core i7 server with an NVIDIA TITAN X GPU under the Ubuntu 18.04 operating system. Network parameters are optimized with the Adam algorithm. During the initial K training epochs, the 3D generative adversarial network is used only to generate face sequences and to discriminate their authenticity, i.e. the classification loss weights λ_cls and μ_cls are set to 0; the weight parameters are then adjusted so that the network also extracts spatio-temporal features and classifies the short-term fatigue state information.
Step 4: Train a bidirectional long short-term memory network to obtain long-term spatio-temporal correlation information and produce the final fatigue classification.
Step 401: The long short-term memory (LSTM) cell is the basic unit of the recurrent neural network structure, as shown in fig. 3. The LSTM unit comprises a memory cell and three control gates: an input gate, a forget gate, and an output gate. The input gate i(t) modulates the input z(t) of the LSTM unit, the memory cell c(t) records the current memory state, and the output h(t) of the LSTM unit is determined jointly by the forget gate f(t) and the output gate o(t). For N consecutive frames in the video, the short-term spatio-temporal feature representations X(t), t = 1, ..., N, are obtained through steps 2 and 3.
The bidirectional long short-term memory network takes the consecutive short-term spatio-temporal features extracted in step 3 as input and outputs a fatigue score for each frame. The operation of a unidirectional LSTM can be expressed as:
$i(t) = \sigma\big(W_{i} X(t) + R_{i} h(t-1) + b_{i}\big)$  (13)
$f(t) = \sigma\big(W_{f} X(t) + R_{f} h(t-1) + b_{f}\big)$  (14)
$o(t) = \sigma\big(W_{o} X(t) + R_{o} h(t-1) + b_{o}\big)$  (15)
$z(t) = \tanh\big(W_{z} X(t) + R_{z} h(t-1) + b_{z}\big)$  (16)
$c(t) = f(t) \odot c(t-1) + i(t) \odot z(t)$  (17)
$h(t) = o(t) \odot \tanh\big(c(t)\big)$  (18)
where W denotes the weight matrix of the current input, R denotes the weight matrix of the previous output, and b denotes the bias (threshold) term; σ is the sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes the element-wise product. The output of the LSTM unit depends on both the current state and the previous state, i.e. spatio-temporal fusion across the sequence is achieved.
Step 402: The bidirectional long short-term memory network comprises a forward LSTM unit and a backward LSTM unit, whose outputs are $\overrightarrow{h}(t)$ and $\overleftarrow{h}(t)$, respectively. The final fatigue score, i.e. the fusion of the forward and backward LSTM outputs, can be expressed as:
$Y(t) = \overrightarrow{h}(t) \oplus \overleftarrow{h}(t)$  (19)
where ⊕ denotes element-wise addition of the matrix elements and Y(t) denotes the finally output fatigue score. The overall structure of the bidirectional long short-term memory network is shown in fig. 4.
Step 403: Train the bidirectional long short-term memory network. The network model is built with the PyTorch open-source framework, and the training process runs on an Intel Core i7 server with an NVIDIA TITAN X GPU under the Ubuntu 18.04 operating system. The short-term spatio-temporal features output in step 3 are fed into the bidirectional long short-term memory network, which outputs the final fatigue scores.
Step 5: Test the proposed fatigue detection method based on generative adversarial and long short-term memory networks; the overall framework is shown in FIG. 5. Given a test video, the face sequence is obtained through step 2, and the 3D conditional generative adversarial network model trained in step 3 is used to obtain the short-term spatio-temporal feature representations. The bidirectional long short-term memory network trained in step 4 then performs long-term spatio-temporal feature fusion and finally outputs the fatigue recognition result for each frame of the video.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (1)

1. A fatigue detection method based on generative adversarial and long short-term memory networks, characterized by comprising the following steps:
step 1: acquiring a driver fatigue detection dataset: the publicly available NTHU-DDD fatigue detection dataset is used, which comprises 360 training videos and 20 test videos; all videos are recorded by an infrared camera in an indoor simulated driving environment, and participants are recorded in both normal driving and fatigue driving under different conditions, the scene conditions comprising: daytime without glasses, daytime with glasses, daytime with sunglasses, night without glasses, and night with glasses; the recorded videos have a resolution of 640 × 480 and a frame rate of 30 fps; each video in the dataset has four annotation files recording the fatigue state of each frame, including the eye state: normal, closed; the mouth state: normal, yawning, talking; and the head state: normal, not looking ahead, head drooping; all 360 training videos of the dataset are used to train the 3D conditional generative adversarial network and the bidirectional long short-term memory network, and the remaining 20 videos are used for model testing;
step 2: designing a face detection and tracking algorithm: the face region of each frame in the video is obtained by a combination of detection and tracking; in the initial frame of the video the face is detected with the open-source MTCNN algorithm, and in subsequent frames the face region is tracked with a kernelized correlation filter;
step 3: training the 3D conditional generative adversarial network, whose network model consists of a 3D encoder-decoder generation network and a 3D discrimination network, with the following specific steps:
step 301: the 3D encoder-decoder generation network uses a U-NET as its backbone; the network input is a three-channel face sequence of T consecutive adjacent frames with size 3 × T × 64 × 64, and after 3D encoding and decoding the output is a synthetic face sequence with the same size as the input real face sequence; in the encoding sub-network, convolution kernels of size 3 × 3 × 3 are applied in multiple 3D convolutional layers to learn a global spatio-temporal feature representation, and a global average pooling layer maps the 3D convolutional feature maps into a 512-dimensional feature vector; the operation of the encoding sub-network can be expressed as:
$X = G_{en}(I_{real} \mid \theta_{en})$  (1)
where I_real denotes the input real face sequence, θ_en denotes the parameters of the encoding sub-network, and X denotes the encoded feature vector output by the encoding sub-network;
step 302: the label information is embedded as a condition into the feature vector output by the encoding sub-network; specifically, a noise code n and a class label code l are concatenated with the output X of the encoding network as the input of the decoding sub-network, wherein the noise code n is a 100-dimensional random noise vector and the label code l is the concatenation of fatigue-related class information, specifically the fatigue state label l_drow, the illumination condition label l_ill, the glasses-wearing label l_gla, the eye state label l_eye, the mouth state label l_mou, and the head state label l_head;
step 303: the decoding (generation) sub-network consists of several 3D deconvolution layers with deconvolution kernels of size 3 × 3 × 3; it up-samples the concatenated feature, noise, and label codes and finally generates a synthetic face sequence, which is fed into the 3D discrimination network; in the 3D generation network, the encoding and decoding sub-networks are connected by residual connections so that the synthetic face clip preserves detailed spatio-temporal information; the operation of the decoding sub-network can be expressed as:
$I_{fake} = G_{de}(X, n, l \mid \theta_{gen})$  (2)
where I_fake denotes the face clip synthesized by the decoding sub-network, and θ_gen = {θ_en, θ_de} denotes the parameters of the 3D conditional generation network, comprising the encoding network parameters θ_en and the decoding network parameters θ_de;
the whole 3D encoder-decoder generation network can thus be regarded as a mapping from the input real face sequence to the output synthetic face sequence, which can be expressed as:
$I_{fake} = G(I_{real}, n, l \mid \theta_{gen})$  (3)
step 304: the 3D discrimination network takes as input both the real face sequence I_real and the synthetic face sequence I_fake; the operation of the 3D discrimination network can be expressed as:
$F = D(I \mid \theta_{dis})$  (4)
$score = \mathrm{softmax}(F \mid \theta_{cls})$  (5)
where I = {I_real, I_fake} is the input of the 3D discrimination network, comprising the real face sequence I_real and the synthetic face sequence I_fake, θ_dis are the parameters of the 3D discrimination network, F is the 512-dimensional spatio-temporal feature representation, softmax(·|·) denotes the softmax classification operation, θ_cls are the parameters of the softmax classifier, and score denotes the classification scores, comprising the realness judgment score of the face sequence and the fatigue-related states;
step 305: the 3D generation network learns short-term spatio-temporal information and generates a synthetic face sequence for a given input face sequence and its corresponding state labels l = {l_drow, l_ill, l_gla, l_eye, l_mou, l_head}; the 3D generation network involves the following training tasks:
(1) the 3D generation network synthesizes face sequences I_fake whose authenticity the 3D discrimination network cannot determine, which can be expressed as:
$\mathcal{L}_{adv}^{G} = -\log D_{realness}\big(G(I_{real}, n, l)\big)$  (6)
where G(·) denotes the face sequence synthesized by the generation network, D_realness(·) denotes the realness score, and $\mathcal{L}_{adv}^{G}$ denotes the adversarial loss of the generation network;
(2) through a regression loss, the 3D generation network pulls the output I_fake close to the input I_real, similar to an auto-encoding neural network; the loss can be expressed as:
$\mathcal{L}_{reg}^{G} = \left\| I_{real} - I_{fake} \right\|_{2}$  (7)
where ||·||_2 denotes the two-norm distance between the real face sequence and the synthetic face sequence;
(3) for the face sequence I_fake synthesized by the 3D generation network, the 3D discrimination network should accurately classify the short-term fatigue state information; the softmax classifier is optimized with a cross-entropy loss, which can be expressed as:
$\mathcal{L}_{cls}^{G} = -\sum_{j} \alpha_{j}\, l_{j} \log\big(score_{j}^{fake}\big)$  (8)
where $score_{j}^{fake}$ denotes the classification score of the j-th fatigue-related state and α_j is a weight parameter for fatigue states of different attributes;
the training loss of the 3D generation network is a weighted combination of the losses of the different learning tasks, and the final loss function can be expressed as:
$\mathcal{L}^{G} = \lambda_{adv}\mathcal{L}_{adv}^{G} + \lambda_{reg}\mathcal{L}_{reg}^{G} + \lambda_{cls}\mathcal{L}_{cls}^{G}$  (9)
where λ_adv, λ_reg and λ_cls are the weight parameters of the different losses in the 3D generation network;
step 306: the 3D discrimination network can be regarded as a multi-task 3D convolutional neural network and mainly involves the following two tasks:
(1) the 3D discrimination network should correctly distinguish the real face sequence I_real from the synthetic face sequence I_fake; the loss can be expressed as:
$\mathcal{L}_{adv}^{D} = -\log D_{realness}(I_{real}) - \log\big(1 - D_{realness}(I_{fake})\big)$  (10)
the adversarial loss of the 3D discrimination network thus consists of two cost terms: the classification loss on real face sequences and the classification loss on synthetic face sequences;
(2) the 3D discrimination network should correctly classify the short-term fatigue state information of the real face sequence samples I_real; the cross-entropy loss of the softmax classifier can be expressed as:
$\mathcal{L}_{cls}^{D} = -\sum_{j} \alpha_{j}\, l_{j} \log\big(score_{j}^{real}\big)$  (11)
where $score_{j}^{real}$ denotes the classification score of the j-th fatigue-related state and α_j is a weight parameter for fatigue states of different attributes;
the training loss of the 3D discrimination network is a weighted combination of the losses of the different learning tasks, and the final loss function can be expressed as:
$\mathcal{L}^{D} = \mu_{adv}\mathcal{L}_{adv}^{D} + \mu_{cls}\mathcal{L}_{cls}^{D}$  (12)
where μ_adv and μ_cls are the weight parameters of the different losses in the 3D discrimination network;
step 307: training the 3D conditional generative adversarial network: a network model is built with the PyTorch open-source framework, the whole training process runs on an Intel Core i7 server with an NVIDIA TITAN X GPU under the Ubuntu 18.04 operating system, and network parameters are optimized with the Adam algorithm; during the initial K training epochs the 3D generative adversarial network is used only to generate face sequences and to discriminate their authenticity, i.e. the classification loss weights λ_cls and μ_cls are set to 0, and the weight parameters are then adjusted to extract spatio-temporal features and classify the short-term fatigue state information;
step 4: training a bidirectional long short-term memory network to obtain long-term spatio-temporal correlation information and produce the final fatigue classification, with the following specific steps:
step 401: the long short-term memory (LSTM) cell is the basic unit of the recurrent neural network structure; the LSTM unit comprises a memory cell and three control gates: an input gate, a forget gate, and an output gate; the input gate i(t) modulates the input z(t) of the LSTM unit, the memory cell c(t) records the current memory state, and the output h(t) of the LSTM unit is determined jointly by the forget gate f(t) and the output gate o(t); for N consecutive frames in the video, the short-term spatio-temporal feature representations X(t), t = 1, ..., N, are obtained through steps 2 and 3; the bidirectional long short-term memory network takes the consecutive short-term spatio-temporal features extracted in step 3 as input and outputs a fatigue score for each frame; the operation of a unidirectional LSTM can be expressed as:
$i(t) = \sigma\big(W_{i} X(t) + R_{i} h(t-1) + b_{i}\big)$  (13)
$f(t) = \sigma\big(W_{f} X(t) + R_{f} h(t-1) + b_{f}\big)$  (14)
$o(t) = \sigma\big(W_{o} X(t) + R_{o} h(t-1) + b_{o}\big)$  (15)
$z(t) = \tanh\big(W_{z} X(t) + R_{z} h(t-1) + b_{z}\big)$  (16)
$c(t) = f(t) \odot c(t-1) + i(t) \odot z(t)$  (17)
$h(t) = o(t) \odot \tanh\big(c(t)\big)$  (18)
where W denotes the weight matrix of the current input, R denotes the weight matrix of the previous output, and b denotes the bias term; σ is the sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes the element-wise product; the output of the LSTM unit depends on both the current state and the previous state, i.e. spatio-temporal fusion across the sequence is achieved;
step 402: the bidirectional long short-term memory network comprises a forward LSTM unit and a backward LSTM unit, whose outputs are $\overrightarrow{h}(t)$ and $\overleftarrow{h}(t)$, respectively; the final fatigue score, i.e. the fusion of the forward and backward LSTM outputs, can be expressed as:
$Y(t) = \overrightarrow{h}(t) \oplus \overleftarrow{h}(t)$  (19)
where ⊕ denotes element-wise addition of the matrix elements and Y(t) denotes the finally output fatigue score;
step 403: training the bidirectional long short-term memory network: a network model is built with the PyTorch open-source framework, the training process runs on an Intel Core i7 server with an NVIDIA TITAN X GPU under the Ubuntu 18.04 operating system, the input of the bidirectional long short-term memory network is the short-term spatio-temporal features output in step 3, and the output is the final fatigue score;
step 5: testing the fatigue detection method based on generative adversarial and long short-term memory networks: given a test video, the face sequence is obtained through step 2, the 3D conditional generative adversarial network model trained in step 3 is used to obtain the short-term spatio-temporal feature representations, the bidirectional long short-term memory network trained in step 4 performs long-term spatio-temporal feature fusion, and finally the fatigue recognition result of each frame of the video is output.
CN201910824620.XA 2019-09-02 2019-09-02 Driver fatigue detection method based on generation countermeasure and long-short term memory network Active CN110717389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910824620.XA CN110717389B (en) 2019-09-02 2019-09-02 Driver fatigue detection method based on generation countermeasure and long-short term memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910824620.XA CN110717389B (en) 2019-09-02 2019-09-02 Driver fatigue detection method based on generation countermeasure and long-short term memory network

Publications (2)

Publication Number Publication Date
CN110717389A true CN110717389A (en) 2020-01-21
CN110717389B CN110717389B (en) 2022-05-13

Family

ID=69210231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910824620.XA Active CN110717389B (en) 2019-09-02 2019-09-02 Driver fatigue detection method based on generation countermeasure and long-short term memory network

Country Status (1)

Country Link
CN (1) CN110717389B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695435A (en) * 2020-05-19 2020-09-22 东南大学 Driver behavior identification method based on deep hybrid coding and decoding neural network
CN111882825A (en) * 2020-06-18 2020-11-03 闽江学院 Fatigue prediction method and device based on electroencephalogram-like wave data
CN112016459A (en) * 2020-08-28 2020-12-01 上海大学 Driver action recognition method based on self-attention mechanism
CN112101103A (en) * 2020-08-07 2020-12-18 东南大学 Video driver fatigue detection method based on deep integration network
CN112329689A (en) * 2020-11-16 2021-02-05 北京科技大学 Abnormal driving behavior identification method based on graph convolution neural network under vehicle-mounted environment
CN112418409A (en) * 2020-12-14 2021-02-26 南京信息工程大学 Method for predicting time-space sequence of convolution long-short term memory network improved by using attention mechanism
WO2021197135A1 (en) * 2020-04-03 2021-10-07 The University Of Hong Kong Da-bd-lstm-dense-unet for liver lesion segmentation
CN114403878A (en) * 2022-01-20 2022-04-29 南通理工学院 Voice fatigue detection method based on deep learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334832A (en) * 2018-01-26 2018-07-27 深圳市唯特视科技有限公司 A kind of gaze estimation method based on generation confrontation network
CN108596149A (en) * 2018-05-10 2018-09-28 上海交通大学 The motion sequence generation method for generating network is fought based on condition
CN109124625A (en) * 2018-09-04 2019-01-04 大连理工大学 A kind of driver fatigue state horizontal mipmap method
CN109770925A (en) * 2019-02-03 2019-05-21 闽江学院 A kind of fatigue detection method based on depth time-space network
CN109820525A (en) * 2019-01-23 2019-05-31 五邑大学 A kind of driving fatigue recognition methods based on CNN-LSTM deep learning model
CN109886241A (en) * 2019-03-05 2019-06-14 天津工业大学 Driver fatigue detection based on shot and long term memory network
CN110135305A (en) * 2019-04-30 2019-08-16 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and medium for fatigue strength detection

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334832A (en) * 2018-01-26 2018-07-27 深圳市唯特视科技有限公司 A kind of gaze estimation method based on generation confrontation network
CN108596149A (en) * 2018-05-10 2018-09-28 上海交通大学 The motion sequence generation method for generating network is fought based on condition
CN109124625A (en) * 2018-09-04 2019-01-04 大连理工大学 A kind of driver fatigue state horizontal mipmap method
CN109820525A (en) * 2019-01-23 2019-05-31 五邑大学 A kind of driving fatigue recognition methods based on CNN-LSTM deep learning model
CN109770925A (en) * 2019-02-03 2019-05-21 闽江学院 A kind of fatigue detection method based on depth time-space network
CN109886241A (en) * 2019-03-05 2019-06-14 天津工业大学 Driver fatigue detection based on shot and long term memory network
CN110135305A (en) * 2019-04-30 2019-08-16 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and medium for fatigue strength detection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JING-MING GUO et al.: "Driver drowsiness detection using hybrid convolutional neural network and long short-term memory", SpringerLink *
GENG Lei et al.: "Real-time driver fatigue detection based on multi-modal infrared features and deep learning", Infrared and Laser Engineering *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021197135A1 (en) * 2020-04-03 2021-10-07 The University Of Hong Kong Da-bd-lstm-dense-unet for liver lesion segmentation
CN111695435B (en) * 2020-05-19 2022-04-29 东南大学 Driver behavior identification method based on deep hybrid coding and decoding neural network
CN111695435A (en) * 2020-05-19 2020-09-22 东南大学 Driver behavior identification method based on deep hybrid coding and decoding neural network
CN111882825B (en) * 2020-06-18 2021-05-28 闽江学院 Fatigue prediction method and device based on electroencephalogram-like wave data
CN111882825A (en) * 2020-06-18 2020-11-03 闽江学院 Fatigue prediction method and device based on electroencephalogram-like wave data
CN112101103A (en) * 2020-08-07 2020-12-18 东南大学 Video driver fatigue detection method based on deep integration network
CN112101103B (en) * 2020-08-07 2022-08-09 东南大学 Video driver fatigue detection method based on deep integration network
CN112016459A (en) * 2020-08-28 2020-12-01 上海大学 Driver action recognition method based on self-attention mechanism
CN112016459B (en) * 2020-08-28 2024-01-16 上海大学 Driver action recognition method based on self-attention mechanism
CN112329689A (en) * 2020-11-16 2021-02-05 北京科技大学 Abnormal driving behavior identification method based on graph convolution neural network under vehicle-mounted environment
CN112418409A (en) * 2020-12-14 2021-02-26 南京信息工程大学 Method for predicting time-space sequence of convolution long-short term memory network improved by using attention mechanism
CN112418409B (en) * 2020-12-14 2023-08-22 南京信息工程大学 Improved convolution long-short-term memory network space-time sequence prediction method by using attention mechanism
CN114403878A (en) * 2022-01-20 2022-04-29 南通理工学院 Voice fatigue detection method based on deep learning

Also Published As

Publication number Publication date
CN110717389B (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN110717389B (en) Driver fatigue detection method based on generation countermeasure and long-short term memory network
US11783601B2 (en) Driver fatigue detection method and system based on combining a pseudo-3D convolutional neural network and an attention mechanism
Lyu et al. Long-term multi-granularity deep framework for driver drowsiness detection
CN108294759A (en) A kind of Driver Fatigue Detection based on CNN Eye state recognitions
CN106295568A (en) The mankind's naturalness emotion identification method combined based on expression and behavior bimodal
CN112101103B (en) Video driver fatigue detection method based on deep integration network
CN109543526A (en) True and false facial paralysis identifying system based on depth difference opposite sex feature
CN108345894B (en) A kind of traffic incidents detection method based on deep learning and entropy model
CN112131981B (en) Driver fatigue detection method based on skeleton data behavior recognition
KR102132407B1 (en) Method and apparatus for estimating human emotion based on adaptive image recognition using incremental deep learning
CN111860274A (en) Traffic police command gesture recognition method based on head orientation and upper half body skeleton characteristics
Ezzouhri et al. Robust deep learning-based driver distraction detection and classification
CN113065515B (en) Abnormal behavior intelligent detection method and system based on similarity graph neural network
CN111274886B (en) Deep learning-based pedestrian red light running illegal behavior analysis method and system
CN113378649A (en) Identity, position and action recognition method, system, electronic equipment and storage medium
Yan et al. Recognizing driver inattention by convolutional neural networks
CN113688761A (en) Pedestrian behavior category detection method based on image sequence
CN114022726A (en) Personnel and vehicle monitoring method and system based on capsule network
Li et al. Monitoring and alerting of crane operator fatigue using hybrid deep neural networks in the prefabricated products assembly process
CN109886102A (en) A kind of tumble behavior Spatio-temporal domain detection method based on depth image
CN114663807A (en) Smoking behavior detection method based on video analysis
CN114220158A (en) Fatigue driving detection method based on deep learning
CN113408389A (en) Method for intelligently recognizing drowsiness action of driver
Al-Shakarchy et al. Detecting abnormal movement of driver's head based on spatial-temporal features of video using deep neural network DNN
CN115308768A (en) Intelligent monitoring system under privacy environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant