CN114303177A - System and method for generating video data sets with different fatigue degrees through transfer learning - Google Patents

System and method for generating video data sets with different fatigue degrees through transfer learning

Info

Publication number
CN114303177A
Authority
CN
China
Prior art keywords
image
facial expression
images
fatigue
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980097422.6A
Other languages
Chinese (zh)
Inventor
贾程程
杨磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN114303177A publication Critical patent/CN114303177A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/167 Detection; Localisation; Normalisation using comparisons between temporally consecutive images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Abstract

The present disclosure relates to techniques for training an application to identify driver fatigue. Facial expression images are reconstructed from a first representation of the images learned by a first neural network. Based on the first representation, a second neural network generates an image expressing the current degree of fatigue from images generated at previous intervals or degrees. The images are reconstructed from the first representation and a second representation learned by the second neural network, and intermediate images of interpolated video data are generated from corresponding optical flows of the images, wherein the optical flows are formed by fusing the images together in a time frame between the images. A false fatigue state video of the driver is compiled from the data to train an application to detect driver fatigue.

Description

System and method for generating video data sets with different fatigue degrees through transfer learning
Technical Field
The present disclosure relates generally to detection of driver fatigue, and in particular to generating video data sets for training applications for use in identifying when a driver is tired.
Background
Driver fatigue or drowsiness is increasingly becoming a common cause of vehicle accidents. Driver drowsiness detection and monitoring is critical to ensuring a safe driving environment, not only for the drowsy driver, but also for other drivers nearby who may be affected by the drowsy driver. Vehicles with the ability to monitor drivers can take steps to prevent, or assist in preventing, accidents due to driver drowsiness. For example, a warning system may be activated to alert the driver that he or she is drowsy, or automatic functions, such as braking and steering, may be activated to control the vehicle until the driver is no longer drowsy. However, there are few common data sets that can train an application to perform such detection and monitoring for a particular driver, where each driver has his or her own ability to withstand various levels of fatigue and shows different indicators of various levels of drowsiness. Therefore, if the drowsiness state of the driver is determined according to a single criterion, the driver detection and monitoring system may be over-responsive or under-responsive, which does not improve driver safety.
Disclosure of Invention
According to one aspect of the present disclosure, there is a computer-implemented method for training an application to identify driver fatigue: generating a plurality of first facial expression images from a plurality of second facial expression images using a first neural network, wherein the plurality of first facial expression images are reconstructed from first representations of the plurality of second facial expression images learned from the first neural network; generating a first image expressing a current fatigue degree from a third facial expression image and a second image expressing a fatigue degree before the current fatigue degree using a second neural network based on the first representation, wherein the first image and the second image are reconstructed from the first representation and a second representation of the third facial expression image learned from the second neural network; generating a plurality of intermediate images of interpolated video data from the first image and the second image during respective optical flows, wherein the optical flows are formed by fusing the first image and the second image and are located in a time frame between the first image and the second image; and compiling a false fatigue state video of a driver using at least the first and second images and the plurality of intermediate images of the interpolated video data to train the application therein to detect the driver fatigue.
Optionally, in any preceding aspect, wherein the first neural network performs the steps of: mapping the plurality of second facial expression images to respective first representations; and mapping the respective first representations to the plurality of first facial expression images having the same expression as the plurality of second facial expression images.
Optionally, in any preceding aspect, wherein the second neural network comprises a conditional variational auto-encoder performing the steps of: encoding the third facial expression image and the second image and outputting a parameter describing a distribution of each dimension of the second representation; and decoding the distribution for each dimension of the second representation by calculating a relationship of each parameter with respect to output loss to reconstruct the third facial expression image and the second image.
Optionally, in any preceding aspect, wherein the second neural network further comprises a Generative Adversarial Network (GAN) that performs the following steps: comparing the reconstructed image to the third facial expression image to generate a discriminator loss; comparing the reconstructed image with ground truth images at the same degree to generate a reconstruction loss; predicting a likelihood that the reconstructed image has an appearance corresponding to the third facial expression image based on the discriminator loss and the reconstruction loss; and, when the prediction classifies the first image as a real image, outputting the reconstructed image as the first image expressing a current fatigue degree and inputting the reconstructed image to the conditional variational auto-encoder as the second image expressing a fatigue degree before the current fatigue degree.
Optionally, in any preceding aspect, wherein the reconstruction loss indicates a degree of dissimilarity between the third facial expression image and the reconstructed image, and the discriminator loss indicates a cost of generating an incorrect prediction that the reconstructed image has the appearance of the third facial expression image.
Optionally, in any preceding aspect, the computer-implemented method further comprises: iteratively generating the first image at different degrees of fatigue according to differences between the first image and the second image at different time frames until a total value of the reconstruction loss and the discriminator loss meets a predetermined criterion.
Optionally, in any preceding aspect, wherein generating the plurality of intermediate images further comprises: predicting an intermediate image between the first image and the second image during the respective optical flows; and interpolating the first image and the second image to generate the respective optical flows to generate the false fatigue state video of the driver therein.
Optionally, in any preceding aspect, wherein generating the plurality of intermediate images further comprises: receiving a sequence of intermediate images arranged in an input order; processing the sequence of intermediate images using an encoder to convert the sequence of intermediate images into an alternative representation of the sequence of intermediate images; and processing the alternative representation of the sequence of intermediate images using a decoder to generate a target sequence of the sequence of intermediate images, the target sequence comprising a plurality of outputs arranged according to an output order.
Optionally, in any preceding aspect, wherein the first representation maps the plurality of second facial expression images to the first representation by a learning distribution.
Optionally, in any preceding aspect, wherein the second representation maps the third facial expression image to the second representation through a learning distribution.
According to one other aspect of the present disclosure, there is provided an apparatus for training an application to identify driver fatigue, the apparatus comprising: a non-transitory memory comprising instructions; and one or more processors in communication with the memory, wherein the one or more processors execute the instructions to: generating a plurality of first facial expression images from a plurality of second facial expression images using a first neural network, wherein the plurality of first facial expression images are reconstructed from first representations of the plurality of second facial expression images learned from the first neural network; generating a first image expressing a current fatigue degree from a third facial expression image and a second image expressing a fatigue degree before the current fatigue degree using a second neural network based on the first representation, wherein the first image and the second image are reconstructed from the first representation and a second representation of the third facial expression image learned from the second neural network; generating a plurality of intermediate images of interpolated video data from the first image and the second image during respective optical flows, wherein the optical flows are formed by fusing the first image and the second image and are located in a time frame between the first image and the second image; and compiling a false fatigue state video of a driver using at least the first and second images and the plurality of intermediate images of the interpolated video data to train the application therein to detect the driver fatigue.
According to yet another aspect of the disclosure, there is a non-transitory computer readable medium storing instructions for training an application to identify driver fatigue, which when executed by one or more processors, cause the one or more processors to perform the steps of: generating a plurality of first facial expression images from a plurality of second facial expression images using a first neural network, wherein the plurality of first facial expression images are reconstructed from first representations of the plurality of second facial expression images learned from the first neural network; generating a first image expressing a current fatigue degree from a third facial expression image and a second image expressing a fatigue degree before the current fatigue degree using a second neural network based on the first representation, wherein the first image and the second image are reconstructed from the first representation and a second representation of the third facial expression image learned from the second neural network; generating a plurality of intermediate images of interpolated video data from the first image and the second image during respective optical flows, wherein the optical flows are formed by fusing the first image and the second image and are located in a time frame between the first image and the second image; and compiling a false fatigue state video of a driver using at least the first and second images and the plurality of intermediate images of the interpolated video data to train the application therein to detect the driver fatigue.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.
Drawings
Aspects of the present disclosure are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements.
FIG. 1A illustrates a driver monitoring system in accordance with embodiments of the present technique.
FIG. 1B illustrates a detailed example of the driver monitoring system according to FIG. 1A.
Fig. 2 illustrates an example of an expression recognition network.
Fig. 3 illustrates an exemplary facial fatigue level generator network.
Fig. 4A illustrates a video prediction and interpolation network.
Fig. 4B illustrates an exemplary frame interpolation network according to fig. 4A.
Fig. 4C illustrates an example of the video prediction and interpolation network of fig. 4A with an expanded view of the LSTM auto-encoder.
Fig. 5A-5D illustrate example flow diagrams in accordance with embodiments of the present technique.
FIG. 6 illustrates a computing system upon which embodiments of the present disclosure may be implemented.
Detailed Description
The present disclosure will now be described with reference to the accompanying drawings, which generally relate to driver fatigue detection.
The present technology relates to driver fatigue detection for a particular driver, where fatigue is also referred to as drowsiness or tiredness, using an application trained from a false fatigue state video data set. Conventional data sets for training applications to detect driver fatigue are typically based on a common data set that is not specific to a single driver. Typically, this results in the application detecting driver fatigue in the absence of driver fatigue, or not detecting driver fatigue in the presence of driver fatigue. In an embodiment, an individualized set of false fatigue state video data associated with a particular or single driver is generated by the disclosed techniques. These data sets are generated by interpolating a sequence of images and predicting the next frame or sequence of images using various machine learning techniques and neural networks.
It should be understood that the present embodiments of the disclosure may be embodied in many different forms and that the scope of the claims should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the embodiments of the invention to those skilled in the art. Indeed, the present disclosure is intended to cover alternatives, modifications, and equivalents of these embodiments, which may be included within the scope and spirit of the disclosure as defined by the appended claims. Furthermore, in the following detailed description of the present embodiments of the disclosure, numerous specific details are set forth in order to provide a thorough understanding. However, it will be apparent to one of ordinary skill in the art that the present embodiments of the disclosure may be practiced without such specific details.
FIG. 1A illustrates a driver fatigue system in accordance with embodiments of the present technology. Driver fatigue system 106 is shown mounted or otherwise included within vehicle 101, which also includes a cab in which driver 102 may sit. The driver fatigue system 106, or one or more portions thereof, may be implemented by an in-cab computer system and/or by a mobile computing device, such as, but not limited to, a smartphone, a tablet, a notebook, a laptop, etc.
According to certain embodiments of the present technique, the driver fatigue system 106 obtains (or collects) current data of the driver 102 of the vehicle 101 from one or more sensors. In other embodiments, the driver fatigue system 106 also obtains (or collects) additional information about the driver 102 from one or more databases 140, as the additional information relates to characteristics of the driver, such as facial features, historical head pose and eye gaze information, and the like. Driver fatigue system 106 analyzes current data and/or additional information of driver 102 of vehicle 101 to identify the head pose and eye gaze of the driver. In one embodiment, the driver fatigue system 106 additionally monitors and collects vehicle data and context information, as described below. Such analysis may be performed using one or more computer-implemented neural networks and/or some other computer-implemented model, as explained below.
As shown in fig. 1A, driver fatigue system 106 is communicatively coupled to capture device 103, which may be used to obtain current data for the driver of vehicle 101 as well as vehicle data and scene information. In one embodiment, the capture device 103 includes sensors and other devices for obtaining current data of the driver 102 of the vehicle 101. The captured data may be processed by the processor 104, which includes hardware and/or software to detect and track driver motion, head pose, and gaze direction. As will be described in additional detail below with reference to fig. 1B, the capture device may additionally include one or more cameras, microphones, or other sensors to capture data. In another embodiment, the capture device 103 may capture forward scenes (e.g., ambient environment and/or scene information) of a route being traveled by the vehicle. The forward facing sensors may include, for example, radar sensors, laser sensors, lidar sensors, optical imaging sensors, and the like. It should be appreciated that the sensors may also cover the sides, rear, and top (both upward and downward facing) of the vehicle 101.
In one embodiment, the capture device 103 may be external to the driver fatigue system 106, as shown in fig. 1A, or may be included as part of the driver fatigue system 106, depending on the particular implementation. Additional details of the driver fatigue system 106 in accordance with certain embodiments of the present technique are described below with reference to FIG. 1B.
Still referring to FIG. 1A, the driver fatigue system 106 is also shown communicatively coupled to various different types of vehicle-related sensors 105 included within the vehicle 101. Such sensors 105 may include, but are not limited to, a speedometer, a Global Positioning System (GPS) receiver, and a clock. Driver fatigue system 106 is also shown communicatively coupled to one or more communication networks 130 that provide access to one or more databases 140 and/or other types of data storage. The database 140 and/or other type of data store may store vehicle data for the vehicle 101. Examples of such data include, but are not limited to, driving record data, driving performance data, driver license type data, driver facial features, driver head pose, driver gaze, and the like. Such data may be stored in a local database or other data store located within vehicle 101. Alternatively, the data may be stored in one or more databases 140 or other data stores that are remotely located with respect to the vehicle 101. Accordingly, such databases 140 or other data stores may be communicatively coupled to the driver fatigue system 106 through one or more communication networks 130.
The communication network 130 may include a data network, a wireless network, a telephony network, or any combination thereof. It is contemplated that the data network may be any Local Area Network (LAN), Metropolitan Area Network (MAN), Wide Area Network (WAN), public data network (e.g., the internet), short-range wireless network, or any other suitable packet-switched network. Additionally, the wireless network may be, for example, a cellular network, and may employ various technologies, including: enhanced data rates for global evolution (EDGE), General Packet Radio Service (GPRS), global system for mobile communication (GSM), Internet protocol multimedia subsystem (IMS), Universal Mobile Telecommunications System (UMTS), etc.; and any other suitable wireless medium, such as Worldwide Interoperability for Microwave Access (WiMAX), Long Term Evolution (LTE) networks, Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), wireless fidelity (Wi-Fi), Wireless Local Area Network (WLAN), Bluetooth technology, Internet Protocol (IP) datacasting, satellite networks, mobile ad hoc networks (MANET), and so forth; or any combination thereof. The communication network 130 may provide communication capabilities between the driver fatigue system 106 and the database 140 and/or other data stores, for example, through the communication device 120 (FIG. 1B).
While the embodiment of fig. 1A is described with reference to a vehicle 101, it should be appreciated that the disclosed techniques may be applied to a wide range of technologies and are not limited to vehicles. For example, in addition to vehicles, the disclosed techniques may be used in virtual or augmented reality devices or simulators where head pose and gaze estimation, vehicle data, and/or scene information may be required.
Additional details of the driver fatigue system 106, in accordance with certain embodiments of the present technique, will now be described with reference to FIG. 1B. The driver fatigue system 106 includes a capture device 103, one or more processors 108, a vehicle system 104, a machine learning engine 109, an input/output (I/O) interface 114, a memory 116, a visual/audio warning device 118, a communication device 120, and a database 140 (which may also be part of the driver fatigue system).
The capture device 103 may be responsible for monitoring and identifying driver behavior (including fatigue) based on captured driver motion and/or audio data using one or more capture devices positioned within the cab, such as a sensor 103A, camera 103B, or microphone 103C. In one embodiment, the capture device 103 is positioned to capture the head and face movements of the driver, while in other embodiments, the torso of the driver and/or the limbs and hands of the driver are also captured. For example, the detection and tracking 108A, the head pose estimator 108B, and the gaze direction estimator 108C may monitor driver motion captured by the capture device 103 to detect a particular pose, such as a head pose, or to detect whether the driver is looking in a particular direction.
Other embodiments include audio data captured by the microphone 103C along with or separate from the driver motion data. The captured audio may be, for example, an audio signal of the driver 102 captured by the microphone 103C. The audio may be analyzed to detect various characteristics that may vary depending on the driver's state. Examples of such audio features include driver speech, passenger speech, music, and the like.
Although the capture device 103 is described as a single device having multiple components, it should be appreciated that each component (e.g., sensor, camera, microphone, etc.) may be a separate component located in different areas of the vehicle 101. For example, sensor 103A, camera 103B, microphone 103C, and depth sensor 103D may each be located in a different area of the vehicle cab. In another example, a single component of the capture device 103 may be another component or part of a device. For example, camera 103B and visual/audio 118 may be part of a mobile phone or tablet computer (not shown) placed in the vehicle cab, while sensor 103A and microphone 103C may be separately located in different locations in the vehicle cab.
The detection and tracking 108A monitors the facial features of the driver 102 captured by the capture device 103, which may then be extracted subsequent to detecting the driver's face. The term facial features include, but are not limited to, points (or facial landmarks) around the eyes, nose, and mouth regions, as well as points outlining portions of the detected face of the driver 102. Based on the monitored facial features, an initial position of one or more eye features of the eyes of the driver 102 may be detected. The eye features may include an iris and first and second corners of an eyeball. Thus, for example, detecting the location of each of the one or more eye features includes: the method includes detecting a position of an iris, detecting a position of a first canthus, and detecting a position of a second canthus.
The head pose estimator 108B uses the monitored facial features to estimate the head pose of the driver 102. As used herein, the term "head pose" describes an angle that refers to the relative orientation of the driver's head with respect to the plane of the capture device 103. In one embodiment, the head pose includes a yaw angle and a pitch angle of the driver's head relative to the capture device plane. In another embodiment, the head pose includes a yaw angle, a pitch angle, and a roll angle of the driver's head relative to the capture device plane.
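The patent does not specify how the yaw, pitch, and roll angles are computed. One common approach, shown below purely as an assumption and not as the patent's method, is to fit detected 2D facial landmarks to a generic 3D face model with a perspective-n-point solve; the model points, camera approximation, and angle convention are illustrative.

```python
import cv2
import numpy as np

# Generic 3D face-model points (millimetres), assumed reference values
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),           # nose tip
    (0.0, -330.0, -65.0),      # chin
    (-225.0, 170.0, -135.0),   # left eye outer corner
    (225.0, 170.0, -135.0),    # right eye outer corner
    (-150.0, -150.0, -125.0),  # left mouth corner
    (150.0, -150.0, -125.0),   # right mouth corner
], dtype=np.float64)

def estimate_head_pose(landmarks_2d, frame_width, frame_height):
    """Return Euler angles (degrees) of the head relative to the camera plane.
    landmarks_2d: (6, 2) array of detected 2D landmarks matching MODEL_POINTS."""
    focal = frame_width  # crude pinhole approximation of the focal length
    camera_matrix = np.array([[focal, 0, frame_width / 2],
                              [0, focal, frame_height / 2],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))  # assume no lens distortion
    ok, rvec, _ = cv2.solvePnP(MODEL_POINTS, landmarks_2d.astype(np.float64),
                               camera_matrix, dist_coeffs)
    rotation, _ = cv2.Rodrigues(rvec)
    # Decompose the rotation matrix into Euler angles; which angle is labelled
    # "yaw", "pitch", or "roll" depends on the axis convention of the face model.
    sy = np.hypot(rotation[0, 0], rotation[1, 0])
    pitch = np.degrees(np.arctan2(-rotation[2, 0], sy))
    yaw = np.degrees(np.arctan2(rotation[1, 0], rotation[0, 0]))
    roll = np.degrees(np.arctan2(rotation[2, 1], rotation[2, 2]))
    return yaw, pitch, roll
```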
The gaze direction estimator 108C estimates the gaze direction (and gaze angle) of the driver. In operation of gaze direction estimator 108C, capture device 103 may capture an image or set of images (e.g., of the vehicle driver). Capture device 103 may send an image to gaze direction estimator 108C, where gaze direction estimator 108C detects facial features from the image and tracks (e.g., over time) the driver's gaze. One such gaze direction estimator is the Smart Eye eye tracking system.
In another embodiment, the gaze direction estimator 108C may detect eyes from the captured images. For example, gaze direction estimator 108C may rely on the eye center to determine the gaze direction. In short, it may be assumed that the driver looks forward with respect to the orientation of his head. In some embodiments, the gaze direction estimator 108C provides more accurate gaze tracking by detecting pupil or iris position or using a geometric model based on the estimated head pose and the detected positions of the iris and each of the first and second canthi. Pupil and/or iris tracking enables the gaze direction estimator 108C to detect gaze directions decoupled from head pose. Drivers often visually scan the surrounding environment with little or no head movement (e.g., glancing left or right (or up or down) to better see items or objects out of their direct line of sight). These visual scans occur frequently for objects on or near the road (e.g., to view road signs, pedestrians near the road, etc.) and for objects within the vehicle cabin (e.g., to view console readings such as speed, operate a radio or other built-in device, or view/operate a personal mobile device). In some cases, the driver may glance at some or all of these objects (e.g., out of the corner of the eye) with minimal head movement. By tracking the pupil and/or iris, the gaze direction estimator 108C may detect upward, downward, and lateral saccades that would otherwise be undetectable in a system that simply tracks the position of the head.
In one embodiment, and based on the detected facial features, the gaze direction estimator 108C may cause the processor 108 to determine a gaze direction (e.g., a gaze direction of an operator on the vehicle). In some embodiments, gaze direction estimator 108C receives a series of images (and/or video). Gaze direction estimator 108C may detect facial features in multiple images (e.g., a series of images or a sequence of images). Thus, the gaze direction estimator 108C may track gaze direction over time and store such information, for example, in the database 140.
In addition to the gesture and gaze detection described above, the processor 108 may also include an image corrector 108D, a video enhancer 108E, a video scene analyzer 108F, and/or other data processing and analysis to determine scene information captured by the capture device 103.
Image corrector 108D receives the captured data and may make corrections, such as video stabilization. For example, bumps in the road may jitter, blur, or distort the data. The image corrector may stabilize the image for horizontal and/or vertical shake, and/or may correct for translation, rotation, and/or scaling.
The video enhancer 108E may perform additional enhancement or processing in the event of poor lighting or high data compression. Video processing and enhancement may include, but is not limited to, gamma correction, de-blushing, and/or de-blurring. Other video processing enhancement algorithms may be run to reduce noise in the input of the low-light video, and then use contrast enhancement techniques such as, but not limited to, tone mapping, histogram stretching and equalization, and gamma correction to recover visual information in the low-light video.
The video scene analyzer 108F may identify the content of the video from the capture device 103. For example, the content of the video may include a scene or sequence of scenes from the forward-facing camera 103B in the vehicle. Video analysis may involve a variety of techniques including, but not limited to, low-level content analysis such as feature extraction, structural analysis, object detection, and tracking, and high-level semantic analysis such as scene analysis, event detection, and video mining. For example, by identifying the content of the incoming video signal, it can be determined that: whether the vehicle 101 is traveling along a highway or within an urban area, whether there are any pedestrians, animals or other objects/obstacles on the road, etc. By performing image analysis (e.g., video scene analysis, etc.) concurrently or prior to performing image processing (e.g., image correction, video enhancement, etc.), image data may be prepared in a manner specific to the type of analysis being performed. For example, image correction to reduce blur may allow video scene analysis to be performed more accurately by cleaning the appearance of edge lines for object recognition.
The vehicle system 104 may provide signals corresponding to the output of any state of the vehicle, the vehicle surroundings, or any other source of information connected to the vehicle. Vehicle data output may include, for example: analog signals, such as current speed; digital signals provided by a single source of information, such as a clock, a thermometer, and a position sensor such as a Global Positioning System (GPS) sensor; and digital signals propagated through a vehicle data network, such as an engine Controller Area Network (CAN) bus over which engine-related information may be communicated, a climate control CAN bus over which climate control-related information may be communicated, and a multimedia data network over which multimedia data is communicated between multimedia components in the vehicle. For example, the vehicle system 104 may retrieve from the engine CAN bus the current speed of the vehicle estimated by the wheel sensors, the state of the vehicle's power based on the vehicle's battery and/or power distribution system, the ignition state of the vehicle, and so forth.
The input/output interface 114 enables information to be presented to a user and/or other components or devices using a variety of input/output devices. Examples of input devices include a keyboard, a microphone, touch functionality (e.g., capacitive or other sensors for detecting physical touches), a camera (e.g., which may employ visible or invisible wavelengths, such as infrared frequencies, to recognize motion as gestures that do not involve touch), and so forth. Examples of output devices include visual/audio warning devices 118, such as a display, a speaker, and the like. In one embodiment, an input/output (I/O) interface 114 receives driver motion data and/or audio data of the driver 102 from the capture device 103. The driver motion data may be related to, for example, the eyes and face of the driver 102, which may be analyzed by the processor 108.
The data collected by the driver fatigue system 106 may be stored in the database 140, the memory 116, or any combination thereof. In one embodiment, the collected data is from one or more sources external to vehicle 101. The stored information may be data related to driver distraction and safety, such as information captured by the capture device 103. In one embodiment, the data stored in the database 140 may be a collection of data collected for one or more drivers of the vehicle 101. In one embodiment, the collected data is head pose data of a driver of the vehicle 101. In another embodiment, the collected data is gaze direction data of the driver of the vehicle 101. The collected data may also be used to generate data sets and information that may be used to train a model for machine learning, such as machine learning engine 109.
In one embodiment, the memory 116 may store instructions executable by the processor 108, a machine learning engine 109, and programs or applications (not shown) that may be loaded and executed by the processor 108. In one embodiment, machine learning engine 109 includes executable code stored in memory 116 that is executable by processor 108 and selects one or more machine learning models stored in memory 116 (or database 140). The machine learning models may be developed and trained using, for example, a data set generated according to embodiments described below, using well-known and conventional machine learning and deep learning techniques, such as an implementation of a Convolutional Neural Network (CNN).
Fig. 2 illustrates an example of an expression recognition network. The expression recognition network 202 receives an arbitrary facial image 201A, which can be captured using capture devices, such as the camera 103B or a scanner, or obtained from an image database, such as the database 140. The facial image 201A, which may be arbitrary in nature, is processed by the expression recognition network 202, which has an auto-encoding style network architecture, to output a facial expression image 201B. The representations learned during auto-encoder learning are then used to assist in forming a data set to train a machine model, such as those described above. The machine model may then be used to generate a driver's false fatigue state video, which may be used in conjunction with the individualized data to train an application to detect driver fatigue (i.e., drowsiness or tiredness) for a particular driver.
In performing the auto-encoder learning, an arbitrary face image 201A input to the expression recognition network 202 is classified into a category or class. In one embodiment, the input arbitrary facial image 201A shows an arbitrary facial expression such as anger, fear, or a neutral expression, and the output facial expression image 201B is a facial expression or emotion image that has been classified into a category or class such as disgust, sadness, joy, or surprise.
In one example, the expression recognition network 202 generates the facial expression image 201B from the input arbitrary face image 201A using a neural network, such as an auto-encoder (AE) or a conditional variational auto-encoder (CVAE). The expression recognition network 202 is intended to efficiently learn a latent or learned representation (or code), i.e., the learned representation Z_g, from which it generates the output facial expression image 201B from the arbitrary face image 201A. For example, an arbitrary image of fear may use the mathematical representation Z_g to generate a surprised facial expression image.
Learning is done in layers (e.g., encoder and decoder layers) that are attached to the learned representation Z_g. For example, an input arbitrary face image 201A is input into the first layer (e.g., the encoder 204). The mathematical representation Z_g compresses (reduces) the size of the input arbitrary face image 201A. The input arbitrary face image 201A is reconstructed in a second layer (e.g., the decoder 206) that outputs a facial expression image 201B corresponding to the input arbitrary face image 201A. More specifically, the expression recognition network 202 is trained to encode an input arbitrary face image 201A into the mathematical representation Z_g, from which the input arbitrary face image 201A can be reconstructed. In one embodiment, encoder 204 creates the mathematical representation Z_g from z = σ(Wx + b), where "W" is the encoding weight matrix, "b" is the bias vector, "σ" is a logistic function (such as a sigmoid function or a rectified linear unit), and "x" is the input arbitrary face image 201A. The expression recognition network 202 also contains a decoder 206 that reconstructs the input arbitrary face image 201A from x' = σ'(W'z + b'), where "W'" is the decoding weight matrix, "b'" is the bias vector, "σ'" is the logistic function, and "x'" is the output facial expression image 201B. The learning includes minimizing the reconstruction error associated with encoding and decoding, i.e., min_{W,b} ||Y - X||_F.
The mathematical representation Z_g can then be used for training further machine models, as explained below with reference to fig. 3.
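For illustration only, the following is a minimal sketch of such an auto-encoder, assuming a PyTorch implementation with flattened 64x64 inputs and fully connected layers; the layer sizes, optimizer, and mean-squared-error criterion are assumptions made here and are not taken from the patent.

```python
import torch
import torch.nn as nn

class ExpressionAutoEncoder(nn.Module):
    """Toy auto-encoder: maps a face image x to a learned code z_g and back."""
    def __init__(self, image_dim=64 * 64, code_dim=128):
        super().__init__()
        # Encoder: z_g = sigma(W x + b)
        self.encoder = nn.Sequential(nn.Linear(image_dim, code_dim), nn.Sigmoid())
        # Decoder: x' = sigma'(W' z_g + b')
        self.decoder = nn.Sequential(nn.Linear(code_dim, image_dim), nn.Sigmoid())

    def forward(self, x):
        z_g = self.encoder(x)
        return self.decoder(z_g), z_g

model = ExpressionAutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()  # stands in for the Frobenius-norm reconstruction error

def train_step(batch):
    """batch: (N, 64*64) flattened face images."""
    reconstruction, _ = model(batch)
    loss = criterion(reconstruction, batch)  # minimize the reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```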
Fig. 3 illustrates a facial fatigue level generator network. The facial fatigue level generator network 302 includes a CVAE 304 and a Generative Adversarial Network (GAN) 306. The facial fatigue level generator network 302 receives content, such as a sequence of images or video, which is processed to identify whether the input content is "real" content or "fake" content.
In one embodiment, CVAE 304 is coupled to receive content that is processed to output a reconstructed version of the content. Specifically, the CVAE 304 receives the facial expression image stream F_{i-1→i}, wherein the stream F includes image frames from the (i-1)-th image frame to the i-th image frame. The facial expression image stream F_{i-1→i} includes facial expression images at different degrees L_0, L_{i-1} to L_i, wherein the image at degree L_0 represents an identification (ID) of a specific individual having a natural or neutral facial expression (e.g., the facial expression of the specific individual shown in a normal or ordinary expression state), and the images at degrees L_{i-1} and L_i represent facial expression images calculated from the previous and current degrees (i.e., calculated during previous or current iterations). In one embodiment, the facial expression images are facial fatigue images.
As shown, CVAE 304 includes an encoder 304A and a decoder (or generator) 306A. Encoder 304A receives the facial expression image stream F_{i-1→i} at the different degrees L_0, L_{i-1} to L_i, and maps each of the facial expression images to the mathematical representation z_i by learning the distribution P(z | x, c), where "c" is the category or class of the data, "x" is the image, and z is equal to z_i + z_g. That is, the stream of facial expression images is converted to the mathematical representation z_i (e.g., a feature vector), which can be considered a compressed representation of the input to encoder 304A. In one embodiment, encoder 304A is a Convolutional Neural Network (CNN).
As shown, decoder 306A concatenates the mathematical representation z_g (FIG. 2) with the mathematical representation z_i to invert the output of encoder 304A. The concatenated mathematical representation (z_i + z_g) is then used to generate a reconstructed version of the input to encoder 304A. This reconstruction of the input is referred to as the reconstructed image at degree L_{i+1}. The reconstructed image represents a facial expression image showing a different degree of fatigue (e.g., drowsiness, tiredness, etc.) for each frame in the flow of facial expressions during each iteration.
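A minimal sketch of the CVAE portion of the facial fatigue level generator network is shown below. It assumes a PyTorch implementation with fully connected layers; the dimensions, the reparameterization step, and the exact way the condition c and the expression code z_g are concatenated are illustrative assumptions rather than the patent's architecture.

```python
import torch
import torch.nn as nn

class FatigueCVAE(nn.Module):
    """Sketch of encoder 304A / decoder 306A: an image plus condition is mapped
    to z_i, and the concatenated code (z_i + z_g) is decoded into an image."""
    def __init__(self, image_dim=64 * 64, cond_dim=10, zi_dim=64, zg_dim=128):
        super().__init__()
        # Encoder 304A: outputs parameters of the per-dimension distribution of z_i
        self.enc = nn.Sequential(nn.Linear(image_dim + cond_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, zi_dim)
        self.logvar = nn.Linear(256, zi_dim)
        # Decoder/generator 306A: consumes the concatenation of z_i and z_g
        self.dec = nn.Sequential(
            nn.Linear(zi_dim + zg_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, image_dim), nn.Sigmoid())

    def encode(self, x, c):
        h = self.enc(torch.cat([x, c], dim=1))
        return self.mu(h), self.logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)  # sample z_i ~ P(z | x, c)

    def forward(self, x, c, z_g):
        mu, logvar = self.encode(x, c)
        z_i = self.reparameterize(mu, logvar)
        recon = self.dec(torch.cat([z_i, z_g, c], dim=1))  # reconstructed image
        return recon, mu, logvar
```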
The GAN 306 includes a generator (or decoder) 306A and a discriminator 306B. In one embodiment, the GAN 306 is a CNN or a hybrid of two or more of the foregoing networks. As explained above, the generator 306A receives the concatenated learned representation (z_i + z_g) as input and outputs a reconstructed image.
GAN 306 also includes a discriminator 306B. Discriminator 306B is coupled to receive the original content and the reconstructed content from generator 306A, and learns to distinguish between "real" and "fake" samples of content (i.e., to predict whether reconstructed content is real or fake). This may be accomplished by training discriminator 306B to reduce a discriminator loss Loss_GD, where the discriminator loss Loss_GD indicates the cost of the discriminator generating an incorrect prediction in response to receiving the original content as well as the reconstructed content generated by the CVAE 304. In this manner, the parameters of the discriminator 306B are trained to distinguish between the original and reconstructed versions of the content based on the differences between the two versions that arise during the encoding process. For example, discriminator 306B receives as inputs the reconstructed image at degree L_{i+1} (or, at the initial degree L_0, the natural facial expression image) and the ground truth image at the same degree. To predict whether the reconstructed image is real or fake, the discriminator 306B is trained to minimize or reduce the loss function of the GAN 306. The minimum loss function of GAN 306 is defined as min GANLoss = Loss_GD + Loss_EP, where Loss_GD represents the discriminator loss and Loss_EP represents the reconstruction loss. Loss_GD can be calculated using the function Loss_GD = E[log(D(x))] + E[log(1 - D(G(z)))], and Loss_EP can be calculated as the dissimilarity between the reconstructed image and the ground truth image at the same degree, where D(·) is the discriminator, G(·) is the generator, E[·] is the expected value, and z is the learned representation (or code).
The facial fatigue generation network 302 predicts whether the reconstructed image is real or fake using the loss function, where the value of the minimized loss function for a real image should be smaller than the value of the minimized loss function for a fake image. In one embodiment, label 1 (real) is assigned to the original content, and label 0 (fake) is assigned to reconstructed content determined to be fake (not real). In this case, the discriminator 306B may predict the input content as a reconstructed (i.e., fake) version when the corresponding discrimination prediction is below a threshold, and may predict the input content as a real image when the corresponding prediction is above the threshold. In the next iteration, the reconstructed image classified as real replaces the image expressing the current fatigue degree, and the image expressing the current fatigue degree in turn replaces the image expressing the previous fatigue degree. In this manner, the discriminator 306B outputs, for each piece of input content and reconstructed content, a discrimination prediction indicating whether the input content is an original version or a reconstructed version.
It should be appreciated that the generator and/or discriminator may include various types of machine learning models. The machine learning models may include linear models and non-linear models. As examples, the machine learning model may include a regression model, a support vector machine, a decision tree-based model, a Bayesian model, and/or a neural network (e.g., a deep neural network). The neural network may include a feed-forward neural network, a recurrent neural network (e.g., a long short-term memory recurrent neural network), a convolutional neural network, or another form of neural network. Thus, although generators and discriminators are sometimes referred to as "networks," they are not necessarily limited to neural networks, but may also include other forms of machine learning models.
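To make the objective min GANLoss = Loss_GD + Loss_EP concrete, the sketch below computes both terms; the discriminator architecture and the use of an L1 distance for the reconstruction loss Loss_EP are assumptions chosen here for illustration, not the patent's exact formulation.

```python
import torch
import torch.nn as nn

# Discriminator 306B over flattened 64x64 images (architecture assumed)
discriminator = nn.Sequential(
    nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid())

def gan_losses(real_image, reconstructed_image, ground_truth_image):
    """Return (Loss_GD, Loss_EP) for one batch of flattened images."""
    eps = 1e-8
    d_real = discriminator(real_image)           # D(x)
    d_fake = discriminator(reconstructed_image)  # D(G(z))
    # Discriminator loss: Loss_GD = E[log D(x)] + E[log(1 - D(G(z)))]
    loss_gd = (torch.log(d_real + eps) + torch.log(1 - d_fake + eps)).mean()
    # Reconstruction loss Loss_EP: dissimilarity between the reconstructed
    # image and the ground-truth image at the same fatigue degree (L1 assumed)
    loss_ep = torch.abs(reconstructed_image - ground_truth_image).mean()
    return loss_gd, loss_ep  # combined objective: Loss_GD + Loss_EP
```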
Fig. 4A illustrates a video prediction and interpolation network. Video prediction and interpolation network 402 includes a frame interpolation network 402A and a Long Short-Term Memory (LSTM) auto-encoder network 402B. The LSTM effectively preserves motion trends (modes) and passes them on to predicted frames, while the interpolation network generates intermediate images between frames. Therefore, frames can be interpolated while maintaining the motion trend.
Frame interpolation network 402A is used to generate new frames from original frames of video content. In doing so, the network predicts one or more intermediate images at time steps (or timestamps) defined between two consecutive frames. The first neural network 410 approximates optical flow data that defines motion between the two consecutive frames. The second neural network 412 refines the optical flow data and predicts a visibility map for each time step. The two consecutive frames are warped according to the refined optical flow data for each time step to produce a pair of warped frames for each time step. The second neural network then fuses the pair of warped frames based on the visibility map to generate an intermediate image for each time step. In the predicted intermediate image, artifacts caused by motion boundaries and occlusion are reduced.
One example of frame interpolation is disclosed in the illustrated embodiment. When the frame interpolation network 402A is provided with, for example, the images at times i-1 and i, it can predict an intermediate (or interpolated) image at a time t ∈ (i-1, i). To perform interpolation, a bi-directional optical flow between the two input images is first calculated. In one embodiment, the optical flow may be calculated using a CNN. For example, the two input images may be used to train the CNN to jointly predict the forward optical flow F_{i-1→i} and the backward optical flow F_{i→i-1} (the optical flow between frames). Similarly, the frame interpolation network 402A may receive the images at times i and i+1, in which case the frame interpolation network jointly predicts the forward optical flow F_{i+1→i} and the backward optical flow F_{i→i+1}. As described in more detail below with reference to FIG. 4B, the frame interpolation network processes the input images and outputs the intermediate image. An example of an interpolation network is described in "Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation," Jiang et al., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
The LSTM autoencoder network 402B learns a representation of an image sequence. In one embodiment, the LSTM autoencoder network 402B uses recurrent neural networks (RNNs) formed from LSTM units or memory blocks for learning. For example, a first RNN is an encoder that maps an input image sequence 404 (e.g., a sequence of image frames) into a fixed-length representation, which is then decoded using a second RNN, such as a decoder. The encoder in the LSTM autoencoder network 402B encodes the images of the sequence, such as consecutive input images, to generate a representation of the input image sequence 404.
After reading the last input, the learned representation generated using the input sequence is processed using a decoder in the LSTM autoencoder network 402B. The decoder outputs a prediction of the generated target sequence for the input sequence. The target sequence is identical in content to the input sequence, in reverse order. In one embodiment, the decoder in LSTM autoencoder network 402B includes one or more LSTM layers and is configured to receive the current output in the target sequence to generate a corresponding output score. The output score for a given output indicates the likelihood that the output is the next output in the target sequence, i.e., predicts whether the output represents the next output in the target sequence. As part of generating the output score, the decoder also updates the hidden state of the network to generate an updated hidden state. The LSTM autoencoder network 402B is further explained below with reference to fig. 4C.
An example LSTM neural network is described in "Unsupervised Learning of Video Representations using LSTMs," Srivastava et al., University of Toronto, 2016 (January).
Fig. 4B illustrates an exemplary frame interpolation network according to fig. 4A. Frame interpolation network 402A includes an encoder 410 and a decoder 412, which fuse warped input images to generate an intermediate image. More specifically, the two input images are warped to a particular time step t, and the two warped images are adaptively fused to generate the intermediate image at time t. In one embodiment, a flow computation CNN is used to estimate the bi-directional optical flow between the two input images, and a flow interpolation CNN is used to refine the flow approximation and predict the visibility maps. The visibility maps may then be applied to the two warped images before fusion, so that artifacts in the interpolated intermediate image are reduced. In one embodiment, the flow computation CNN and the flow interpolation CNN use the U-Net architecture described in "U-Net: Convolutional Networks for Biomedical Image Segmentation," Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
In the illustrated embodiment, and for the input images at times i and i+1, the frame interpolation network 402A includes: a flow interpolation network, which in one embodiment is implemented as the encoder 410; an intermediate optical flow network, which in one embodiment is implemented as the decoder 412; and an optical flow warping unit for the flows (F_{t→i}, F_{t→i+1}). The encoder 410 receives the sequence image pair at time stamps (i, i+1), computes the bi-directional optical flows F_{i→i+1} and F_{i+1→i} based on the sequence image pair, linearly combines the bi-directional optical flows to approximate an intermediate bi-directional optical flow for at least one time step t between the two input images in the sequence image pair, and warps (backwards) each of the input images according to the approximate intermediate bi-directional optical flow for each time step to produce the warped input frames I_{i→t} and I_{i+1→t}.
In one embodiment, the decoder 412 includes: a flow refinement network (not shown) corresponding to each warping unit, and an image predictor (not shown) for predicting the intermediate image at time t ∈ (i, i+1). For each time step, the intermediate bi-directional optical flow is refined using the two input frames, the intermediate bi-directional optical flow, and the two warped input images. The refined intermediate bi-directional optical flow (F_{t→i}, F_{t→i+1}) is output and processed by the image prediction unit to generate the intermediate image. In one embodiment, the image prediction unit receives the warped input frames I_{t→i} and I_{t→i+1} generated by the optical flow warping unit, and the warped input frames are linearly fused by the image prediction unit to produce the intermediate image for each time step.
As described above, in one embodiment, the visibility maps are applied to the two warped images. To account for occlusions, the flow refinement network in the decoder 412 predicts visibility maps V_{t←i} and V_{t←i+1} per time step. Due to the use of the visibility maps, when a pixel is visible in both input images, the decoder 412 learns to adaptively combine information from both images. In one embodiment, the visibility maps are applied to the warped images before the warped images are linearly fused to produce the intermediate image for each time step.
The intermediate image is synthesized according to Î_t = (1/Z) · ((1 - t) · V_{t←i} · u(I_i, F_{t→i}) + t · V_{t←i+1} · u(I_{i+1}, F_{t→i+1})), where Z = (1 - t) · V_{t←i} + t · V_{t←i+1} is a normalization factor, and an interpolation loss function is defined in terms of these quantities, where u(·) is a backward warping function, F_{t→i} and F_{t→i+1} are the intermediate optical flows derived from the two bi-directional optical flows F_{i→i+1} and F_{i+1→i} between the input images, and V_{t←i} and V_{t←i+1} are two visibility maps indicating whether a pixel is still visible, as in "U-Net: Convolutional Networks for Biomedical Image Segmentation." By applying the visibility maps to the warped images prior to fusion, the interpolated intermediate image Î_t excludes the contribution of any occluded pixels, thereby avoiding or reducing artifacts.
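The warp-and-fuse synthesis step can be sketched as follows, assuming PyTorch tensors. The backward warping function u(·) is implemented here with grid sampling; the intermediate flows and visibility maps are taken as precomputed inputs because the patent's flow refinement network is not reproduced.

```python
import torch
import torch.nn.functional as F

def backward_warp(image, flow):
    """u(I, F): sample image (N, C, H, W) at locations displaced by flow (N, 2, H, W)."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).expand(n, -1, -1, -1)
    coords = grid + flow                       # where each output pixel samples from
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0   # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)
    return F.grid_sample(image, sample_grid, align_corners=True)

def synthesize_intermediate(i_0, i_1, f_t0, f_t1, v_t0, v_t1, t):
    """Fuse the two warped inputs with visibility maps to form the frame at time t."""
    warped_0 = backward_warp(i_0, f_t0)   # u(I_i,     F_{t->i})
    warped_1 = backward_warp(i_1, f_t1)   # u(I_{i+1}, F_{t->i+1})
    numerator = (1 - t) * v_t0 * warped_0 + t * v_t1 * warped_1
    z = (1 - t) * v_t0 + t * v_t1         # normalization factor
    return numerator / (z + 1e-8)
```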
Fig. 4C illustrates an example of the video prediction and interpolation network of fig. 4A with an expanded view of the LSTM auto-encoder. The basic building block of the LSTM auto-encoder 402B is the LSTM memory cell, denoted RNN in the figure. Each LSTM memory cell has a state at time t. Each LSTM memory block may include one or more cells that can be accessed for reading or modification. Each cell includes an input gate, a forget gate, and an output gate that allow the cell to store previous activations generated by the cell, for example, as hidden states for generating a current activation or as hidden states to be provided to other components of the LSTM autoencoder 402B. As described above, these LSTM memory blocks form the RNNs in which learning is performed.
In the illustrated embodiment, the encoder 403 consists of a multi-layer RNN, with arrows showing the direction of information flow. The encoder 403 (in each layer) receives a sequence of input images and the corresponding intermediate images generated by the frame interpolation networks 402A-402N, respectively. The input sequence is a collection of images collected by the sensor or retrieved from a database. As the input sequence is processed, the current hidden state, which was generated by processing previous inputs from the input sequence, is updated, i.e. modified by processing the currently received input. The corresponding weights w1 may then be applied to the previous hidden state and the input vector.
The learned representation of the input sequence is then processed using a decoder 405 to generate a target sequence for the input sequence. The decoder 405 also comprises a multi-layer RNN, where the arrows show the direction of information flow, and predicts the output at time step t. Each RNN cell accepts the hidden state from the previous element and produces and outputs its own hidden state. In one embodiment, the output is calculated using the hidden state at the current time step and corresponding weights w2. The final output may be determined using a probability vector, such as Softmax or some other known function.
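To make the encoder-decoder data flow concrete, the following is a minimal sketch of an LSTM sequence autoencoder of the kind described above. PyTorch is assumed for illustration only; the layer sizes, names, and the zero-input decoding scheme are placeholders rather than the patent's implementation.

import torch
import torch.nn as nn

class LSTMSeqAutoencoder(nn.Module):
    """Encoder RNN summarizes a frame-feature sequence; decoder RNN unrolls a target sequence."""
    def __init__(self, feat_dim=256, hidden_dim=512, num_layers=2):
        super().__init__()
        # Multi-layer encoder RNN: consumes per-frame features of the (interpolated) sequence.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)
        # Multi-layer decoder RNN: unrolls outputs from the encoder's final hidden state.
        self.decoder = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)
        # Output projection, playing the role of the weights w2 before a Softmax/regression head.
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):
        # x: (batch, seq_len, feat_dim) sequence of per-frame feature vectors.
        _, (h, c) = self.encoder(x)          # learned representation = final hidden/cell states
        dec_in = torch.zeros_like(x)         # placeholder decoder inputs (no teacher forcing)
        dec_out, _ = self.decoder(dec_in, (h, c))
        return self.out(dec_out)             # predicted target sequence, arranged in output order

# Usage: reconstruct a 16-frame sequence of 256-dimensional frame features.
model = LSTMSeqAutoencoder()
seq = torch.randn(4, 16, 256)
loss = nn.functional.mse_loss(model(seq), seq)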
Fig. 5A-5D illustrate example flow diagrams in accordance with embodiments of the present technique. In an embodiment, the flow diagrams may be computer-implemented methods performed at least in part by hardware and/or software components illustrated in the various figures and described herein. In one embodiment, the disclosed process may be performed by the driver fatigue system 106 disclosed in fig. 1A and 1B. In one embodiment, a software component executed by one or more processors, such as processor 108, or processor 802, performs at least a portion of the described flow.
Fig. 5A shows a flow chart of compilation of a video data set for a pseudo-fatigue state. The data sets are generated using a transfer learning technique and may be used to train an application to identify driver fatigue. At step 502, a neural network such as AE or CVAE generates a facial expression image learned from an arbitrary face image. In one embodiment, the facial expression image is reconstructed from the learned representation of any facial image learned from the neural network. The learned representation may then be applied during a training phase of the fatigue level generator network.
At step 504, another neural network is trained using neutral, natural or normal facial images in which little expression is seen, and image streams representing different degrees of fatigue. An image representing the current degree of fatigue is generated using the neutral face image and the image stream. As the process iterates, the image stream changes so that the current image becomes the previous image and a new current image is generated. In each iteration, an image representing the current degree of fatigue is generated from the neutral face image and an image representing the degree of fatigue before the current degree of fatigue based on the learned representation. An image (or reconstructed image) representing the current degree of fatigue is reconstructed from the representation learned in step 502 and a second representation of the neutral facial image learned from the neural network. The reconstructed image may then be compared to a ground truth model to determine whether the reconstructed image is authentic or false, as discussed below.
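A minimal sketch of this iterative generation loop is shown below; the helper callables generate_next_level and looks_real are hypothetical stand-ins for the trained generator and discriminator described in this section, not names from the patent.

def generate_fatigue_sequence(neutral_image, learned_representation,
                              generate_next_level, looks_real, num_levels=10):
    """Produce one image per fatigue degree, conditioning each step on the previous degree."""
    previous = neutral_image        # degree 0: neutral face with little visible expression
    sequence = []
    for degree in range(1, num_levels + 1):
        # Generate the image for the current degree from the neutral face and the
        # image expressing the degree of fatigue before the current degree.
        current = generate_next_level(neutral_image, previous, learned_representation)
        if looks_real(current):     # discriminator accepts the reconstruction as real
            sequence.append(current)
            previous = current      # the current image becomes the previous image
    return sequence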
At step 506, intermediate images of the interpolated video data (sequence image data) are generated from the reconstructed facial expression and the arbitrary facial image during the corresponding optical flows. In one embodiment, the optical flow is formed by fusing the reconstructed facial expression with the arbitrary facial image and is located in a time frame between the reconstructed facial expression and the arbitrary facial image.
At step 508, the false fatigue status video (i.e., the data set) of the driver is compiled using at least the reconstructed facial expressions and arbitrary facial images and the intermediate images of the interpolated video data to train an application therein to detect driver fatigue.
Referring to fig. 5B, illustrated is an example flow diagram of a neural network, such as CVAE 304. In the depicted embodiment, the CVAE includes an encoder and a decoder. At step 510, the facial expression image indicating a fatigue degree and the neutral facial image are encoded, and parameters describing the distribution of each dimension of the mathematical representation are output. At step 512, the distribution of each dimension of the mathematical representation is decoded, and the relationship of each parameter with respect to the output loss is calculated, to reconstruct the neutral face image and the facial expression image into a reconstructed image. At step 514, the reconstructed image is compared to the neutral image to generate a discriminator loss, and at step 516, the reconstructed image is compared to the ground truth at the same fatigue level to generate a reconstruction loss. At step 518, based on the discriminator loss and the reconstruction loss, a prediction is made as to the likelihood that the reconstructed image has an appearance corresponding to the neutral image. At step 520, the reconstructed image is output as a real image and propagated back to the input of the CVAE for the next iteration.
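As a rough illustration of how the discriminator loss (step 514) and reconstruction loss (step 516) might be combined during training, here is a hedged PyTorch sketch; the discriminator, the binary-cross-entropy objective, and the weighting factor lambda_rec are assumptions for illustration, not taken from the patent.

import torch
import torch.nn.functional as F

def cvae_gan_losses(reconstructed, neutral, ground_truth, discriminator, lambda_rec=10.0):
    """Returns (discriminator loss, generator loss); assumes discriminator outputs probabilities in [0, 1]."""
    real_score = discriminator(neutral)
    fake_score = discriminator(reconstructed)
    # Discriminator loss: distinguish the real (neutral) image from the reconstruction.
    # In practice the reconstruction would be detached when updating the discriminator.
    d_loss = F.binary_cross_entropy(real_score, torch.ones_like(real_score)) + \
             F.binary_cross_entropy(fake_score, torch.zeros_like(fake_score))
    # Reconstruction loss: dissimilarity to the ground-truth image at the same fatigue level.
    rec_loss = F.l1_loss(reconstructed, ground_truth)
    # Generator objective: appear real to the discriminator while matching the ground truth.
    g_loss = F.binary_cross_entropy(fake_score, torch.ones_like(fake_score)) + lambda_rec * rec_loss
    return d_loss, g_loss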
In fig. 5C, the neural network in step 502 maps the arbitrary face image to a corresponding learned representation at 522, and the learned representation is mapped to a facial expression image having the same shape or the same image size as the arbitrary face image (e.g., the reconstructed image has the same number of columns and rows as the arbitrary image) at 524.
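For illustration of steps 522-524, a minimal convolutional encoder/decoder whose output keeps the same number of rows and columns as its input is sketched below; PyTorch and the 64x64 input size are assumptions, not the patent's architecture.

import torch
import torch.nn as nn

class FaceAE(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Encoder: 3x64x64 face image -> learned representation (step 522).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        # Decoder: learned representation -> 3x64x64 image with the input's rows/columns (step 524).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(), # 32 -> 64
        )

    def forward(self, x):
        z = self.encoder(x)      # learned representation of the arbitrary face image
        return self.decoder(z)   # reconstructed facial expression image, same spatial shape

faces = torch.rand(2, 3, 64, 64)
recon = FaceAE()(faces)
assert recon.shape == faces.shape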
With continued reference to fig. 5D, generation of an intermediate image is described in connection with GAN 306. At step 526, an intermediate image between the facial expression image and the arbitrary face image is predicted during the corresponding optical flow. At step 528, the images are interpolated to generate corresponding optical flows to generate therein a false fatigue state video of the driver. At step 530, the sequence of intermediate images is arranged in input order, and at step 532, the sequence of intermediate images is processed using an encoder to convert the sequence of intermediate images into an alternative representation. Finally, at step 534, the alternative representation of the intermediate image sequence is processed using a decoder to generate a target sequence of the intermediate image sequence, wherein the target sequence comprises a plurality of outputs arranged according to an output order.
FIG. 6 illustrates a computing system upon which embodiments of the present disclosure may be implemented. The computing system 600 may be programmed (e.g., by computer program code or instructions) to improve driver safety through driver fatigue (tiredness) detection as described herein and include a communication mechanism such as a bus 610 for passing information between other internal and external components of the computer system 600. In one embodiment, computer system 600 is system 106 in FIG. 1B. The computer system 600 or a portion thereof constitutes a means for performing one or more steps for improving driver safety through driver distraction (including driver fatigue) detection.
A bus 610 includes one or more parallel conductors of information so that information is transferred quickly among devices coupled to the bus 610. One or more processors 602 for processing information are coupled with the bus 610.
The one or more processors 602 perform a set of operations on information (or data) specified by the computer program code, the information (or data) relating to improving driver safety through driver distraction detection. The computer program code is a set of instructions or statements providing instructions for the operation of the processor and/or the computer system to perform specified functions. For example, the code may be written in a computer programming language that is compiled into the native instruction set of the processor. The code may also be written directly in the native instruction set (e.g., machine language). The set of operations includes bringing information in from the bus 610 and placing information on the bus 610. Each operation in the set of operations that can be performed by the processor is represented to the processor by information called an instruction, such as an operation code of one or more digits. A sequence of operations, such as a sequence of operation codes, performed by the processor 602 constitutes processor instructions, also referred to as computer system instructions or, simply, computer instructions.
Computer system 600 also includes a memory 604 coupled to bus 610. The memory 604, such as a Random Access Memory (RAM) or any other dynamic storage device, stores information including processor instructions for improving driver safety through driver distraction detection. Dynamic memory allows computer system 600 to change the information stored therein. RAM allows a unit of information stored at a location called a memory address to be stored and retrieved independently of information at neighboring addresses. The memory 604 is also used by the processor 602 to store temporary values during execution of processor instructions. Computer system 600 also includes a Read Only Memory (ROM) 606 or any other static storage device coupled to bus 610 for storing static information. A non-volatile (persistent) storage device 608, such as a magnetic disk, optical disk or flash card, for storing information, including instructions, is also coupled to bus 610.
In one embodiment, information including instructions for improving the safety of a tired driver using the information processed by the above-described systems and embodiments is provided to bus 610 for use by the processor from an external input device 612, such as a keyboard operated by a human user, a microphone, an Infrared (IR) remote control, a joystick, a game pad, a stylus pen, a touch screen, a head mounted display, or sensors. A sensor detects conditions in its vicinity and converts those detections into physical expressions compatible with the measurable phenomena used to represent information in computer system 600. Other external devices coupled to bus 610, used primarily for interacting with humans, include: a display device 614 for presenting text or images; a pointing device 616, such as a mouse, trackball, cursor direction keys, or motion sensor, for controlling the position of a small cursor image presented on the display 614 and issuing commands associated with graphical elements presented on the display 614; and one or more camera sensors 684 for capturing, recording, and storing one or more still and/or moving images (e.g., video, movies, etc.), which may also include audio recordings.
In the illustrated embodiment, dedicated hardware, such as an Application Specific Integrated Circuit (ASIC) 620, is coupled to the bus 610. The dedicated hardware is used to perform, for special purposes, operations that processor 602 cannot perform quickly enough.
Computer system 600 also includes a communication interface 670 coupled to bus 610. Communication interface 670 provides a one-way or two-way communication coupling to a variety of external devices that operate in conjunction with its own processor. Generally speaking, the coupling utilizes a network link 678 that is connected to a local network 680 to which various external devices, such as servers or databases, may be connected. Alternatively, link 678 may be directly connected to an Internet Service Provider (ISP) 684 or to a network 690, such as the Internet. Network link 678 may be wired or wireless. For example, communication interface 670 may be a parallel port or a serial port or a Universal Serial Bus (USB) port on a personal computer. In some embodiments, communications interface 670 is an Integrated Services Digital Network (ISDN) card or a Digital Subscriber Line (DSL) card or a telephone modem that provides an information communication connection to a corresponding type of telephone line. In some embodiments, a communication interface 670 is a cable modem that converts signals on bus 610 into signals for a communication connection over a coaxial cable or into optical signals for a communication connection over a fiber optic cable. As another example, communication interface 670 may be a Local Area Network (LAN) card to provide a data communication connection to a compatible LAN, such as ethernet. Wireless links may also be implemented. For wireless links, the communication interface 670 sends and/or receives electrical, acoustic, or electromagnetic signals, including infrared and optical signals, that carry information streams, such as digital data. For example, in a wireless handheld device, such as a mobile telephone like a cellular telephone, the communications interface 670 includes a radio band electromagnetic transmitter and receiver referred to as a radio transceiver. In certain embodiments, the communications interface 670 enables connection to a communications network in order to improve the safety of tired drivers using mobile devices such as mobile phones or tablets.
Network link 678 typically provides information through one or more networks using transmission media to other devices that use or process the information. For example, network link 678 may provide a connection through local network 680 to a host computer 682 or to equipment 684 operated by an ISP. ISP equipment 684 in turn provides data communication services through the public, world-wide packet-switching communication network of networks now commonly referred to as the Internet 690.
A computer called a server host 682 connected to the internet hosts a process that provides a service in response to information received on the internet. For example, server host 682 hosts a process that provides information representing video data for presentation at display 614. It is contemplated that the components of system 600 may be deployed in a variety of configurations within other computer systems, e.g., host 682 and server 682.
At least some embodiments of the present disclosure are related to the use of computer system 600 for implementing some or all of the techniques described herein. According to one embodiment of the disclosure, these techniques are performed by computer system 600 in response to processor 602 executing one or more sequences of one or more processor instructions contained in memory 604. These instructions, also referred to as computer instructions, software, and program code, may be read into memory 604 from another computer-readable medium, such as storage device 608 or network link 678. Execution of the sequences of instructions contained in memory 604 causes processor 602 to perform one or more of the method steps described herein.
It should be understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete, and will fully convey the disclosure to those skilled in the art. Indeed, the present subject matter is intended to cover alternatives, modifications, and equivalents of these embodiments, which are included within the scope and spirit of the present subject matter as defined by the appended claims. Furthermore, in the following detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. It will be apparent, however, to one of ordinary skill in the art that the present subject matter may be practiced without these specific details.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Computer-readable non-transitory media include all types of computer-readable media, including magnetic, optical, and solid-state storage media, and specifically exclude signals. It should be understood that the software may be installed in and sold with the device. Alternatively, the software may be retrieved and loaded into the device, including retrieving the software via disk media or from any form of network or distribution system, including for example retrieving the software from a server owned by the software creator or from a server that is not owned but used by the software creator. For example, the software may be stored on a server for distribution over the internet.
Computer-readable storage media exclude propagated signals themselves, accessible by a computer and/or processor, and include volatile and non-volatile internal and/or external media, which may be removable and/or non-removable. For computers, various types of storage media may store data in any suitable digital format. It should be appreciated by those skilled in the art that other types of computer readable media can be employed such as zip drives, solid state drives, magnetic tape, flash memory cards, flash drives, magnetic cassettes, and the like, for storing computer executable instructions for performing the novel methods (acts) of the disclosed architecture.
The terminology used herein is for the purpose of describing particular aspects only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects disclosed herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various modifications as are suited to the particular use contemplated.
For purposes of this document, each flow associated with the disclosed technology may be performed continuously by one or more computing devices. Each step in the flow may be performed by the same or a different computing device as used in the other steps, and each step does not necessarily have to be performed by a single computing device.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (30)

1. A computer-implemented method for training an application to identify driver fatigue, comprising:
generating a plurality of first facial expression images from a plurality of second facial expression images using a first neural network, wherein the plurality of first facial expression images are reconstructed from first representations of the plurality of second facial expression images learned from the first neural network;
generating a first image expressing a current fatigue degree from a third facial expression image and a second image expressing a fatigue degree before the current fatigue degree using a second neural network based on the first representation, wherein the first image and the second image are reconstructed from the first representation and a second representation of the third facial expression image learned from the second neural network;
generating a plurality of intermediate images of interpolated video data from the first image and the second image during respective optical flows, wherein the optical flows are formed by fusing the first image and the second image and are located in a time frame between the first image and the second image; and
compiling a false fatigue state video of a driver using at least the first and second images and the plurality of intermediate images of the interpolated video data to train the application therein to detect the driver fatigue.
2. The computer-implemented method of claim 1, wherein the first neural network performs the steps of:
mapping the plurality of second facial expression images to respective first representations; and
mapping the respective first representations to the plurality of first facial expression images having the same shape as the plurality of second facial expression images.
3. The computer-implemented method of claim 1, wherein the second neural network comprises a conditional variational auto-encoder that performs the steps of:
encoding the third facial expression image and the second image and outputting a parameter describing a distribution of each dimension of the second representation; and
decoding the distribution for each dimension of the second representation by calculating a relationship of each parameter with respect to output loss to reconstruct the third facial expression image and the second image.
4. The computer-implemented method of any of claims 1 to 3, wherein the second neural network further comprises a generative adversarial network that performs the steps of:
comparing the reconstructed image to the third facial expression image to generate a discriminator loss;
comparing the reconstructed image with ground truth images at the same degree to generate a reconstruction loss;
predicting a likelihood that the reconstructed image has an appearance corresponding to the third facial expression image based on the discriminator loss and the reconstruction loss; and
when the prediction classifies the first image as a real image, the reconstructed image is output as the first image expressing a current degree of fatigue and is input to the conditional variational auto-encoder as the second image expressing a degree of fatigue before the current degree of fatigue.
5. The computer-implemented method of any of claims 1-4, wherein the reconstruction loss indicates a degree of dissimilarity between the third facial expression image and the reconstructed image, and the discriminator loss indicates a cost of generating an incorrect prediction that the reconstructed image has the appearance of the third facial expression image.
6. The computer-implemented method of any of claims 1 to 4, further comprising: iteratively generating the first image at different degrees of fatigue according to differences between the first image and the second image at different time frames until a total value of the reconstruction loss and the discriminator loss meets a predetermined criterion.
7. The computer-implemented method of claim 1, wherein generating the plurality of intermediate images further comprises:
predicting an intermediate image between the first image and the second image during the respective optical flows; and
interpolating the first image and the second image to generate the respective optical flows to generate the false fatigue state video of the driver therein.
8. The computer-implemented method of claim 1, wherein generating the plurality of intermediate images further comprises:
receiving a sequence of intermediate images arranged in an input order;
processing the sequence of intermediate images using an encoder to convert the sequence of intermediate images into an alternative representation of the sequence of intermediate images; and
processing, using a decoder, the alternative representation of the sequence of intermediate images to generate a target sequence of the sequence of intermediate images, the target sequence comprising a plurality of outputs arranged according to an output order.
9. The computer-implemented method of claim 1, wherein the first representation maps the plurality of second facial expression images to the first representation through a learning distribution.
10. The computer-implemented method of claim 1, wherein the second representation maps the third facial expression image to the second representation through a learning distribution.
11. An apparatus for training an application to identify driver fatigue, comprising:
a non-transitory memory comprising instructions; and
one or more processors in communication with the memory, wherein the one or more processors execute the instructions to:
generating a plurality of first facial expression images from a plurality of second facial expression images using a first neural network, wherein the plurality of first facial expression images are reconstructed from first representations of the plurality of second facial expression images learned from the first neural network;
generating a first image expressing a current fatigue degree from a third facial expression image and a second image expressing a fatigue degree before the current fatigue degree using a second neural network based on the first representation, wherein the first image and the second image are reconstructed from the first representation and a second representation of the third facial expression image learned from the second neural network;
generating a plurality of intermediate images of interpolated video data from the first image and the second image during respective optical flows, wherein the optical flows are formed by fusing the first image and the second image and are located in a time frame between the first image and the second image; and
compiling a false fatigue state video of a driver using at least the first and second images and the plurality of intermediate images of the interpolated video data to train the application therein to detect the driver fatigue.
12. The apparatus of claim 11, wherein the first neural network performs the steps of:
mapping the plurality of second facial expression images to respective first representations; and
mapping the respective first representations to the plurality of first facial expression images having the same expression as the plurality of second facial expression images.
13. The apparatus of claim 11, wherein the second neural network comprises a conditional variational auto-encoder that performs the steps of:
encoding the third facial expression image and the second image and outputting a parameter describing a distribution of each dimension of the second representation; and
decoding the distribution for each dimension of the second representation by calculating a relationship of each parameter with respect to output loss to reconstruct the third facial expression image and the second image.
14. The apparatus of any one of claims 11 to 13, wherein the second neural network further comprises a generative adversarial network that performs the steps of:
comparing the reconstructed image to the third facial expression image to generate a discriminator loss;
comparing the reconstructed image with ground truth images at the same degree to generate a reconstruction loss;
predicting a likelihood that the reconstructed image has an appearance corresponding to the third facial expression image based on the discriminator loss and the reconstruction loss; and
when the prediction classifies the first image as a real image, the reconstructed image is output as the first image expressing a current degree of fatigue and is input to the conditional variational auto-encoder as the second image expressing a degree of fatigue before the current degree of fatigue.
15. The apparatus of any one of claims 11 to 14, wherein the reconstruction loss indicates a degree of dissimilarity between the third facial expression image and the reconstructed image, and the discriminator loss indicates a cost of generating an incorrect prediction that the reconstructed image has the appearance of the third facial expression image.
16. The device of any one of claims 11 to 14, wherein the one or more processors execute the instructions to: iteratively generating the first image at different degrees of fatigue according to differences between the first image and the second image at different time frames until a total value of the reconstruction loss and the discriminator loss meets a predetermined criterion.
17. The apparatus of claim 11, wherein generating the plurality of intermediate images further comprises:
predicting an intermediate image between the first image and the second image during the respective optical flows; and
interpolating the first image and the second image to generate the respective optical flows to generate the false fatigue state video of the driver therein.
18. The apparatus of claim 11, wherein generating the plurality of intermediate images further comprises:
receiving a sequence of intermediate images arranged in an input order;
processing the sequence of intermediate images using an encoder to convert the sequence of intermediate images into an alternative representation of the sequence of intermediate images; and
processing, using a decoder, the alternative representation of the sequence of intermediate images to generate a target sequence of the sequence of intermediate images, the target sequence comprising a plurality of outputs arranged according to an output order.
19. The apparatus of claim 11, wherein the first representation maps the plurality of second facial expression images to the first representation through a learning distribution.
20. The apparatus of claim 11, wherein the second representation maps the third facial expression image to the second representation through a learning distribution.
21. A non-transitory computer readable medium storing computer instructions for training an application to identify driver fatigue, which when executed by one or more processors, cause the one or more processors to perform the steps of:
generating a plurality of first facial expression images from a plurality of second facial expression images using a first neural network, wherein the plurality of first facial expression images are reconstructed from first representations of the plurality of second facial expression images learned from the first neural network;
generating a first image expressing a current fatigue degree from a third facial expression image and a second image expressing a fatigue degree before the current fatigue degree using a second neural network based on the first representation, wherein the first image and the second image are reconstructed from the first representation and a second representation of the third facial expression image learned from the second neural network;
generating a plurality of intermediate images of interpolated video data from the first image and the second image during respective optical flows, wherein the optical flows are formed by fusing the first image and the second image and are located in a time frame between the first image and the second image; and
compiling a false fatigue state video of a driver using at least the first and second images and the plurality of intermediate images of the interpolated video data to train the application therein to detect the driver fatigue.
22. The non-transitory computer readable medium of claim 21, wherein the first neural network performs the steps of:
mapping the plurality of second facial expression images to respective first representations; and
mapping the respective first representations to the plurality of first facial expression images having the same expression as the plurality of second facial expression images.
23. The non-transitory computer readable medium of claim 21, wherein the second neural network comprises a conditional variational auto-encoder that performs the steps of:
encoding the third facial expression image and the second image and outputting a parameter describing a distribution of each dimension of the second representation; and
decoding the distribution for each dimension of the second representation by calculating a relationship of each parameter with respect to output loss to reconstruct the third facial expression image and the second image.
24. The non-transitory computer readable medium of any one of claims 21 to 23, wherein the second neural network further comprises a generative adversarial network that performs the steps of:
comparing the reconstructed image to the third facial expression image to generate a discriminator loss;
comparing the reconstructed image with ground truth images at the same degree to generate a reconstruction loss;
predicting a likelihood that the reconstructed image has an appearance corresponding to the third facial expression image based on the discriminator loss and the reconstruction loss; and
when the prediction classifies the first image as a real image, the reconstructed image is output as the first image expressing a current degree of fatigue and is input to the conditional variational auto-encoder as the second image expressing a degree of fatigue before the current degree of fatigue.
25. The non-transitory computer readable medium of any one of claims 21-24, wherein the reconstruction loss indicates a degree of dissimilarity between the third facial expression image and the reconstructed image, and the discriminator loss indicates a cost of generating an incorrect prediction that the reconstructed image has the appearance of the third facial expression image.
26. The non-transitory computer readable medium of any one of claims 21 to 24, further comprising: iteratively generating the first image at different degrees of fatigue according to differences between the first image and the second image at different time frames until a total value of the reconstruction loss and the discriminator loss meets a predetermined criterion.
27. The non-transitory computer-readable medium of claim 21, wherein generating the plurality of intermediate images further comprises:
predicting an intermediate image between the first image and the second image during the respective optical flows; and
interpolating the first image and the second image to generate the respective optical flows to generate the false fatigue state video of the driver therein.
28. The non-transitory computer-readable medium of claim 21, wherein generating the plurality of intermediate images further comprises:
receiving a sequence of intermediate images arranged in an input order;
processing the sequence of intermediate images using an encoder to convert the sequence of intermediate images into an alternative representation of the sequence of intermediate images; and
processing, using a decoder, the alternative representation of the sequence of intermediate images to generate a target sequence of the sequence of intermediate images, the target sequence comprising a plurality of outputs arranged according to an output order.
29. The non-transitory computer-readable medium of claim 21, wherein the first representation maps the plurality of second facial expression images to the first representation through a learning distribution.
30. The non-transitory computer readable medium of claim 21, wherein the second representation maps the third facial expression image to the second representation through a learning distribution.
CN201980097422.6A 2019-12-05 2019-12-05 System and method for generating video data sets with different fatigue degrees through transfer learning Pending CN114303177A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2019/064694 WO2020226696A1 (en) 2019-12-05 2019-12-05 System and method of generating a video dataset with varying fatigue levels by transfer learning

Publications (1)

Publication Number Publication Date
CN114303177A true CN114303177A (en) 2022-04-08

Family

ID=69024680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980097422.6A Pending CN114303177A (en) 2019-12-05 2019-12-05 System and method for generating video data sets with different fatigue degrees through transfer learning

Country Status (3)

Country Link
EP (1) EP4042318A1 (en)
CN (1) CN114303177A (en)
WO (1) WO2020226696A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112617835B (en) * 2020-12-17 2022-12-13 南京邮电大学 Multi-feature fusion fatigue detection method based on transfer learning
CN112258428A (en) * 2020-12-21 2021-01-22 四川圣点世纪科技有限公司 Finger vein enhancement method and device based on cycleGAN
CN112884030B (en) * 2021-02-04 2022-05-06 重庆邮电大学 Cross reconstruction based multi-view classification system and method
WO2022232875A1 (en) * 2021-05-05 2022-11-10 Seeing Machines Limited Systems and methods for detection of mobile device use by a vehicle driver
CN113239834B (en) * 2021-05-20 2022-07-15 中国科学技术大学 Sign language recognition system capable of pre-training sign model perception representation
US11922320B2 (en) 2021-06-09 2024-03-05 Ford Global Technologies, Llc Neural network for object detection and tracking
CN113542271B (en) * 2021-07-14 2022-07-26 西安电子科技大学 Network background flow generation method based on generation of confrontation network GAN
CN114403878B (en) * 2022-01-20 2023-05-02 南通理工学院 Voice fatigue detection method based on deep learning
CN115439836B (en) * 2022-11-09 2023-02-07 成都工业职业技术学院 Healthy driving assistance method and system based on computer

Also Published As

Publication number Publication date
WO2020226696A1 (en) 2020-11-12
EP4042318A1 (en) 2022-08-17

Similar Documents

Publication Publication Date Title
CN114303177A (en) System and method for generating video data sets with different fatigue degrees through transfer learning
JP7011578B2 (en) Methods and systems for monitoring driving behavior
US10846888B2 (en) Systems and methods for generating and transmitting image sequences based on sampled color information
US11625856B2 (en) Localization systems and methods
CN108725440B (en) Forward collision control method and apparatus, electronic device, program, and medium
KR102470680B1 (en) Motion recognition, driving motion analysis method and device, electronic device
US20180121733A1 (en) Reducing computational overhead via predictions of subjective quality of automated image sequence processing
CN111566612A (en) Visual data acquisition system based on posture and sight line
Rangesh et al. Driver gaze estimation in the real world: Overcoming the eyeglass challenge
EP4060560A1 (en) Systems, methods, and storage media for generating synthesized depth data
US20220058407A1 (en) Neural Network For Head Pose And Gaze Estimation Using Photorealistic Synthetic Data
Gurram et al. Monocular depth estimation through virtual-world supervision and real-world sfm self-supervision
JP7118136B2 (en) PASSENGER STATE DETERMINATION DEVICE, WARNING OUTPUT CONTROL DEVICE AND PASSENGER STATE DETERMINATION METHOD
US11288543B1 (en) Systems and methods for depth refinement using machine learning
WO2021034864A1 (en) Detection of moment of perception
KR20190119510A (en) Vision-based sample-efficient reinforcement learning framework for autonomous driving
CN110858316A (en) Classifying time series image data
EP3663965A1 (en) Method for predicting multiple futures
JP7357150B2 (en) Joint rolling shutter correction and image blur removal
WO2023168957A1 (en) Pose determination method and apparatus, electronic device, storage medium, and program
Sudha et al. On-road driver facial expression emotion recognition with parallel multi-verse optimizer (PMVO) and optical flow reconstruction for partial occlusion in internet of things (IoT)
US11099396B2 (en) Depth map re-projection based on image and pose changes
US20230334907A1 (en) Emotion Detection
US11574468B2 (en) Simulation-based learning of driver interactions through a vehicle window
US20220164630A1 (en) 3d separable deep convolutional neural network for moving object detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination