EP4042318A1

EP4042318A1 - System and method of generating a video dataset with varying fatigue levels by transfer learning

Info

Publication number: EP4042318A1
Application number: EP19828454.9A
Authority: EP
Inventors: Chengcheng JIA; Lei Yang
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2019-12-05
Filing date: 2019-12-05
Publication date: 2022-08-17
Also published as: CN114303177A; WO2020226696A1

Abstract

The disclosure relates to technology for training an application to recognize driver fatigue. Facial expression images are reconstructed from a first representation of images learned from a first neural network. An image is generated expressing a current level or fatigue from images generated at a preceding interval or level and based on the first representation using a second neural network. The images are reconstructed from the first and a second representation learned from the second neural network, and intermediate images of interpolated video data are generated from a corresponding optical flow of images, where the optical flow is formed by fusing together images in a time frame between images. A fake fatigued-state video of a driver is the compiled from the data in which to train an application to detect driver fatigue.

Description

SYSTEM AND METHOD OF GENERATING A VIDEO DATASET WITH VARYING FATIGUE LEVELS BY TRANSFER LEARNING

FIELD

[0001] The disclosure generally relates to detection of driver fatigue, and in particular, to generate a video dataset to train an application for use to recognize when a driver is tired.

BACKGROUND

[0001] Driver fatigue or drowsiness is increasingly becoming a frequent cause of vehicular accidents. Driver detection and monitoring of drowsiness is critical in assuring a safe driving environment not only for the drowsy driver, but also for other drivers in the vicinity that may be affected by the drowsy driver. Vehicles with the ability to monitor a driver allow for measures to be taken by the vehicle to prevent or assist in preventing accidents as a result of the driver being drowsy. For instance, warning systems can be enabled to alert the driver that she is drowsy or automatic features, such as braking and steering, may be enabled to bring the vehicle under control until such time the driver is no longer tired. However, there are few public datasets that may train an application to perform such detection and monitoring of specific drivers, where each driver has his or her own personal capability of withstanding varying levels of fatigue as well as different indicators that demonstrate varying levels of sleepiness for a specific driver. Thus, if the driver sleepiness status is determined according to a single standard, the driver detection and monitoring system may over-respond or under-respond, which may end up increasing the safety of drivers.

BRIEF SUMMARY

[0002] According to one aspect of the present disclosure, there is a computer-implemented method for training an application to recognize driver fatigue generating multiple first facial expression images from multiple second facial expression images using a first neural network, wherein the multiple first facial expression images are reconstructed from a first representation of the multiple second facial expression images learned from the first neural network; generating a first image, expressing a current level of fatigue, from a third facial expression image and a second image, expressing a level of fatigue preceding the current level of fatigue, based on the first representation using a second neural network, wherein the first and second images are reconstructed from the first representation and a second representation of the third facial expression image learned from the second neural network; generating multiple intermediate images of interpolated video data from the first and second images during a corresponding optical flow, where the optical flow is formed by fusing the first and second images and is located in a time frame between the first and second images; and compiling a fake fatigued-state video of a driver using at least the first and second images and the multiple intermediate images of the interpolated video data in which to train the application to detect the driver fatigue.

[0003] Optionally, in any of the preceding aspects, wherein the first neural network performs the steps of mapping the multiple second facial expression images to a corresponding first representation; and mapping the corresponding first representation to the multiple first facial expression images having a same expression as the multiple second facial expression images.

[0004] Optionally, in any of the preceding aspects, wherein the second neural network comprises a conditional variational auto-encoder that performs the steps of encoding the third facial expression image and the second image and outputting parameters describing a distribution for each dimension of the second representation; and decoding the distribution for each dimension of the second representation by calculating the relationship of each parameter with respect to an output loss to reconstruct the third facial expression image and the second image.

[0005] Optionally, in any of the preceding aspects, wherein the second neural network further comprises a generative adversarial network that performs the steps of comparing the reconstructed image to the third facial expression image to generate a discriminator loss; comparing the reconstructed image to a ground truth image at a same level to generate a reconstructed loss; predicting a likelihood that the reconstructed image has an appearance that corresponds to the third facial expression image based on the discriminator loss and the reconstructed loss; and outputting the reconstructed image as the first image, expressing a current level of fatigue, for input to the conditional variational auto-encoder as the second image, expressing a level of fatigue preceding the current level of fatigue, when the prediction classifies the first image as real.

[0006] Optionally, in any of the preceding aspects, wherein the reconstruction loss indicates a dissimilarity between the third facial expression image and the reconstructed image, and the discriminator loss indicates a cost of generating incorrect predictions that the reconstructed image has the appearance of the third facial expression image.

[0007] Optionally, in any of the preceding aspects, the computer- implemented method further comprising iteratively generating the first image at different levels of fatigue according to a difference between the first image and the second image at different time frames until a total value of the reconstructed loss and discriminator loss satisfy a predetermined criteria.

[0008] Optionally, in any of the preceding aspects, wherein generating the multiple intermediate images further comprises predicting an intermediate image between the first image and the second image during the corresponding optical flow; and interpolating the first image and the second image to generate the corresponding optical flow in which to generate the fake fatigued-state video of the driver. [0009] Optionally, in any of the preceding aspects, wherein generating the multiple intermediate images further comprises receiving a sequence of intermediate images arranged in an input order; processing the sequence of intermediate images using an encoder to convert the sequence of intermediate images into an alternative representation of the sequence of intermediate images; and processing the alternative representation of the sequence of intermediate images using a decoder to generate a target sequence of the sequence of intermediate images, the target sequence including multiple outputs arranged according to an output order.

[0010] Optionally, in any of the preceding aspects, wherein the first representation maps the multiple second facial expression images to the first representation through a learned distribution.

[0011] Optionally, in any of the preceding aspects, wherein the second representation maps the third facial expression image to the second representation through a learned distribution.

[0012] According to one other aspect of the present disclosure, there is provided a device for training an application to recognize driver fatigue, comprising a non-transitory memory storage comprising instructions; and one or more processors in communication with the memory, wherein the one or more processors execute the instructions to: generate multiple first facial expression images from multiple second facial expression images using a first neural network, wherein the multiple first facial expression images are reconstructed from a first representation of the multiple second facial expression images learned from the first neural network; generate a first image, expressing a current level of fatigue, from a third facial expression image and a second image, expressing a level of fatigue preceding the current level of fatigue, based on the first representation using a second neural network, wherein the first and second images are reconstructed from the first representation and a second representation of the third facial expression image learned from the second neural network; generate multiple intermediate images of interpolated video data from the first and second images during a corresponding optical flow, where the optical flow is formed by fusing the first and second images and is located in a time frame between the first and second images; and compile a fake fatigued-state video of a driver using at least the first and second images and the multiple intermediate images of the interpolated video data in which to train the application to detect the driver fatigue.

[0013] According to still one other aspect of the present disclosure, there is a non-transitory computer-readable medium storing computer instructions for training an application to recognize driver fatigue, that when executed by one or more processors, cause the one or more processors to perform the steps of generating multiple first facial expression images from multiple second facial expression images using a first neural network, wherein the multiple first facial expression images are reconstructed from a first representation of the multiple second facial expression images learned from the first neural network; generating a first image, expressing a current level of fatigue, from a third facial expression image and a second image, expressing a level of fatigue preceding the current level of fatigue, based on the first representation using a second neural network, wherein the first and second images are reconstructed from the first representation and a second representation of the third facial expression image learned from the second neural network; generating multiple intermediate images of interpolated video data from the first and second images during a corresponding optical flow, where the optical flow is formed by fusing the first and second images and is located in a time frame between the first and second images; and compiling a fake fatigued-state video of a driver using at least the first and second images and the multiple intermediate images of the interpolated video data in which to train the application to detect the driver fatigue.

[0014] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the Background. BRIEF DESCRIPTION OF THE DRAWINGS

[0015] Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying figures for which like references indicate elements.

[0016] FIG. 1 A illustrates a driver monitoring system according to an embodiment of the present technology.

[0017] FIG. 1 B illustrates a detailed example of the driver monitoring system in accordance with FIG. 1A.

[0018] FIG. 2. illustrates an example of an expression recognition network.

[0019] FIG. 3 illustrates an example facial fatigue level generator network.

[0020] FIG. 4A illustrates a video prediction and interpolation network.

[0021] FIG. 4B illustrates an example frame interpolation network in accordance with FIG. 4A.

[0022] FIG. 4C illustrates an example of the video prediction and interpolation network of FIG. 4A with an expanded view of the LSTM auto-encoder.

[0023] FIGS. 5A - 5D illustrate example flow diagrams in accordance with embodiments of the present technology.

[0024] FIG. 6 illustrates a computing system upon embodiments of the disclosure may be implemented.

DETAILED DESCRIPTION

[0025] The present disclosure will now be described with reference to the figures, which in general relate to a driver attention detection.

[0026] The technology relates to detection of driver fatigue, also known as driver drowsiness, tiredness and sleepiness, for a specific driver using an application trained from a fake fatigue-state video dataset. Traditional datasets used to train applications to detect driver fatigue are typically based on public datasets that are not specific to individual drivers. Oftentimes, this results in the application detecting driver fatigue when none exists, or failing to detect driver fatigue when it does. In embodiments, the disclosed technology generates personalized fake fatigue-state video datasets that is associated with a specific or individual driver. The datasets are generated by interpolating a sequence of images and predicting a next frame or sequence of images using various machine learning techniques and neural networks.

[0027] It is understood that the present embodiments of the disclosure may be implemented in many different forms and that claim scope should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the inventive embodiment concepts to those skilled in the art. Indeed, the disclosure is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the disclosure as defined by the appended claims. Furthermore, in the following detailed description of the present embodiments of the disclosure, numerous specific details are set forth in order to provide a thorough understanding. However, it will be clear to those of ordinary skill in the art that the present embodiments of the disclosure may be practiced without such specific details.

[0028] FIG. 1A illustrates a driver distraction system according to an embodiment of the present technology. The driver distraction system 106 is shown as being installed or otherwise included within a vehicle 101 that also includes a cabin within which a driver 102 can sit. The driver distraction system 106, or one or more portions thereof, can be implemented by an in-cabin computer system, and/or by a mobile computing device, such as, but not limited to, a smartphone, tablet computer, notebook computer, laptop computer, and/or the like.

[0029] In accordance with certain embodiments of the present technology, the driver fatigue system 106 obtains (or collects), from one or more sensors, current data for a driver 102 of a vehicle 101. In other embodiments, the driver fatigue system 106 also obtains (or collects), from one or more databases 140, additional information about the driver 102 as it relates to features of the driver, such as facial features, historical head pose and eye gaze information, etc. The driver fatigue system 106 analyzes the current data and/or the additional information for the driver 102 of the vehicle 101 to thereby identify a driver’s head pose and eye gaze. In one embodiment, the driver fatigue system 106 additionally monitors and collects vehicle data and scene information, as described below. Such analysis may be performed using one or more computer implemented neural networks and/or some other computer implemented model, as explained below.

[0030] As shown in FIG. 1A, the driver fatigue system 106 is communicatively coupled to a capture device 103, which may be used to obtain current data for the driver of the vehicle 101 along with the vehicle data and scene information. In one embodiment, the capture device 103 includes sensors and other devices that are used to obtain current data for the driver 102 of the vehicle 101. The captured data may be processed by processor(s) 104, which includes hardware and/or software to detect and track driver movement, head pose and gaze direction. As will be described in additional detail below, with reference to FIG. 1 B, the capture device may additionally include one or more cameras, microphones or other sensors to capture data. In another embodiment, the capture device 103 may capture a forward facing scene of the route (e.g., the surrounding environment and/or scene information) on which the vehicle is traveling. Forward facing sensors may include, for example, radar sensors, laser sensors, lidar sensors, optical imaging sensors, etc. It is appreciated that the sensors may also cover the sides, rear and top (upward and downward facing) of the vehicle 101.

[0031] In one embodiment, the capture device 103 can be external to the driver fatigue system 106, as shown in FIG. 1A, or can be included as part of the driver fatigue system 106, depending upon the specific implementation. Additional details of the driver fatigue system 106, according to certain embodiments of the present technology, are described below with reference to FIG. 1 B.

[0032] Still referring to FIG. 1A, the driver fatigue system 106 is also shown as being communicatively coupled to various different types of vehicle related sensors 105 that are included within the vehicle 101. Such sensors 105 can include, but are not limited to, a speedometer, a global positioning system (GPS) receiver, and a clock. The driver fatigue system 106 is also shown as being communicatively coupled to one or more communication network(s) 130 that provide access to one or more database(s) 140 and/or other types of data stores. The database(s) 140 and/or other types of data stores can store vehicle data for the vehicle 101. Examples of such data include, but are not limited to, driving record data, driving performance data, driving license type data, driver facial features, drive head pose, driver gaze, etc. Such data can be stored within a local database or other data store that is located within the vehicle 101. However, the data is likely stored in one or more database(s) 140 or other data store(s) remotely located relative to the vehicle 101. Accordingly, such database(s) 140 or other data store(s) can be communicatively coupled to the driver distraction system via one or more communication networks(s) 130.

[0033] The communication network(s) 130 can include a data network, a wireless network, a telephony network, or any combination thereof. It is contemplated that the data network may be any local area network (LAN), metropolitan area network (MAN), wide area network (WAN), a public data network (e.g., the Internet), short range wireless network, or any other suitable packet-switched network. In addition, the wireless network may be, for example, a cellular network and may employ various technologies including enhanced data rates for global evolution (EDGE), general packet radio service (GPRS), global system for mobile communications (GSM), Internet protocol multimedia subsystem (IMS), universal mobile telecommunications system (UMTS), etc., as well as any other suitable wireless medium, e.g., worldwide interoperability for microwave access (WiMAX), Long Term Evolution (LTE) networks, code division multiple access (CDMA), wideband code division multiple access (WCDMA), wireless fidelity (Wi-Fi), wireless LAN (WLAN), Bluetooth®, Internet Protocol (IP) data casting, satellite, mobile ad-hoc network (MANET), and the like, or any combination thereof. The communication network(s) 130 can provide communication capabilities between the driver distraction system 106 and the database(s) 140 and/or other data stores, for example, via communication device 102 (FIG. 1 B).

[0034] While the embodiments of FIG. 1A are described with reference to a vehicle 101 , it is appreciated that the disclosed technology may be employed in a wide range of technological areas and is not limited to vehicles. For example, in addition to vehicles, the disclosed technology could be used in virtual or augmented reality devices or in simulators in which head pose and gaze estimations, vehicle data and/or scene information may be required. [0035] Additional details of the driver fatigue system 106, according to certain embodiments of the present technology, will now be described with reference to FIG. 1 B. The driver fatigue system 106 includes a capture device 103, one or more processors 108, a vehicle system 104, a machine learning engine 109, an input/output (I/O) interface 114, a memory 116, a visual/audio alert 118, a communication device 120 and database 140 (which may also be part of the driver fatigue system).

[0036] The capture device 103 may be responsible for monitoring and identifying driver behaviors (including fatigue) based on captured driver motion and/or audio data using one or more capturing devices positioned within the cab, such as sensor 103A, camera 103B or microphone 103C. In one embodiment, the capture device 103 is positioned to capture motion of the driver's head and face, while in other implementations movement of the driver's torso, and/or driver's limbs and hands are also captured. For example, the detection and tracking 108A, head pose estimator 108B and gaze direction estimator 108C can monitor driver motion captured by capture device 103 to detect specific poses, such as head pose, or whether the person is looking in a specific direction.

[0037] Still other embodiments include capturing audio data, via microphone 103C, along with or separate from the driver movement data. The captured audio may be, for example, an audio signal of the driver 102 captured by microphone 103C. The audio can be analyzed to detect various features that may vary in dependence on the state of the driver. Examples of such audio features include driver speech, passenger speech, music, etc.

[0038] Although the capture device 103 is depicted as a single device with multiple components, it is appreciated that each component (e.g., sensor, camera, microphone, etc.) may be a separate component located in different areas of the vehicle 101. For example, the sensor 103A, the camera 103B, the microphone 103C and the depth sensor 103D may each be located in a different area of the vehicle’s cab. In another example, individual components of the capture deice 103 may be part of another component or device. For example, camera 103B and visual/audio 118 may be part of a mobile phone or tablet (not shown) placed in the vehicle’s cab, whereas sensor 103A and microphone 103C may be individually located in a different place in the vehicle’s cab.

[0039] The detection and tracking 108A monitors facial features of the driver

102 captured by the capture device 103, which may then be extracted subsequent to detecting a face of the driver. The term facial features includes, but is not limited to, points (or facial landmarks) surrounding eyes, nose, and mouth regions as well as points outlining contoured portions of the detected face of the driver 102. Based on the monitored facial features, initial locations for one or more eye features of an eyeball of the driver 102 can be detected. The eye features may include an iris and first and second eye corners of the eyeball. Thus, for example, detecting the location for each of the one or more eye features includes detecting a location of an iris, detecting a location for the first eye corner and detecting a location for a second eye corner.

[0040] The head pose estimator 108B uses the monitored facial features to estimate a head pose of the driver 102. As used herein, the term “head pose” describes an angle referring to the relative orientation of the driver's head with respect to a plane of the capture device 103. In one embodiment, the head pose includes yaw and pitch angles of the driver's head in relation to the capture device plane. In another embodiment, the head pose includes yaw, pitch and roll angles of the driver's head in relation to the capture device plane.

[0041] The gaze direction estimator 108C estimates the driver's gaze direction (and gaze angle). In operation of the gaze direction estimator 108C, the capture device 103 may capture an image or group of images (e.g., of a driver of the vehicle). The capture device 103 may transmit the image(s) to the gaze direction estimator 108C, where the gaze direction estimator 108C detects facial features from the images and tracks (e.g., over time) the gaze of the driver. One such gaze direction estimator is the eye tracking system by Smart Eye Ab®.

[0042] In another embodiment, the gaze direction estimator 108C may detect eyes from a captured image. For example, the gaze direction estimator 108C may rely on the eye center to determine gaze direction. In short, the driver may be assumed to be gazing forward relative to the orientation of his or her head. In some embodiments, the gaze direction estimator 108C provides more precise gaze tracking by detecting pupil or iris positions or using a geometric model based on the estimated head pose and the detected locations for each of the iris and the first and second eye corners. Pupil and/or iris tracking enables the gaze direction estimator 108C to detect gaze direction de-coupled from head pose. Drivers often visually scan the surrounding environment with little or no head movement (e.g., glancing to the left or right (or up or down) to better see items or objects outside of their direct line of sight). These visual scans frequently occur with regard to objects on or near the road (e.g., to view road signs, pedestrians near the road, etc.) and with regard to objects in the cabin of the vehicle (e.g., to view console readings such as speed, to operate a radio or other in-dash devices, or to view/operate personal mobile devices). In some instances, a driver may glance at some or all of these objects (e.g., out of the corner of his or her eye) with minimal head movement. By tracking the pupils and/or iris, the gaze direction estimator 108C may detect upward, downward, and sideways glances that would otherwise go undetected in a system that simply tracks head position.

[0043] In one embodiment, and based on the detected facial features, the gaze direction estimator 108C may cause the processor(s) 108 to determine a gaze direction (e.g., for a gaze of an operator at the vehicle). In some embodiments, the gaze direction estimator 108C receives a series of images (and/or video). The gaze direction estimator 108C may detect facial features in multiple images (e.g., a series or sequence of images). Accordingly, the gaze direction estimator 108C may track gaze direction over time and store such information, for example, in database 140.

[0044] The processor 108, in addition to the afore-mentioned pose and gaze detection, may also include an image corrector 108D, video enhancer 108E, video scene analyzer 108F and/or other data processing and analytics to determine scene information captured by capture device 103.

[0045] Image corrector 108D receives captured data and may undergo correction, such as video stabilization. For example, bumps on the roads may shake, blur, or distort the data. The image corrector may stabilize the images against horizontal and/or vertical shake, and/or may correct for panning, rotation, and/or zoom. [0046] Video enhancer 108E may perform additional enhancement or processing in situations where there is poor lighting or high data compression. Video processing and enhancement may include, but are not limited to. gamma correction, de-hazing, and/or de-blurring. Other video processing enhancement algorithms may operate to reduce noise in the input of low lighting video followed by contrast enhancement techniques, such but not limited to, tone-mapping, histogram stretching and equalization, and gamma correction to recover visual Information in Sow lighting videos.

[0047] The video scene analyzer 108F may recognize the content of the video coming in from the capture device 103. For example, the content of the video may include a scene or sequence of scenes from a forward facing camera 103B in the vehicle. Analysis of the video may involve a variety of techniques, including but not limited to, low-level content analysis such as feature extraction, structure analysis, object detection, and tracking, to high-level semantic analysis such as scene analysis, event detection, and video mining. For example, by recognizing the content of the incoming video signals, it may be determined if the vehicie 101 is driving along a freeway or within city limits, if there are any pedestrians, animals, or other objects/obstacles on the road, etc. By performing image processing (e.g., image correction, video enhancement, etc.) prior to or simultaneously while performing image analysis (e.g., video scene analysis, etc.), the image data may be prepared in a manner that is specific to the type of analysis being performed. For example, image correction to reduce blur may allow video scene analysis to be performed more accurately by clearing up the appearance of edge lines used for object recognition.

[0048] Vehicie system 104 may provide a signal corresponding to any status of the vehicie, the vehicle surroundings, or the output of any other information source connected to the vehicle. Vehicle data outputs may include, for example, analog signals (such as current velocity), digital signals provided by individual information sources (such as clocks, thermometers, location sensors such as Global Positioning System [GPS] sensors, etc.), digital signals propagated through vehicle data networks (such as an engine controller area network (CAN) bus through which engine related information may be communicated, a climate control CAN bus through which climate control related information may be communicated, and a multimedia data network through which multimedia data is communicated between multimedia components in the vehicle). For example, the vehicle system 104 may retrieve from the engine CAN bus the current speed of the vehicle estimated by the wheel sensors, a power state of the vehicle via a battery and/or power distribution system of the vehicle, an ignition state of the vehicle, etc.

[0049] Input/output interface(s) 114 allow information to be presented to the user and/or other components or devices using various input/output devices. Examples of input devices include a keyboard, a microphone, touch functionality (e.g., capacitive or other sensors that are configured to detect physical touch), a camera (e.g., which may employ visible or non-visible wavelengths such as infrared frequencies to recognize movement as gestures that do not involve touch), and so forth. Examples of output devices include a visual/audio alert 118, such as a display, speakers, and so forth. In one embodiment, I/O interface 114 receives the driver motion data and/or audio data of the driver 102 from the capturing device 103. The driver motion data may be related to, for example, the eyes and face of the driver 102, which may be analyzed by processor(s) 108.

[0050] Data collected by the driver fatigue system 106 may be stored in database 140, in memory 116 or any combination thereof. In one embodiment, the data collected is from one or more sources external to the vehicle 101. The stored information may be data related to driver distraction and safety, such as information captured by capture device 103. In one embodiment, the data stored in database 140 may be a collection of data collected for one or more drivers of vehicle 101. In one embodiment, the collected data is head pose data for a driver of the vehicle 101. In another embodiment, the collected data is gaze direction data for a driver of the vehicle 101. The collected data may also be used to generate datasets and information that may be used to train models for machine learning, such as machine learning engine 109.

[0051] In one embodiment, memory 116 can store instructions executable by the processor(s) 108, a machine learning engine 109, and programs or applications (not shown) that are loadable and executable by processor(s) 108. In one embodiment, machine learning engine 109 comprises executable code stored in memory 116 that is executable by processor(s) 108 and selects one or more machine learning models stored in memory 116 (or database 140). The machine models can be developed and trained using well known and conventional machine learning and deep learning techniques, such as implementation of a convolutional neural network (CNN), using for example datasets generated in accordance with embodiments found below.

[0052] FIG. 2 illustrates an example of an expression recognition network.

The expression recognition network 202 receives arbitrary facial images 201 A, which may be captured using a capture device, such as camera 103, scanner, database of images, such as database 104, and the like. The arbitrary facial images 201 A, which may be arbitrary in nature, are processed by the expression recognition network 202, with an auto-encoding style network architecture, to output facial expression images 201 B. The representation learned during auto-encoding learning will then be used to assist in forming a dataset to train machine models, such as those described above. The machine models may then be used to generate a fake fatigued-state video of drivers, which may be used in conjunction with personalized data to train an application to detect driver fatigue (i.e. , drowsiness, sleepiness or tiredness) of specific drivers.

[0053] In performing the auto-encoding learning, the arbitrary facial images

201 A input into the expression recognition network 202 are classified into categories or classes. In one embodiment, the input images arbitrary facial 201 A are arbitrary facial expressions, such as anger, fear or neutral images, and the output facial expression images 201 B are facial expression or emotion images that have been classified into the categories or classes, such as disgust, sadness, joy or surprise.

[0054] In one example, the expression recognition network 202 generates facial expression images 201 B from input arbitrary facial images 201 A using a neural network, such as an auto-encoder (AE) or a conditional variational auto-encoder (CVAE). The expression recognition network 202 effectively aims to learn a latent or learned representation (or code), i.e., learned representation z_g, which generates an output expression 201 B from the arbitrary facial images 201 A. For example, an arbitrary image of fear may generate a facial expression image of surprise using the learned representation z_g. [0055] Learning occurs in layers (e.g., encoder and decoder layers) attached to the learned representation z. For example, the input arbitrary facial image 201 A is input into a first layer (e.g., encoder 204). The learned representation z_g compresses (reduces) the size of the input arbitrary facial images 201 A. Reconstruction of the input arbitrary facial images 201 A occurs in a second layer (e.g., decoder 206), which outputs the facial expression images 201 B that correspond to the input arbitrary facial image 201 A. More specifically, the expression recognition network 202 is trained to encode the input arbitrary facial images 201 A into a learned representation z_g, such that the input arbitrary facial images 201 A can be reconstructed from the learned presentation z_g. In one embodiment, the encoder 204 creates the learned representation z_g according to z=o(Wx+b), where W is an encoding weight, ‘b’ is a bias vector, ‘o’ is a logistic function (such as a sigmoid function or a rectified linear unit), and‘x’ is the input arbitrary facial images 201 A. The expression recognition network 202 also contains decoder 206 that reconstructs the input arbitrary facial images 201 A according to x’= o’(Wz+b’), where“W” is a decoding weight, “b”’ is a bias vector, “o’” is a logistic function, and“x”’ is the output facial expression images 201 B. The learning consists of minimizing a reconstruction error with respect to encoding and decoding such that

[0056] The learned representation z_g may then be used in training additional machine models, as explained below with reference to FIG. 3.

[0057] FIG. 3 illustrates a facial fatigue level generator network. The facial fatigue level generator network 302 includes a CVAE 304 and a generative adversarial network (GAN) 306. The facial fatigue level generator network 302 receives content, such as a sequence of images or video, that is processed to identify whether the input content is“real” or“fake” content.

[0058] In one embodiment, the CVAE 304 is coupled to receive the content that is processed to output a reconstructed version of the content. In particular, the CVAE 304 receives a flow F_t-1®t of facial expression images, where the flow F includes a frame of images from the (/-1 )th to the /^th frame of images. The flow F_£.1®£ of facial expression images includes facial expression images from

different levels Lo, Li-i to Li, respectively, where represents an identification (ID) of

the specific individual with a natural or neutral facial expression (e.g., the specific individual facial expression shown in a normal or plain state of expression) and

and represent a facial expression image from a calculated at a preceding and current level (i.e. , calculated during a preceding or current iteration). In one embodiment, the facial expression image is a facial fatigue image.

[0059] As illustrated, the CVAE 304 includes an encoder 304A and a decoder (/generator) 306A. The encoder 304A receives the flow F_t.1®i of facial expression images at the different levels Lo, Li-i to Li, and maps each of the facial expression images to a learned representation z; through a learned distribution P(z|x,c), where“c” is the category or class of the data and“x” is the image z = z, + z_g. That is, the flow of facial expression images are transformed into the learned representation z; (e.g., a feature vector), which may be thought of as a compressed representation of the input to the encoder 304A. In one embodiment, the encoder 304A is a convolutional neural network (CNN).

[0060] The decoder 306A serves to invert the output of the encoder 304A using the learned representation z; concatenated with the learned representation z_g (FIG. 2), as shown. The concatenated learned representation (z,+z_g) is then used to generate a reconstructed version of the input from the encoder 304A. This reconstruction of the input is referred to as reconstructed image at level LM.

The reconstructed image represents a facial expression image showing

different levels of fatigue (e.g., drowsiness, sleepiness, tiredness, etc.) for each frame in the flow of facial expressions during each iteration.

[0061] The GAN 306 includes a generator (/decoder) 306A and discriminator

306B. In one embodiment, the GAN 306 is a CNN. The generator 306A receives the concatenated learned representation (z,+z_g) as input and outputs the reconstructed image as explained above. The GAN 306 also includes a discriminator 306B. The discriminator 306B is coupled to receive the original content and the reconstructed content (e.g., ) from the generator 306A and

learns to distinguish between“real” and“fake” samples of the content (i.e. , predict whether the reconstructed content is real or fake). This may be accomplished by training the discriminator 306B to reduce a discriminator loss LOSSGD, which indicates the cost of generating an incorrect prediction by the discriminator responsive to receiving the original content and the reconstructed content generated by CVAE 304. In this manner, parameters of the discriminator 306B are configured to discriminate between training and reconstructed versions of content based on the differences between the two versions that arise during the encoding process. For example, the discriminator 306B receives as input reconstructed image (or natural facial

expression image if at an initial level L₀) and ground truth image at level L_i+1.

To predict whether the reconstructed image s real or fake, the discriminator

306B is trained to minimize or reduce a loss function of the GAN 306. The minimized loss function of the GAN 306 is defined as minGAN Loss = LOSSGD + LOSSEP, where LOSSGD represents a discriminator loss and LOSSEP represents a reconstruction loss. LOSSGD may be calculated using the function:

L and LOSSEP may be calculated using the function: where D() is the discriminator, G() is the generator,

E[] is the expectation and z is the learned representation (or code).

[0062] The facial fatigue generation network 302 predicts whether the reconstructed images is real or fake using the loss functions, where a value of the minimized loss function for real images should be lower than the minimized loss function for fake images. In one embodiment, original content is assigned a label of 1 (real), while reconstructed content determined to be fake (not real) is assigned a label of 0 (fake). In this case, the discriminator 306B may predict the input content to be a reconstructed (i.e., fake) version when the corresponding discrimination prediction is below a threshold value, and may be predict the input content to be real (real image ) when the corresponding prediction is above a threshold value. In

a next iteration, the image is replaced with the real image and the image

is replaced with image In this way, the discriminator 306B outputs a

discrimination prediction for each input content and reconstructed content indicating whether the input content is the original or the reconstructed version.

[0063] It is appreciated that the generator and/or the discriminator can include various types of machine-learned models. Machine-learned models can include linear models and non-linear models. As examples, machine-learned models can include regression models, support vector machines, decision tree-based models, Bayesian models, and/or neural networks (e.g., deep neural networks). Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks or other forms of neural networks. Thus, although the generator and discriminator are sometimes referred to as “networks”, they are not necessarily limited to being neural networks but can also include other forms of machine-learned models.

[0064] FIG. 4A illustrates a video prediction and interpolation network. The video prediction and interpolation network 402 includes a frame interpolation network 402A and a long short term memory (LSTM) auto-encoder network 402B. The LSTM effectively preserves motion trends (patterns) and transfers the motion trends to predicted frames, while the interpolation network generates intermediate images from broader frames. Thus, frames may be interpolated while the motion trend is maintained.

[0065] The frame interpolation network 402A is used to generate new frames from original frames of video content. In doing so, the network predicts one or more intermediate images at timesteps (or timestamps) defined between two consecutive frames. A first neural network 410 approximates optical flow data defining motion between the two consecutive frames. A second neural network 412 refines the optical flow data and predicts visibility maps for each timestep. The two consecutive frames are warped according to the refined optical flow data for each timestep to produce pairs of warped frames for each timestep. The second neural network then fuses the pair of warped frames based on the visibility maps to produce the intermediate images for each timestep. Artifacts caused by motion boundaries and occlusions are reduced in the predicted intermediate images.

[0066] One example of frame interpolation is disclosed in the illustrated embodiment. When the frame interpolation network 402A is provided with two input images, such as images at a time t Î (t-1,t), an intermediate (or

interpolated) image can be predicted. In order to perform the interpolation, a bi-

directional optical flow between the two input images is first computed. In one embodiment, a CNN may be leveraged to compute the optical flow. For example, a CNN can be trained, using the two input images to jointly predict the

forward optical flow F_t+1®t and the backward optical flow F_t®t-1 between the two input images (the optical flow between frames). Similarly, the frame interpolation network 402A may receive images where the forward optical flow F_t+1®t and the

backward optical flow F_t+1®t are jointly predicted between the input images. The frame interpolation network, as described in more detail below with reference to FIG. 4B, processes the input images and outputs intermediate images

[0067] One example of an interpolation network is described in “Super

SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpretation,” Jiang et al. , Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[0068] The LSTM auto-encoder network 402B learns representations of image sequences. In one embodiment, the LSTM auto-encoder network 402B uses recurrent neural nets (RNNs) made of LSTM units or memory blocks to perform learning. For example, a first RNN is an encoder that maps an input image sequence 404 (e.g., a sequence of image frames) into a fixed length representation, which is then decoded using a second RNN, such as a decoder. The input image sequence 404, such as images is processed by the encoder in the LSTM auto-encoder network 402B to generate the representation for the input image sequence 404.

[0069] After the last input has been read, the learned representation generated using the input sequence is then processed using the decoder in the LSTM auto-encoder network 402B. The decoder outputs a prediction for a generated target sequence for the input sequence. The target sequence is the same as the input sequence in reverse order. In one embodiment, the decoder in the LSTM auto- encoder network 402B includes one or more LSTM layers and is configured to receive a current output in the target sequence so as to generate a respective output score. The output score for a given output represents the likelihood that the output is the next output in the target sequence, i.e., predicts whether the output represents the next output in the target sequence. As part of generating the output scores, the decoder also updates the hidden state of the network to generate an updated hidden state. A further explanation of the LSTM auto-encoder network 402B is found below with reference to FIG. 4C.

[0070] An example LSTM neural network is described in “Unsupervised

Learning of Video Representations using LSTMs,” Srivastava et al. , University of Toronto, January 2016.

[0071] FIG. 4B illustrates an example frame interpolation network in accordance with FIG. 4A. The frame interpolation network 402A includes encoder 410 and decoder 412 that fuses warped input images to generate the intermediate image More specifically, the two input images are warped to the

specific timestep t and the two warped images are adaptively fused to generate the intermediate image In one embodiment, a flow computation CNN is used to estimate the bi-direction optical flow between the two input images and a flow interpolation CNN is used to refine the flow approximations and predict visibility maps. The visibility maps may then be applied to the two warped images prior to fusing, such that artifacts in the interpolated intermediate image are reduced. In one embodiment, the flow computation CNN and the flow interpolation CNN is a U-Net architecture as described in“U-net: Convolutional networks for biomedical image segmentation,” MIC-CAI, 2015. [0072] In the illustrated embodiment, and for each of the input images and the frame interpolation network 402A includes a flow interpolation network,

which in one embodiment is implemented as an encoder 410, an intermediate optical flow network, which in one embodiment is implemented as a decoder 412, and optical flow warping units The encoder 410 receives a sequential image pair at timestamps (t, t+1). Bi-directional optical flows are computed based on the sequential image pair The bi-

directional optical flows are linearly combined to approximate intermediate bi- directional optical flows for at least one timestep t between the two input images in the sequential image pair. Each of the input images are warped (backward) according to the approximated intermediate bi-directional optical flows for each timestep to produce warped input frames l_i®t and l_i+1®t.

[0073] In one embodiment, the decoder 412 includes a flow refinement network (not shown) that corresponds to each warping unit and an image predictor (not shown) to predict the intermediate image l_tn at a time t e (i,i+1 ). The intermediate bi-directional optical flows are refined for each timestep

using the two input frames, the intermediate bi-directional optical flows, and the two warped input images. The refined intermediate bi-directional optical flows (F_{t i,}, F_{t i+1}) are output and processed by an image prediction unit to produce the intermediate image In an embodiment, the image prediction unit receives the warped input frames l_{t i} and I_{t i+1} generated by the optical flow warping units, and the warped input frames are linearly fused by the image prediction unit to produce the intermediate image for each timestep.

[0074] As noted above, in one embodiment, visibility maps are applied to the two warped images. To account for occlusion, a flow refinement network in the decoder 412 predicts visibility maps V_{t i} and V_{t i+1} for each timestep. Since visibility maps are used, when a pixel is visible in both the decoder 412 learns to

adaptively combine the information from the two images. In one embodiment, the visibility maps are applied to the warped images before the warped images are linearly fused so as to produce the intermediate image for each timestep.

[0075] The intermediate images are synthesized according to:

s a normalization factor, and the interpolation

loss function is defined according to:

where m(·) is a backward warping function are intermediate

optical flow between the two input images are

two visibility maps denoting whether the pixel remains visible, as described in“U-net: Convolutional networks for biomedical image segmentation.”

[0076] By applying the visibility maps to the warped images before fusion, the contribution of any occluded pixels is excluded from the interpolated intermediate image thereby avoiding or reducing artifacts.

[0077] FIG. 4C illustrates an example of the video prediction and interpolation network of FIG. 4A with an expanded view of the LSTM auto-encoder. The basic building block of the LSTM auto-encoder 402B is a LSTM memory cell, represented by RNN. Each LSTM memory cell has a state at time t. For purposes of accessing a memory cell for reading or modification, each LSTM memory block may include one or more cells. Each cell includes an input gate, a forget gate, and an output gate that allow the cell to store previous activations generated by the cell, e.g., as a hidden state for use in generating a current activation or to be provided to other components of the LSTM auto-encoder 402B. As noted above, these LSTM memory blocks form the RNNs in which to perform learning.

[0078] In the illustrated embodiment, encoder 403 consists of multilayered

RNNs, where the arrows show the direction of information flow. Each of the encoders 403 (in each layer) receives a single element of the input image sequence along with a corresponding intermediate image generated by the frame interpolation network 402A - 402N, respectively. The input sequence which is a collection of images collected by the

sensors or from a database, is processed and the current hidden state is updated, i.e. , to modify the current hidden state that has been generated by processing previous inputs from the input sequence by processing the current received input. A respective weight wi may then be applied to the previously hidden state and the input vector.

[0079] The learned representation of the input sequence is then processed using decoder 405 to generate the target sequence for the input sequence. The decoder 405 also includes multilayered RNNs, where the arrows show the direction of information flow, which predicts an output at timestep t. Each RNN accepts a hidden state from the previous element and produces and outputs its own hidden state. In one embodiment, the outputs are calculated using the hidden state at the current timestep together with a respective weight w_å. The final output may be determined using a probability vector using, for example, Softmax or some other known function.

[0080] FIGS. 5A - 5D illustrate example flow diagrams in accordance with embodiments of the present technology. In embodiments, the flow diagrams may be computer-implemented methods performed, at least partly, by hardware and/or software components illustrated in the various figures and as described herein. In one embodiment, the disclosed process may be performed by the driver fatigue system 106 disclosed in FIGS. 1A and 1 B. In one embodiment, software components executed by one or more processors, such as processor(s) 108 or processor 802, perform at least a portion of the process.

[0081] FIG. 5A illustrates a flow diagram of compilation of a fake fatigued- state video dataset. The dataset, which is generated using transfer learning techniques, may be used to train an application to recognize driver fatigue. At step 502, a neural network, such as an AE or CVAE, generates facial expression images learned from arbitrary facial images. In one embodiment, the facial expression images are reconstructed from a learned representation of the arbitrary facial images learned from the neural network. The learned representation may then be applied during a training stage of a fatigue level generator network.

[0082] At step 504, another neural network is trained using a neutral, natural or normal facial image in which little to no expression is visible, and a flow of images expressing varying levels of fatigue. Using the neutral facial image and flow of images, an image expressing a current level of fatigue is generated. As the process iterates, the flow of images changes such that the current image becomes the preceding image and a new current image is generated. At each iteration, an image expressing a current level of fatigue is generated from the neutral facial image and the image expressing a level of fatigue preceding the current level of fatigue based on the learned representation. The image expressing a current level of fatigue (or the reconstructed image) is reconstructed from the representation learned in step 502 and a second representation of the neutral facial image learned from the neural network. The reconstructed image may then be compared to a ground truth model to determine whether the reconstructed image is real or fake, as discussed below.

[0083] Intermediate images of interpolated video data (sequential image data) from the reconstructed facial expression and arbitrary facial images during a corresponding optical flow are generated at step 506. In one embodiment, the optical flow is formed by fusing the reconstructed facial expression and arbitrary facial images and is located in a time frame between the reconstructed facial expression and arbitrary facial images.

[0084] At step 508, a fake fatigued-state video (i.e. , the dataset) of a driver is compiled using at least the reconstructed facial expression and arbitrary facial images and the intermediate images of the interpolated video data in which to train the application to detect driver fatigue.

[0085] Turning to FIG. 5B, an example flow diagram of a neural network, such as CVAE 304, as illustrated. In the depicted embodiment, the CVAE includes an encoder and a decoder. At step 510, the facial expression image, including the level, and the neutral facial image are encoded and parameters describing a distribution for each dimension of the learned representation are output. The distribution for each dimension of the learned representation is decoded and the relationship of each parameter with respect to an output loss is calculated to reconstruct the neutral facial image and the facial expression image as a reconstructed image, at step 512. At step 514, the reconstructed image is compared to the neutral image to generate a discriminator loss, and the reconstructed image is compared to a ground truth at a same level to generate a reconstructed loss, at step 516. Based on the discriminator loss and the reconstructed loss, a predication is made as to the likelihood that the reconstructed image has an appearance that corresponds to the neutral image, at step 518. The reconstructed image is output as the real image and backward propagated to the input of the CVAE for the next iteration at step 520.

[0086] In FIG. 5C, the neural network in step 502 maps the arbitrary facial images to a corresponding learned representation at 522, and the learned representation is mapped to the facial expression images with a same shape or same image size as the arbitrary facial images (e.g., the reconstructed image has the same number of columns and rows as the arbitrary image), at step 524.

[0087] With continued reference to FIG. 5D, generation of the intermediate images is described with reference to GAN 306. At step 526, an intermediate image between the facial expression image and the arbitrary facial image is predicted during the corresponding optical flow. The images are interpolated to generate the corresponding optical flow in which to generate the fake fatigued-state video of the driver, at step 528. At step 530, a sequence of intermediate images are arranged in an input order, and the sequence of intermediate images are processed using an encoder to convert the sequence of intermediate images into an alternative representation, at step 532. Finally, at step 534, the alternative representation of the sequence of intermediate images is processed using a decoder to generate a target sequence of the sequence of intermediate images, where the target sequence includes multiple outputs arranged according to an output order.

[0088] FIG. 6 illustrates a computing system upon embodiments of the disclosure may be implemented. Computing system 600 may be programmed (e.g., via computer program code or instructions) to provide enhanced safety to drivers using driver fatigue (tiredness) detection as described herein and includes a communication mechanism such as a bus 610 for passing information between other internal and external components of the computer system 600. In one embodiment, the computer system 600 is system 106 of FIG. 1 B. Computer system 600, or a portion thereof, constitutes a means for performing one or more steps for providing enhanced safety to drivers using the driver distraction (including driver fatigue) detection.

[0089] A bus 610 includes one or more parallel conductors of information so that information is transferred quickly among devices coupled to the bus 610. One or more processors 602 for processing information are coupled with the bus 610.

[0090] One or more processors 602 performs a set of operations on information (or data) as specified by computer program code related to for provide enhanced safety to drivers using driver distraction detection. The computer program code is a set of instructions or statements providing instructions for the operation of the processor and/or the computer system to perform specified functions. The code, for example, may be written in a computer programming language that is compiled into a native instruction set of the processor. The code may also be written directly using the native instruction set (e.g., machine language). The set of operations include bringing information in from the bus 610 and placing information on the bus 610. Each operation of the set of operations that can be performed by the processor is represented to the processor by information called instructions, such as an operation code of one or more digits. A sequence of operations to be executed by the processor 602, such as a sequence of operation codes, constitute processor instructions, also called computer system instructions or, simply, computer instructions.

[0091] Computer system 600 also includes a memory 604 coupled to bus 610. The memory 604, such as a random access memory (RAM) or any other dynamic storage device, stores information including processor instructions for providing enhanced safety to drivers using driver distraction detection. Dynamic memory allows information stored therein to be changed by the computer system 600. RAM allows a unit of information stored at a location called a memory address to be stored and retrieved independently of information at neighboring addresses. The memory 604 is also used by the processor 602 to store temporary values during execution of processor instructions. The computer system 600 also includes a read only memory (ROM) 606 or any other static storage device coupled to the bus 610 for storing static information. Also coupled to bus 610 is a non-volatile (persistent) storage device 608, such as a magnetic disk, optical disk or flash card, for storing information, including instructions.

[0092] In one embodiment, information, including instructions for providing enhanced safety to tired drivers using information processed by the aforementioned system and embodiments, is provided to the bus 610 for use by the processor from an external input device 612, such as a keyboard operated by a human user, a microphone, an Infrared (IR) remote control, a joystick, a game pad, a stylus pen, a touch screen, head mounted display or a sensor. A sensor detects conditions in its vicinity and transforms those detections into physical expression compatible with the measurable phenomenon used to represent information in computer system 600. Other external devices coupled to bus 610, used primarily for interacting with humans, include a display device 614 for presenting text or images, and a pointing device 616, such as a mouse, a trackball, cursor direction keys, or a motion sensor, for controlling a position of a small cursor image presented on the display 614 and issuing commands associated with graphical elements presented on the display 614, and one or more camera sensors 684 for capturing, recording and causing to store one or more still and/or moving images (e.g., videos, movies, etc.) which also may comprise audio recordings.

[0093] In the illustrated embodiment, special purpose hardware, such as an application specific integrated circuit (ASIC) 620, is coupled to bus 610. The special purpose hardware is configured to perform operations not performed by processor 602 quickly enough for special purposes.

[0094] Computer system 600 also includes a communications interface 670 coupled to bus 610. Communication interface 670 provides a one-way or two-way communication coupling to a variety of external devices that operate with their own processors. In general the coupling is with a network link 678 that is connected to a local network 680 to which a variety of external devices, such as a server or database, may be connected. Alternatively, link 678 may connect directly to an Internet service provider (ISP) 684 or to network 690, such as the Internet. The network link 678 may be wired or wireless. For example, communication interface 670 may be a parallel port or a serial port or a universal serial bus (USB) port on a personal computer. In some embodiments, communications interface 670 is an integrated services digital network (ISDN) card or a digital subscriber line (DSL) card or a telephone modem that provides an information communication connection to a corresponding type of telephone line. In some embodiments, a communication interface 670 is a cable modem that converts signals on bus 610 into signals for a communication connection over a coaxial cable or into optical signals for a communication connection over a fiber optic cable. As another example, communications interface 670 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN, such as Ethernet. Wireless links may also be implemented. For wireless links, the communications interface 670 sends and/or receives electrical, acoustic or electromagnetic signals, including infrared and optical signals, which carry information streams, such as digital data. For example, in wireless handheld devices, such as mobile telephones like cell phones, the communications interface 670 includes a radio band electromagnetic transmitter and receiver called a radio transceiver. In certain embodiments, the communications interface 670 enables connection to a communication network for providing enhanced safety to tired drivers using mobile devices, such as mobile phones or tablets.

[0095] Network link 678 typically provides information using transmission media through one or more networks to other devices that use or process the information. For example, network link 678 may provide a connection through local network 680 to a host computer 682 or to equipment 684 operated by an ISP. ISP equipment 684 in turn provide data communication services through the public, world-wide packet- switching communication network of networks now commonly referred to as the Internet 690.

[0096] A computer called a server host 682 connected to the Internet hosts a process that provides a service in response to information received over the Internet. For example, server host 682 hosts a process that provides information representing video data for presentation at display 614. It is contemplated that the components of system 600 can be deployed in various configurations within other computer systems, e.g., host 682 and server 682. [0097] At least some embodiments of the disclosure are related to the use of computer system 600 for implementing some or all of the techniques described herein. According to one embodiment of the disclosure, those techniques are performed by computer system 600 in response to processor 602 executing one or more sequences of one or more processor instructions contained in memory 604. Such instructions, also called computer instructions, software and program code, may be read into memory 604 from another computer-readable medium such as storage device 608 or network link 678. Execution of the sequences of instructions contained in memory 604 causes processor 602 to perform one or more of the method steps described herein.

[0098] It is understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete and will fully convey the disclosure to those skilled in the art. Indeed, the subject matter is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims. Furthermore, in the following detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. However, it will be clear to those of ordinary skill in the art that the present subject matter may be practiced without such specific details.

[0099] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

[00100] The computer-readable non-transitory media includes all types of computer readable media, including magnetic storage media, optical storage media, and solid state storage media and specifically excludes signals. It should be understood that the software can be installed in and sold with the device. Alternatively the software can be obtained and loaded into the device, including obtaining the software via a disc medium or from any manner of network or distribution system, including, for example, from a server owned by the software creator or from a server not owned but used by the software creator. The software can be stored on a server for distribution over the Internet, for example.

[00101] Computer-readable storage media (medium) exclude (excludes) propagated signals per se, can be accessed by a computer and/or processor(s), and include volatile and non-volatile internal and/or external media that is removable and/or non-removable. For the computer, the various types of storage media accommodate the storage of data in any suitable digital format. It should be appreciated by those skilled in the art that other types of computer readable medium can be employed such as zip drives, solid state drives, magnetic tape, flash memory cards, flash drives, cartridges, and the like, for storing computer executable instructions for performing the novel methods (acts) of the disclosed architecture.

[00102] The terminology used herein is for the purpose of describing particular aspects only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

[00103] The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.

[00104] For purposes of this document, each process associated with the disclosed technology may be performed continuously and by one or more computing devices. Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.

[00105] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

CLAIMS What is claimed is:

1. A computer-implemented method for training an application to recognize driver fatigue:

generating multiple first facial expression images from multiple second facial expression images using a first neural network, wherein the multiple first facial expression images are reconstructed from a first representation of the multiple second facial expression images learned from the first neural network;

generating a first image, expressing a current level of fatigue, from a third facial expression image and a second image, expressing a level of fatigue preceding the current level of fatigue, based on the first representation using a second neural network, wherein the first and second images are reconstructed from the first representation and a second representation of the third facial expression image learned from the second neural network;

generating multiple intermediate images of interpolated video data from the first and second images during a corresponding optical flow, where the optical flow is formed by fusing the first and second images and is located in a time frame between the first and second images; and

compiling a fake fatigued-state video of a driver using at least the first and second images and the multiple intermediate images of the interpolated video data in which to train the application to detect the driver fatigue.

2. The computer-implemented method of claim 1 , wherein the first neural network performs the steps of:

mapping the multiple second facial expression images to a corresponding first representation; and

mapping the corresponding first representation to the multiple first facial expression images having a same shape as the multiple second facial expression images.

3. The computer-implemented method of claim 1 , wherein the second neural network comprises a conditional variational auto-encoder that performs the steps of:

encoding the third facial expression image and the second image and outputting parameters describing a distribution for each dimension of the second representation; and

decoding the distribution for each dimension of the second representation by calculating the relationship of each parameter with respect to an output loss to reconstruct the third facial expression image and the second image.

4. The computer-implemented method of any of claims 1 -3, where the second neural network further comprises a generative adversarial network that performs the steps of:

comparing the reconstructed image to the third facial expression image to generate a discriminator loss;

comparing the reconstructed image to a ground truth image at a same level to generate a reconstructed loss;

predicting a likelihood that the reconstructed image has an appearance that corresponds to the third facial expression image based on the discriminator loss and the reconstructed loss; and

outputting the reconstructed image as the first image, expressing a current level of fatigue, for input to the conditional variational auto-encoder as the second image, expressing a level of fatigue preceding the current level of fatigue, when the prediction classifies the first image as real.

5. The computer-implemented method of any of claims 1 -4, wherein the reconstruction loss indicates a dissimilarity between the third facial expression image and the reconstructed image, and the discriminator loss indicates a cost of generating incorrect predictions that the reconstructed image has the appearance of the third facial expression image.

6. The computer-implemented method of any of claims 1-4, further comprising iteratively generating the first image at different levels of fatigue according to a difference between the first image and the second image at different time frames until a total value of the reconstructed loss and discriminator loss satisfy a predetermined criteria.

7. The computer-implemented method of claim 1 , wherein generating the multiple intermediate images further comprises:

predicting an intermediate image between the first image and the second image during the corresponding optical flow; and

interpolating the first image and the second image to generate the corresponding optical flow in which to generate the fake fatigued-state video of the driver.

8. The computer-implemented method of claim 1 , wherein generating the multiple intermediate images further comprises:

receiving a sequence of intermediate images arranged in an input order;

processing the sequence of intermediate images using an encoder to convert the sequence of intermediate images into an alternative representation of the sequence of intermediate images; and

processing the alternative representation of the sequence of intermediate images using a decoder to generate a target sequence of the sequence of intermediate images, the target sequence including multiple outputs arranged according to an output order.

9. The computer-implemented method of claim 1 , wherein the first representation maps the multiple second facial expression images to the first representation through a learned distribution.

10. The computer-implemented method of claim 1 , wherein the second representation maps the third facial expression image to the second representation through a learned distribution.

11. A device for training an application to recognize driver fatigue, comprising:

a non-transitory memory storage comprising instructions; and

one or more processors in communication with the memory, wherein the one or more processors execute the instructions to:

generate multiple first facial expression images from multiple second facial expression images using a first neural network, wherein the multiple first facial expression images are reconstructed from a first representation of the multiple second facial expression images learned from the first neural network;

generate a first image, expressing a current level of fatigue, from a third facial expression image and a second image, expressing a level of fatigue preceding the current level of fatigue, based on the first representation using a second neural network, wherein the first and second images are reconstructed from the first representation and a second representation of the third facial expression image learned from the second neural network;

generate multiple intermediate images of interpolated video data from the first and second images during a corresponding optical flow, where the optical flow is formed by fusing the first and second images and is located in a time frame between the first and second images; and

compile a fake fatigued-state video of a driver using at least the first and second images and the multiple intermediate images of the interpolated video data in which to train the application to detect the driver fatigue.

12. The device of claim 11 , wherein the first neural network performs the steps of:

mapping the corresponding first representation to the multiple first facial expression images having a same expression as the multiple second facial expression images.

13. The device of claim 11 , wherein the second neural network comprises a conditional variational auto-encoder that performs the steps of:

14. The device of any of claims 11 -13, where the second neural network further comprises a generative adversarial network that performs the steps of:

15. The device of any of claims 11 -14, wherein the reconstruction loss indicates a dissimilarity between the third facial expression image and the reconstructed image, and the discriminator loss indicates a cost of generating incorrect predictions that the reconstructed image has the appearance of the third facial expression image.

16. The device of any of claims 11 -14, wherein the one or more processors execute the instructions to iteratively generate the first image at different levels of fatigue according to a difference between the first image and the second image at different time frames until a total value of the reconstructed loss and discriminator loss satisfy a predetermined criteria.

17. The device of claim 11 , wherein generating the multiple intermediate images further comprises:

18. The device of claim 11 , wherein generating the multiple intermediate images further comprises:

receiving a sequence of intermediate images arranged in an input order;

19. The device of claim 11 , wherein the first representation maps the multiple second facial expression images to the first representation through a learned distribution.

20. The device of claim 11 , wherein the second representation maps the third facial expression image to the second representation through a learned distribution.

21. A non-transitory computer-readable medium storing computer instructions for training an application to recognize driver fatigue, that when executed by one or more processors, cause the one or more processors to perform the steps of:

22. The non-transitory computer-readable medium of claim 21 , wherein the first neural network performs the steps of:

23. The non-transitory computer-readable medium of claim 21 , wherein the second neural network comprises a conditional variational auto-encoder that performs the steps of: encoding the third facial expression image and the second image and outputting parameters describing a distribution for each dimension of the second representation; and

24. The non-transitory computer-readable medium of any of claims 21 -23, where the second neural network further comprises a generative adversarial network that performs the steps of:

25. The non-transitory computer-readable medium of any of claims 21 -24, wherein the reconstruction loss indicates a dissimilarity between the third facial expression image and the reconstructed image, and the discriminator loss indicates a cost of generating incorrect predictions that the reconstructed image has the appearance of the third facial expression image.

26. The non-transitory computer-readable medium of any of claims 21 -24, further comprising iteratively generating the first image at different levels of fatigue according to a difference between the first image and the second image at different time frames until a total value of the reconstructed loss and discriminator loss satisfy a predetermined criteria.

27. The non-transitory computer-readable medium of any of claims 21 , wherein generating the multiple intermediate images further comprises:

28. The non-transitory computer-readable medium of any of claims 21 , wherein generating the multiple intermediate images further comprises:

receiving a sequence of intermediate images arranged in an input order;

29. The non-transitory computer-readable medium of any of claims 21 , wherein the first representation maps the multiple second facial expression images to the first representation through a learned distribution.

30. The non-transitory computer-readable medium of any of claims 21 , wherein the second representation maps the third facial expression image to the second representation through a learned distribution.