EP4042318A1 - System and method for generating a video dataset with varying fatigue levels through transfer learning - Google Patents

System and method for generating a video dataset with varying fatigue levels through transfer learning

Info

Publication number
EP4042318A1
Authority
EP
European Patent Office
Prior art keywords
image
images
facial expression
representation
reconstructed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP19828454.9A
Other languages
English (en)
French (fr)
Inventor
Chengcheng JIA
Lei Yang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 - Facial expression recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/59 - Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 - Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation
    • G06V40/167 - Detection; Localisation; Normalisation using comparisons between temporally consecutive images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation

Definitions

  • the disclosure generally relates to detection of driver fatigue and, in particular, to generating a video dataset used to train an application to recognize when a driver is tired.
  • Driver fatigue or drowsiness is increasingly becoming a frequent cause of vehicular accidents.
  • Driver detection and monitoring of drowsiness is critical in assuring a safe driving environment not only for the drowsy driver, but also for other drivers in the vicinity that may be affected by the drowsy driver.
  • Vehicles with the ability to monitor a driver allow for measures to be taken by the vehicle to prevent or assist in preventing accidents as a result of the driver being drowsy. For instance, warning systems can be enabled to alert the driver that she is drowsy, or automatic features, such as braking and steering, may be enabled to bring the vehicle under control until such time as the driver is no longer tired.
  • a driver detection and monitoring system may over-respond or under-respond, which may end up decreasing the safety of drivers.
  • a computer-implemented method for training an application to recognize driver fatigue includes generating multiple first facial expression images from multiple second facial expression images using a first neural network, wherein the multiple first facial expression images are reconstructed from a first representation of the multiple second facial expression images learned from the first neural network; generating a first image, expressing a current level of fatigue, from a third facial expression image and a second image, expressing a level of fatigue preceding the current level of fatigue, based on the first representation using a second neural network, wherein the first and second images are reconstructed from the first representation and a second representation of the third facial expression image learned from the second neural network; generating multiple intermediate images of interpolated video data from the first and second images during a corresponding optical flow, where the optical flow is formed by fusing the first and second images and is located in a time frame between the first and second images; and compiling a fake fatigued-state video of a driver using at least the first and second images and the multiple intermediate images of the interpolated video data in which to train the application to detect driver fatigue.
  • the first neural network performs the steps of mapping the multiple second facial expression images to a corresponding first representation; and mapping the corresponding first representation to the multiple first facial expression images having a same expression as the multiple second facial expression images.
  • the second neural network comprises a conditional variational auto-encoder that performs the steps of encoding the third facial expression image and the second image and outputting parameters describing a distribution for each dimension of the second representation; and decoding the distribution for each dimension of the second representation by calculating the relationship of each parameter with respect to an output loss to reconstruct the third facial expression image and the second image.
  • the second neural network further comprises a generative adversarial network that performs the steps of comparing the reconstructed image to the third facial expression image to generate a discriminator loss; comparing the reconstructed image to a ground truth image at a same level to generate a reconstructed loss; predicting a likelihood that the reconstructed image has an appearance that corresponds to the third facial expression image based on the discriminator loss and the reconstructed loss; and outputting the reconstructed image as the first image, expressing a current level of fatigue, for input to the conditional variational auto-encoder as the second image, expressing a level of fatigue preceding the current level of fatigue, when the prediction classifies the first image as real.
  • the reconstruction loss indicates a dissimilarity between the third facial expression image and the reconstructed image
  • the discriminator loss indicates a cost of generating incorrect predictions that the reconstructed image has the appearance of the third facial expression image
  • the computer-implemented method further comprising iteratively generating the first image at different levels of fatigue according to a difference between the first image and the second image at different time frames until a total value of the reconstructed loss and discriminator loss satisfies a predetermined criterion.
  • generating the multiple intermediate images further comprises predicting an intermediate image between the first image and the second image during the corresponding optical flow; and interpolating the first image and the second image to generate the corresponding optical flow in which to generate the fake fatigued-state video of the driver.
  • generating the multiple intermediate images further comprises receiving a sequence of intermediate images arranged in an input order; processing the sequence of intermediate images using an encoder to convert the sequence of intermediate images into an alternative representation of the sequence of intermediate images; and processing the alternative representation of the sequence of intermediate images using a decoder to generate a target sequence of the sequence of intermediate images, the target sequence including multiple outputs arranged according to an output order.
  • the first representation maps the multiple second facial expression images to the first representation through a learned distribution.
  • the second representation maps the third facial expression image to the second representation through a learned distribution.
  • a device for training an application to recognize driver fatigue comprising a non-transitory memory storage comprising instructions; and one or more processors in communication with the memory, wherein the one or more processors execute the instructions to: generate multiple first facial expression images from multiple second facial expression images using a first neural network, wherein the multiple first facial expression images are reconstructed from a first representation of the multiple second facial expression images learned from the first neural network; generate a first image, expressing a current level of fatigue, from a third facial expression image and a second image, expressing a level of fatigue preceding the current level of fatigue, based on the first representation using a second neural network, wherein the first and second images are reconstructed from the first representation and a second representation of the third facial expression image learned from the second neural network; generate multiple intermediate images of interpolated video data from the first and second images during a corresponding optical flow, where the optical flow is formed by fusing the first and second images and is located in a time frame between the first and second images; and compile a fake fatigued-state video of a driver using at least the first and second images and the multiple intermediate images of the interpolated video data in which to train the application to detect driver fatigue.
  • a non-transitory computer-readable medium storing computer instructions for training an application to recognize driver fatigue, that when executed by one or more processors, cause the one or more processors to perform the steps of generating multiple first facial expression images from multiple second facial expression images using a first neural network, wherein the multiple first facial expression images are reconstructed from a first representation of the multiple second facial expression images learned from the first neural network; generating a first image, expressing a current level of fatigue, from a third facial expression image and a second image, expressing a level of fatigue preceding the current level of fatigue, based on the first representation using a second neural network, wherein the first and second images are reconstructed from the first representation and a second representation of the third facial expression image learned from the second neural network; generating multiple intermediate images of interpolated video data from the first and second images during a corresponding optical flow, where the optical flow is formed by fusing the first and second images and is located in a time frame between the first and second images; and compiling a fake fatigued-state video of a driver using at least the first and second images and the multiple intermediate images of the interpolated video data in which to train the application to detect driver fatigue.
  • FIG. 1A illustrates a driver monitoring system according to an embodiment of the present technology.
  • FIG. 1B illustrates a detailed example of the driver monitoring system in accordance with FIG. 1A.
  • FIG. 2 illustrates an example of an expression recognition network.
  • FIG. 3 illustrates an example facial fatigue level generator network.
  • FIG. 4A illustrates a video prediction and interpolation network.
  • FIG. 4B illustrates an example frame interpolation network in accordance with FIG. 4A.
  • FIG. 4C illustrates an example of the video prediction and interpolation network of FIG. 4A with an expanded view of the LSTM auto-encoder.
  • FIGS. 5A - 5D illustrate example flow diagrams in accordance with embodiments of the present technology.
  • FIG. 6 illustrates a computing system upon which embodiments of the disclosure may be implemented.
  • the technology relates to detection of driver fatigue, also known as driver drowsiness, tiredness and sleepiness, for a specific driver using an application trained from a fake fatigue-state video dataset.
  • Traditional datasets used to train applications to detect driver fatigue are typically based on public datasets that are not specific to individual drivers. Oftentimes, this results in the application detecting driver fatigue when none exists, or failing to detect driver fatigue when it does.
  • the disclosed technology generates personalized fake fatigue-state video datasets that are associated with a specific or individual driver. The datasets are generated by interpolating a sequence of images and predicting a next frame or sequence of images using various machine learning techniques and neural networks.
  • FIG. 1A illustrates a driver distraction system according to an embodiment of the present technology.
  • the driver distraction system 106 is shown as being installed or otherwise included within a vehicle 101 that also includes a cabin within which a driver 102 can sit.
  • the driver distraction system 106, or one or more portions thereof, can be implemented by an in-cabin computer system, and/or by a mobile computing device, such as, but not limited to, a smartphone, tablet computer, notebook computer, laptop computer, and/or the like.
  • the driver fatigue system 106 obtains (or collects), from one or more sensors, current data for a driver 102 of a vehicle 101. In other embodiments, the driver fatigue system 106 also obtains (or collects), from one or more databases 140, additional information about the driver 102 as it relates to features of the driver, such as facial features, historical head pose and eye gaze information, etc. The driver fatigue system 106 analyzes the current data and/or the additional information for the driver 102 of the vehicle 101 to thereby identify a driver’s head pose and eye gaze. In one embodiment, the driver fatigue system 106 additionally monitors and collects vehicle data and scene information, as described below. Such analysis may be performed using one or more computer implemented neural networks and/or some other computer implemented model, as explained below.
  • the driver fatigue system 106 is communicatively coupled to a capture device 103, which may be used to obtain current data for the driver of the vehicle 101 along with the vehicle data and scene information.
  • the capture device 103 includes sensors and other devices that are used to obtain current data for the driver 102 of the vehicle 101.
  • the captured data may be processed by processor(s) 104, which includes hardware and/or software to detect and track driver movement, head pose and gaze direction.
  • the capture device may additionally include one or more cameras, microphones or other sensors to capture data.
  • the capture device 103 may capture a forward facing scene of the route (e.g., the surrounding environment and/or scene information) on which the vehicle is traveling.
  • Forward facing sensors may include, for example, radar sensors, laser sensors, lidar sensors, optical imaging sensors, etc. It is appreciated that the sensors may also cover the sides, rear and top (upward and downward facing) of the vehicle 101.
  • the capture device 103 can be external to the driver fatigue system 106, as shown in FIG. 1A, or can be included as part of the driver fatigue system 106, depending upon the specific implementation. Additional details of the driver fatigue system 106, according to certain embodiments of the present technology, are described below with reference to FIG. 1 B.
  • the driver fatigue system 106 is also shown as being communicatively coupled to various different types of vehicle related sensors 105 that are included within the vehicle 101.
  • vehicle related sensors 105 can include, but are not limited to, a speedometer, a global positioning system (GPS) receiver, and a clock.
  • the driver fatigue system 106 is also shown as being communicatively coupled to one or more communication network(s) 130 that provide access to one or more database(s) 140 and/or other types of data stores.
  • the database(s) 140 and/or other types of data stores can store vehicle data for the vehicle 101. Examples of such data include, but are not limited to, driving record data, driving performance data, driving license type data, driver facial features, driver head pose, driver gaze, etc.
  • Such data can be stored within a local database or other data store that is located within the vehicle 101. However, the data is likely stored in one or more database(s) 140 or other data store(s) remotely located relative to the vehicle 101. Accordingly, such database(s) 140 or other data store(s) can be communicatively coupled to the driver distraction system via one or more communication networks(s) 130.
  • the communication network(s) 130 can include a data network, a wireless network, a telephony network, or any combination thereof. It is contemplated that the data network may be any local area network (LAN), metropolitan area network (MAN), wide area network (WAN), a public data network (e.g., the Internet), short range wireless network, or any other suitable packet-switched network.
  • the wireless network may be, for example, a cellular network and may employ various technologies including enhanced data rates for global evolution (EDGE), general packet radio service (GPRS), global system for mobile communications (GSM), Internet protocol multimedia subsystem (IMS), universal mobile telecommunications system (UMTS), etc., as well as any other suitable wireless medium, e.g., worldwide interoperability for microwave access (WiMAX), Long Term Evolution (LTE) networks, code division multiple access (CDMA), wideband code division multiple access (WCDMA), wireless fidelity (Wi-Fi), wireless LAN (WLAN), Bluetooth®, Internet Protocol (IP) data casting, satellite, mobile ad-hoc network (MANET), and the like, or any combination thereof.
  • the communication network(s) 130 can provide communication capabilities between the driver distraction system 106 and the database(s) 140 and/or other data stores, for example, via communication device 120 (FIG. 1B).
  • While the embodiments of FIG. 1A are described with reference to a vehicle 101, it is appreciated that the disclosed technology may be employed in a wide range of technological areas and is not limited to vehicles. For example, in addition to vehicles, the disclosed technology could be used in virtual or augmented reality devices or in simulators in which head pose and gaze estimations, vehicle data and/or scene information may be required. Additional details of the driver fatigue system 106, according to certain embodiments of the present technology, will now be described with reference to FIG. 1B.
  • the driver fatigue system 106 includes a capture device 103, one or more processors 108, a vehicle system 104, a machine learning engine 109, an input/output (I/O) interface 114, a memory 116, a visual/audio alert 118, a communication device 120 and database 140 (which may also be part of the driver fatigue system).
  • the capture device 103 may be responsible for monitoring and identifying driver behaviors (including fatigue) based on captured driver motion and/or audio data using one or more capturing devices positioned within the cab, such as sensor 103A, camera 103B or microphone 103C.
  • the capture device 103 is positioned to capture motion of the driver's head and face, while in other implementations movement of the driver's torso, and/or driver's limbs and hands are also captured.
  • the detection and tracking 108A, head pose estimator 108B and gaze direction estimator 108C can monitor driver motion captured by capture device 103 to detect specific poses, such as head pose, or whether the person is looking in a specific direction.
  • Still other embodiments include capturing audio data, via microphone 103C, along with or separate from the driver movement data.
  • the captured audio may be, for example, an audio signal of the driver 102 captured by microphone 103C.
  • the audio can be analyzed to detect various features that may vary in dependence on the state of the driver. Examples of such audio features include driver speech, passenger speech, music, etc.
  • each component (e.g., sensor, camera, microphone, etc.) of the capture device 103 may be a separate component located in a different area of the vehicle 101.
  • the sensor 103A, the camera 103B, the microphone 103C and the depth sensor 103D may each be located in a different area of the vehicle’s cab.
  • individual components of the capture device 103 may be part of another component or device.
  • camera 103B and visual/audio alert 118 may be part of a mobile phone or tablet (not shown) placed in the vehicle’s cab, whereas sensor 103A and microphone 103C may be individually located in a different place in the vehicle’s cab.
  • the detection and tracking 108A monitors facial features of the driver
  • facial features include, but are not limited to, points (or facial landmarks) surrounding the eyes, nose, and mouth regions as well as points outlining contoured portions of the detected face of the driver 102.
  • initial locations for one or more eye features of an eyeball of the driver 102 can be detected.
  • the eye features may include an iris and first and second eye corners of the eyeball.
  • detecting the location for each of the one or more eye features includes detecting a location of an iris, detecting a location for the first eye corner and detecting a location for a second eye corner.
  • the head pose estimator 108B uses the monitored facial features to estimate a head pose of the driver 102.
  • the term “head pose” describes an angle referring to the relative orientation of the driver's head with respect to a plane of the capture device 103.
  • the head pose includes yaw and pitch angles of the driver's head in relation to the capture device plane.
  • the head pose includes yaw, pitch and roll angles of the driver's head in relation to the capture device plane.
  • the gaze direction estimator 108C estimates the driver's gaze direction (and gaze angle).
  • the capture device 103 may capture an image or group of images (e.g., of a driver of the vehicle).
  • the capture device 103 may transmit the image(s) to the gaze direction estimator 108C, where the gaze direction estimator 108C detects facial features from the images and tracks (e.g., over time) the gaze of the driver.
  • One such gaze direction estimator is the eye tracking system by Smart Eye AB®.
  • the gaze direction estimator 108C may detect eyes from a captured image. For example, the gaze direction estimator 108C may rely on the eye center to determine gaze direction. In short, the driver may be assumed to be gazing forward relative to the orientation of his or her head. In some embodiments, the gaze direction estimator 108C provides more precise gaze tracking by detecting pupil or iris positions or using a geometric model based on the estimated head pose and the detected locations for each of the iris and the first and second eye corners. Pupil and/or iris tracking enables the gaze direction estimator 108C to detect gaze direction de-coupled from head pose.
  • Drivers often visually scan the surrounding environment with little or no head movement (e.g., glancing to the left or right (or up or down) to better see items or objects outside of their direct line of sight). These visual scans frequently occur with regard to objects on or near the road (e.g., to view road signs, pedestrians near the road, etc.) and with regard to objects in the cabin of the vehicle (e.g., to view console readings such as speed, to operate a radio or other in-dash devices, or to view/operate personal mobile devices). In some instances, a driver may glance at some or all of these objects (e.g., out of the corner of his or her eye) with minimal head movement. By tracking the pupils and/or iris, the gaze direction estimator 108C may detect upward, downward, and sideways glances that would otherwise go undetected in a system that simply tracks head position.
  • the gaze direction estimator 108C may cause the processor(s) 108 to determine a gaze direction (e.g., for a gaze of an operator at the vehicle).
  • the gaze direction estimator 108C receives a series of images (and/or video).
  • the gaze direction estimator 108C may detect facial features in multiple images (e.g., a series or sequence of images). Accordingly, the gaze direction estimator 108C may track gaze direction over time and store such information, for example, in database 140.
  • the processor 108, in addition to the aforementioned pose and gaze detection, may also include an image corrector 108D, video enhancer 108E, video scene analyzer 108F and/or other data processing and analytics to determine scene information captured by capture device 103.
  • Image corrector 108D receives captured data, which may undergo correction such as video stabilization. For example, bumps on the roads may shake, blur, or distort the data. The image corrector may stabilize the images against horizontal and/or vertical shake, and/or may correct for panning, rotation, and/or zoom.
  • Video enhancer 108E may perform additional enhancement or processing in situations where there is poor lighting or high data compression. Video processing and enhancement may include, but are not limited to, gamma correction, de-hazing, and/or de-blurring. Other video processing enhancement algorithms may operate to reduce noise in the input of low lighting video followed by contrast enhancement techniques, such as, but not limited to, tone-mapping, histogram stretching and equalization, and gamma correction to recover visual information in low lighting videos.
  • the video scene analyzer 108F may recognize the content of the video coming in from the capture device 103.
  • the content of the video may include a scene or sequence of scenes from a forward facing camera 103B in the vehicle.
  • Analysis of the video may involve a variety of techniques, ranging from low-level content analysis, such as feature extraction, structure analysis, object detection, and tracking, to high-level semantic analysis, such as scene analysis, event detection, and video mining.
  • by recognizing the content of the incoming video signals, it may be determined if the vehicle 101 is driving along a freeway or within city limits, if there are any pedestrians, animals, or other objects/obstacles on the road, etc.
  • the image data may be prepared in a manner that is specific to the type of analysis being performed. For example, image correction to reduce blur may allow video scene analysis to be performed more accurately by clearing up the appearance of edge lines used for object recognition.
  • Vehicle system 104 may provide a signal corresponding to any status of the vehicle, the vehicle surroundings, or the output of any other information source connected to the vehicle.
  • Vehicle data outputs may include, for example, analog signals (such as current velocity), digital signals provided by individual information sources (such as clocks, thermometers, location sensors such as Global Positioning System [GPS] sensors, etc.), digital signals propagated through vehicle data networks (such as an engine controller area network (CAN) bus through which engine related information may be communicated, a climate control CAN bus through which climate control related information may be communicated, and a multimedia data network through which multimedia data is communicated between multimedia components in the vehicle).
  • the vehicle system 104 may retrieve from the engine CAN bus the current speed of the vehicle estimated by the wheel sensors, a power state of the vehicle via a battery and/or power distribution system of the vehicle, an ignition state of the vehicle, etc.
  • Input/output interface(s) 114 allow information to be presented to the user and/or other components or devices using various input/output devices.
  • input devices include a keyboard, a microphone, touch functionality (e.g., capacitive or other sensors that are configured to detect physical touch), a camera (e.g., which may employ visible or non-visible wavelengths such as infrared frequencies to recognize movement as gestures that do not involve touch), and so forth.
  • Examples of output devices include a visual/audio alert 118, such as a display, speakers, and so forth.
  • I/O interface 114 receives the driver motion data and/or audio data of the driver 102 from the capturing device 103.
  • the driver motion data may be related to, for example, the eyes and face of the driver 102, which may be analyzed by processor(s) 108.
  • Data collected by the driver fatigue system 106 may be stored in database 140, in memory 116 or any combination thereof.
  • the data collected is from one or more sources external to the vehicle 101.
  • the stored information may be data related to driver distraction and safety, such as information captured by capture device 103.
  • the data stored in database 140 may be a collection of data collected for one or more drivers of vehicle 101.
  • the collected data is head pose data for a driver of the vehicle 101.
  • the collected data is gaze direction data for a driver of the vehicle 101.
  • the collected data may also be used to generate datasets and information that may be used to train models for machine learning, such as machine learning engine 109.
  • memory 116 can store instructions executable by the processor(s) 108, a machine learning engine 109, and programs or applications (not shown) that are loadable and executable by processor(s) 108.
  • machine learning engine 109 comprises executable code stored in memory 116 that is executable by processor(s) 108 and selects one or more machine learning models stored in memory 116 (or database 140).
  • the machine models can be developed and trained using well known and conventional machine learning and deep learning techniques, such as implementation of a convolutional neural network (CNN), using, for example, datasets generated in accordance with the embodiments described below.
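  • As a hedged, minimal sketch (not the patented implementation), the snippet below shows how such a CNN could be trained on frames drawn from a generated fatigued-state video dataset; the number of fatigue levels, the tensor sizes and the random stand-in data are illustrative assumptions only.

```python
# Minimal sketch: a small CNN fatigue-level classifier trained on frames from a
# generated ("fake") fatigued-state video. All sizes and labels are assumed.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

NUM_LEVELS = 4  # assumed fatigue levels, e.g., L0 (alert) .. L3 (very drowsy)

class FatigueCNN(nn.Module):
    def __init__(self, num_levels: int = NUM_LEVELS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_levels)

    def forward(self, x):
        x = self.features(x)               # (N, 32, 16, 16) for 64x64 input
        return self.classifier(x.flatten(1))

# Stand-in for frames sampled from the compiled fatigued-state video dataset.
frames = torch.rand(256, 3, 64, 64)
levels = torch.randint(0, NUM_LEVELS, (256,))
loader = DataLoader(TensorDataset(frames, levels), batch_size=32, shuffle=True)

model, loss_fn = FatigueCNN(), nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(3):                     # short illustrative training loop
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
```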
  • FIG. 2 illustrates an example of an expression recognition network.
  • the expression recognition network 202 receives arbitrary facial images 201A, which may be captured using a capture device, such as camera 103B, a scanner, a database of images, such as database 140, and the like.
  • the arbitrary facial images 201A, which may be arbitrary in nature, are processed by the expression recognition network 202, which has an auto-encoding style network architecture, to output facial expression images 201B.
  • the representation learned during auto-encoding learning will then be used to assist in forming a dataset to train machine models, such as those described above.
  • the machine models may then be used to generate a fake fatigued-state video of drivers, which may be used in conjunction with personalized data to train an application to detect driver fatigue (i.e. , drowsiness, sleepiness or tiredness) of specific drivers.
  • the input arbitrary facial images 201A are arbitrary facial expressions, such as anger, fear or neutral images.
  • the output facial expression images 201 B are facial expression or emotion images that have been classified into the categories or classes, such as disgust, sadness, joy or surprise.
  • the expression recognition network 202 generates facial expression images 201 B from input arbitrary facial images 201 A using a neural network, such as an auto-encoder (AE) or a conditional variational auto-encoder (CVAE).
  • the expression recognition network 202 effectively aims to learn a latent or learned representation (or code), i.e., learned representation z_g, which generates an output expression 201B from the arbitrary facial images 201A.
  • an arbitrary image of fear may generate a facial expression image of surprise using the learned representation z_g.
  • Learning occurs in layers (e.g., encoder and decoder layers) attached to the learned representation z.
  • the input arbitrary facial image 201 A is input into a first layer (e.g., encoder 204).
  • the learned representation z_g compresses (reduces) the size of the input arbitrary facial images 201A.
  • Reconstruction of the input arbitrary facial images 201 A occurs in a second layer (e.g., decoder 206), which outputs the facial expression images 201 B that correspond to the input arbitrary facial image 201 A.
  • the expression recognition network 202 is trained to encode the input arbitrary facial images 201A into a learned representation z_g, such that the input arbitrary facial images 201A can be reconstructed from the learned representation z_g.
  • the learning consists of minimizing a reconstruction error with respect to the encoding and decoding.
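  • As a minimal illustration, assuming the standard auto-encoder formulation (the embodiment may differ in detail), the reconstruction objective over an encoder E (204) and a decoder D (206) can be written as:

```latex
\min_{\theta,\phi}\; \mathbb{E}_{x \sim X}\Big[\big\lVert x - D_{\phi}\big(E_{\theta}(x)\big)\big\rVert^{2}\Big],
\qquad z_{g} = E_{\theta}(x)
```

where x is an arbitrary facial image 201A and z_g is the learned representation.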
  • the learned representation z_g may then be used in training additional machine models, as explained below with reference to FIG. 3.
  • FIG. 3 illustrates a facial fatigue level generator network.
  • the facial fatigue level generator network 302 includes a CVAE 304 and a generative adversarial network (GAN) 306.
  • the facial fatigue level generator network 302 receives content, such as a sequence of images or video, that is processed to identify whether the input content is "real" or "fake" content.
  • the CVAE 304 is coupled to receive the content that is processed to output a reconstructed version of the content.
  • the CVAE 304 receives a flow F_{i-1→i} of facial expression images, where the flow F includes frames of images from the (i-1)-th to the i-th frame.
  • the flow F_{i-1→i} of facial expression images includes facial expression images from the specific individual with a natural or neutral facial expression (e.g., the specific individual's facial expression shown in a normal or plain state of expression).
  • the facial expression image is a facial fatigue image.
  • the CVAE 304 includes an encoder 304A and a decoder (or generator) 306A.
  • the encoder 304A receives the flow F_{i-1→i} of facial expression images at the different levels L_0, L_{i-1} to L_i, and maps each of the facial expression images to a learned representation z_i through a learned distribution P(z|x,c), where "c" is the category or class of the data, "x" is the image, and z = z_i + z_g. That is, the flow of facial expression images is transformed into the learned representation z_i (e.g., a feature vector), which may be thought of as a compressed representation of the input to the encoder 304A.
  • the encoder 304A is a convolutional neural network (CNN).
  • the decoder 306A serves to invert the output of the encoder 304A using the learned representation z_i concatenated with the learned representation z_g (FIG. 2), as shown.
  • the concatenated learned representation (z_i + z_g) is then used to generate a reconstructed version of the input from the encoder 304A. This reconstruction of the input is referred to as the reconstructed image at the corresponding fatigue level.
  • the reconstructed image represents a facial expression image showing the corresponding level of fatigue of the specific individual.
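  • The text above names only the learned conditional distribution P(z|x,c); as a hedged sketch, an encoder/decoder pair of this kind is typically trained by maximizing the standard conditional variational auto-encoder lower bound:

```latex
\mathcal{L}_{\mathrm{CVAE}}(x, c) \;=\;
\mathbb{E}_{q_{\phi}(z \mid x, c)}\big[\log p_{\theta}(x \mid z, c)\big]
\;-\; D_{\mathrm{KL}}\big(q_{\phi}(z \mid x, c)\,\big\|\,p(z \mid c)\big)
```

where c is the fatigue level or class and z is the learned representation z_i (here concatenated with z_g before decoding).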
  • the GAN 306 includes a generator (or decoder) 306A and a discriminator 306B.
  • the GAN 306 is a CNN.
  • the generator 306A receives the concatenated learned representation (z_i + z_g) as input and outputs the reconstructed image as explained above.
  • the GAN 306 also includes a discriminator 306B.
  • the discriminator 306B is coupled to receive the original content and the reconstructed content (e.g., the reconstructed image) from the generator 306A and predicts whether the reconstructed content is real or fake.
  • parameters of the discriminator 306B are configured to discriminate between training and reconstructed versions of content based on the differences between the two versions that arise during the encoding process. For example, the discriminator 306B receives as input the reconstructed image (or a natural facial image). To predict whether the reconstructed image is real or fake, the discriminator 306B evaluates the loss functions LOSS_GD and LOSS_EP.
  • LOSS_GD and LOSS_EP may be calculated using loss functions in which D() is the discriminator, G() is the generator, E[] is the expectation and z is the learned representation (or code).
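  • As a hedged sketch (the exact forms of LOSS_GD and LOSS_EP are assumed here), standard adversarial losses over a discriminator D() and a generator G(), with E[] the expectation and z the learned code, take the form:

```latex
\mathcal{L}_{D} = -\,\mathbb{E}_{x}\big[\log D(x)\big] \;-\; \mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big],
\qquad
\mathcal{L}_{G} = -\,\mathbb{E}_{z}\big[\log D(G(z))\big]
```

and a reconstructed loss of the kind described above may be an L1 or L2 distance between the reconstructed image and the ground truth image at the same level.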
  • the facial fatigue generation network 302 predicts whether the reconstructed image is real or fake using the loss functions, where a value of the minimized loss function for real images should be lower than the minimized loss function for fake images.
  • original content is assigned a label of 1 (real)
  • reconstructed content determined to be fake is assigned a label of 0 (fake).
  • the discriminator 306B may predict the input content to be a reconstructed (i.e., fake) version when the corresponding discrimination prediction is below a threshold value, and may predict the input content to be real (a real image) when the corresponding prediction is above a threshold value.
  • the image is replaced with the real image and the image
  • the discriminator 306B outputs a corresponding discrimination prediction.
  • the generator and/or the discriminator can include various types of machine-learned models.
  • Machine-learned models can include linear models and non-linear models.
  • machine-learned models can include regression models, support vector machines, decision tree-based models, Bayesian models, and/or neural networks (e.g., deep neural networks).
  • Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks or other forms of neural networks.
  • while the generator and discriminator are sometimes referred to as "networks", they are not necessarily limited to being neural networks but can also include other forms of machine-learned models.
  • FIG. 4A illustrates a video prediction and interpolation network.
  • the video prediction and interpolation network 402 includes a frame interpolation network 402A and a long short term memory (LSTM) auto-encoder network 402B.
  • the LSTM effectively preserves motion trends (patterns) and transfers the motion trends to predicted frames, while the interpolation network generates intermediate images from broader frames. Thus, frames may be interpolated while the motion trend is maintained.
  • the frame interpolation network 402A is used to generate new frames from original frames of video content. In doing so, the network predicts one or more intermediate images at timesteps (or timestamps) defined between two consecutive frames.
  • a first neural network 410 approximates optical flow data defining motion between the two consecutive frames.
  • a second neural network 412 refines the optical flow data and predicts visibility maps for each timestep. The two consecutive frames are warped according to the refined optical flow data for each timestep to produce pairs of warped frames for each timestep. The second neural network then fuses the pair of warped frames based on the visibility maps to produce the intermediate images for each timestep. Artifacts caused by motion boundaries and occlusions are reduced in the predicted intermediate images.
  • when the frame interpolation network 402A is provided with two input images, such as images at times i and i+1, an intermediate (or interpolated) image can be predicted at a time t ∈ (i, i+1).
  • a CNN may be leveraged to compute the optical flow.
  • a CNN can be trained, using the two input images, to jointly predict the bi-directional optical flow between the two input images.
  • the frame interpolation network 402A may receive images for which the forward and backward optical flows between consecutive frames (e.g., F_{t→t+1} and F_{t+1→t}) have been computed.
  • the frame interpolation network processes the input images and outputs intermediate images
  • the LSTM auto-encoder network 402B learns representations of image sequences.
  • the LSTM auto-encoder network 402B uses recurrent neural nets (RNNs) made of LSTM units or memory blocks to perform learning.
  • a first RNN is an encoder that maps an input image sequence 404 (e.g., a sequence of image frames) into a fixed length representation, which is then decoded using a second RNN, such as a decoder.
  • the input image sequence 404 is processed by the encoder in the LSTM auto-encoder network 402B to generate the representation for the input image sequence 404.
  • the learned representation generated using the input sequence is then processed using the decoder in the LSTM auto-encoder network 402B.
  • the decoder outputs a prediction for a generated target sequence for the input sequence.
  • the target sequence is the same as the input sequence in reverse order.
  • the decoder in the LSTM auto-encoder network 402B includes one or more LSTM layers and is configured to receive a current output in the target sequence so as to generate a respective output score.
  • the output score for a given output represents the likelihood that the output is the next output in the target sequence, i.e., predicts whether the output represents the next output in the target sequence.
  • the decoder also updates the hidden state of the network to generate an updated hidden state.
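  • A minimal PyTorch sketch of such an LSTM auto-encoder is shown below; the use of per-frame feature vectors (rather than raw frames), the dimensions, and the teacher-forced decoding toward the reversed input sequence are illustrative assumptions rather than the patented design.

```python
# Minimal sketch: an LSTM auto-encoder that maps an image-feature sequence to a
# fixed-length representation and decodes a target sequence (the input reversed).
import torch
import torch.nn as nn

class LSTMAutoEncoder(nn.Module):
    def __init__(self, feat_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, feat_dim)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, time, feat_dim), e.g., per-frame embeddings of the
        # interpolated video frames.
        _, (h, c) = self.encoder(seq)        # fixed-length representation
        target = torch.flip(seq, dims=[1])   # target = input in reverse order
        # Teacher forcing: shift the target right by one step for decoder input.
        dec_in = torch.cat([torch.zeros_like(target[:, :1]), target[:, :-1]], dim=1)
        dec_out, _ = self.decoder(dec_in, (h, c))
        return self.output(dec_out)          # prediction of the target sequence

model = LSTMAutoEncoder()
clip = torch.rand(8, 10, 256)                # 8 clips, 10 frames, 256-d features
recon = model(clip)
loss = nn.functional.mse_loss(recon, torch.flip(clip, dims=[1]))
```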
  • FIG. 4B illustrates an example frame interpolation network in accordance with FIG. 4A.
  • the frame interpolation network 402A includes encoder 410 and decoder 412 that fuse warped input images to generate the intermediate image. More specifically, the two input images are warped to the timestep of the intermediate image before being fused.
  • a flow computation CNN is used to estimate the bi-direction optical flow between the two input images and a flow interpolation CNN is used to refine the flow approximations and predict visibility maps. The visibility maps may then be applied to the two warped images prior to fusing, such that artifacts in the interpolated intermediate image are reduced.
  • the flow computation CNN and the flow interpolation CNN each use a U-Net architecture, as described in "U-Net: Convolutional networks for biomedical image segmentation," MICCAI, 2015.
  • for each of the input images, the frame interpolation network 402A includes a flow interpolation network.
  • the encoder 410 receives a sequential image pair at timestamps (t, t+1). Bi-directional optical flows are computed based on the sequential image pair. The bi-directional optical flows are linearly combined to approximate intermediate bi-directional optical flows for at least one timestep t between the two input images in the sequential image pair.
  • Each of the input images is warped (backward) according to the approximated intermediate bi-directional optical flows for each timestep to produce warped input frames I_{i→t} and I_{i+1→t}.
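  • The following sketch illustrates, under an assumed Super SloMo-style linear approximation, how the intermediate bi-directional flows can be formed and the input frames backward-warped with grid sampling; tensor shapes, flow values and variable names are illustrative only.

```python
# Minimal sketch: approximate intermediate flows at timestep t and backward-warp
# the two input frames toward t. Not the patented implementation.
import torch
import torch.nn.functional as F

def backward_warp(img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp img (N,C,H,W) by a per-pixel flow (N,2,H,W) given in pixels."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1,2,H,W)
    coords = base + flow
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                # normalize to [-1,1]
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)               # (N,H,W,2)
    return F.grid_sample(img, grid, align_corners=True)

# Two consecutive frames and their (precomputed) bi-directional optical flows.
I0, I1 = torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)
F01, F10 = torch.randn(1, 2, 128, 128), torch.randn(1, 2, 128, 128)

t = 0.5                                      # intermediate timestep in (0, 1)
F_t0 = -(1 - t) * t * F01 + t * t * F10      # approximated flow from t to frame 0
F_t1 = (1 - t) * (1 - t) * F01 - t * (1 - t) * F10
warped0 = backward_warp(I0, F_t0)            # frame 0 warped toward time t
warped1 = backward_warp(I1, F_t1)            # frame 1 warped toward time t
```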
  • the decoder 412 includes a flow refinement network (not shown) that corresponds to each warping unit and an image predictor (not shown) to predict the intermediate image I_t at a time t ∈ (i, i+1).
  • the intermediate bi-directional optical flows are refined for each timestep
  • the refined intermediate bi-directional optical flows (F_{t→i}, F_{t→i+1}) are output and processed by an image prediction unit to produce the intermediate image.
  • the image prediction unit receives the warped input frames I_{i→t} and I_{i+1→t} generated by the optical flow warping units, and the warped input frames are linearly fused by the image prediction unit to produce the intermediate image for each timestep.
  • visibility maps are applied to the two warped images.
  • a flow refinement network in the decoder 412 predicts visibility maps V_{t→i} and V_{t→i+1} for each timestep. Since visibility maps are used, when a pixel is visible in both input images, the decoder 412 learns to adaptively combine the two warped images, which reduces artifacts.
  • the visibility maps are applied to the warped images before the warped images are linearly fused so as to produce the intermediate image for each timestep.
  • the intermediate images are synthesized by linearly fusing the two warped input images, weighted by the visibility maps and the temporal position of the timestep.
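  • As a hedged sketch, a standard Super SloMo-style synthesis consistent with this description, with g(·,·) denoting backward warping, V the visibility maps, ⊙ element-wise multiplication and Z a normalization term, is:

```latex
\hat{I}_{t} \;=\; \frac{1}{Z}\Big((1-t)\,V_{t \to i}\odot g\big(I_{i}, F_{t \to i}\big)
\;+\; t\,V_{t \to i+1}\odot g\big(I_{i+1}, F_{t \to i+1}\big)\Big),
\qquad
Z \;=\; (1-t)\,V_{t \to i} \;+\; t\,V_{t \to i+1}
```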
  • FIG. 4C illustrates an example of the video prediction and interpolation network of FIG. 4A with an expanded view of the LSTM auto-encoder.
  • the basic building block of the LSTM auto-encoder 402B is an LSTM memory cell, represented by RNN.
  • Each LSTM memory cell has a state at time t.
  • each LSTM memory block may include one or more cells.
  • Each cell includes an input gate, a forget gate, and an output gate that allow the cell to store previous activations generated by the cell, e.g., as a hidden state for use in generating a current activation or to be provided to other components of the LSTM auto-encoder 402B.
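  • For reference, a standard LSTM cell (an assumed formulation, not specific to this embodiment) updates its input gate i_t, forget gate f_t, output gate o_t, cell state c_t and hidden state h_t as:

```latex
\begin{aligned}
i_{t} &= \sigma\big(W_{i} x_{t} + U_{i} h_{t-1} + b_{i}\big)\\
f_{t} &= \sigma\big(W_{f} x_{t} + U_{f} h_{t-1} + b_{f}\big)\\
o_{t} &= \sigma\big(W_{o} x_{t} + U_{o} h_{t-1} + b_{o}\big)\\
c_{t} &= f_{t} \odot c_{t-1} + i_{t} \odot \tanh\big(W_{c} x_{t} + U_{c} h_{t-1} + b_{c}\big)\\
h_{t} &= o_{t} \odot \tanh(c_{t})
\end{aligned}
```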
  • these LSTM memory blocks form the RNNs in which to perform learning.
  • encoder 403 consists of multilayered RNNs.
  • Each of the encoders 403 receives a single element of the input image sequence along with a corresponding intermediate image generated by the frame interpolation network 402A - 402N, respectively.
  • the input sequence which is a collection of images collected by the
  • the current hidden state is updated, i.e., the current hidden state that has been generated by processing previous inputs from the input sequence is modified by processing the current received input.
  • a respective weight w_i may then be applied to the previous hidden state and the input vector.
  • the learned representation of the input sequence is then processed using decoder 405 to generate the target sequence for the input sequence.
  • the decoder 405 also includes multilayered RNNs, where the arrows show the direction of information flow, which predicts an output at timestep t.
  • Each RNN accepts a hidden state from the previous element and produces and outputs its own hidden state.
  • the outputs are calculated using the hidden state at the current timestep together with a respective weight w_i.
  • the final output may be determined using a probability vector using, for example, Softmax or some other known function.
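  • For completeness, the softmax referred to above converts the decoder outputs o into a probability vector whose k-th entry is:

```latex
\operatorname{softmax}(o)_{k} \;=\; \frac{e^{o_{k}}}{\sum_{j} e^{o_{j}}}
```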
  • FIGS. 5A - 5D illustrate example flow diagrams in accordance with embodiments of the present technology.
  • the flow diagrams may be computer-implemented methods performed, at least partly, by hardware and/or software components illustrated in the various figures and as described herein.
  • the disclosed process may be performed by the driver fatigue system 106 disclosed in FIGS. 1A and 1 B.
  • FIG. 5A illustrates a flow diagram of compilation of a fake fatigued-state video dataset.
  • the dataset which is generated using transfer learning techniques, may be used to train an application to recognize driver fatigue.
  • a neural network, such as an AE or CVAE, generates facial expression images learned from arbitrary facial images.
  • the facial expression images are reconstructed from a learned representation of the arbitrary facial images learned from the neural network.
  • the learned representation may then be applied during a training stage of a fatigue level generator network.
  • step 504 another neural network is trained using a neutral, natural or normal facial image in which little to no expression is visible, and a flow of images expressing varying levels of fatigue.
  • from a neutral facial image and a flow of images, an image expressing a current level of fatigue is generated.
  • the flow of images changes such that the current image becomes the preceding image and a new current image is generated.
  • an image expressing a current level of fatigue is generated from the neutral facial image and the image expressing a level of fatigue preceding the current level of fatigue based on the learned representation.
  • the image expressing a current level of fatigue (or the reconstructed image) is reconstructed from the representation learned in step 502 and a second representation of the neutral facial image learned from the neural network.
  • the reconstructed image may then be compared to a ground truth model to determine whether the reconstructed image is real or fake, as discussed below.
  • Intermediate images of interpolated video data (sequential image data) from the reconstructed facial expression and arbitrary facial images during a corresponding optical flow are generated at step 506.
  • the optical flow is formed by fusing the reconstructed facial expression and arbitrary facial images and is located in a time frame between the reconstructed facial expression and arbitrary facial images.
  • a fake fatigued-state video (i.e. , the dataset) of a driver is compiled using at least the reconstructed facial expression and arbitrary facial images and the intermediate images of the interpolated video data in which to train the application to detect driver fatigue.
  • FIG. 5B illustrates an example flow diagram of a neural network, such as CVAE 304.
  • the CVAE includes an encoder and a decoder.
  • the facial expression image, including the level, and the neutral facial image are encoded and parameters describing a distribution for each dimension of the learned representation are output.
  • the distribution for each dimension of the learned representation is decoded and the relationship of each parameter with respect to an output loss is calculated to reconstruct the neutral facial image and the facial expression image as a reconstructed image, at step 512.
  • the reconstructed image is compared to the neutral image to generate a discriminator loss, and the reconstructed image is compared to a ground truth at a same level to generate a reconstructed loss, at step 516. Based on the discriminator loss and the reconstructed loss, a prediction is made as to the likelihood that the reconstructed image has an appearance that corresponds to the neutral image, at step 518.
  • the reconstructed image is output as the real image and backward propagated to the input of the CVAE for the next iteration at step 520.
  • the neural network in step 502 maps the arbitrary facial images to a corresponding learned representation at 522, and the learned representation is mapped to the facial expression images with a same shape or same image size as the arbitrary facial images (e.g., the reconstructed image has the same number of columns and rows as the arbitrary image), at step 524.
  • step 526 an intermediate image between the facial expression image and the arbitrary facial image is predicted during the corresponding optical flow.
  • the images are interpolated to generate the corresponding optical flow in which to generate the fake fatigued-state video of the driver, at step 528.
  • step 530 a sequence of intermediate images is arranged in an input order, and the sequence of intermediate images is processed using an encoder to convert the sequence of intermediate images into an alternative representation, at step 532.
  • step 534 the alternative representation of the sequence of intermediate images is processed using a decoder to generate a target sequence of the sequence of intermediate images, where the target sequence includes multiple outputs arranged according to an output order.
  • FIG. 6 illustrates a computing system upon which embodiments of the disclosure may be implemented.
  • Computing system 600 may be programmed (e.g., via computer program code or instructions) to provide enhanced safety to drivers using driver fatigue (tiredness) detection as described herein and includes a communication mechanism such as a bus 610 for passing information between other internal and external components of the computer system 600.
  • the computer system 600 is system 106 of FIG. 1 B.
  • Computer system 600, or a portion thereof, constitutes a means for performing one or more steps for providing enhanced safety to drivers using the driver distraction (including driver fatigue) detection.
  • a bus 610 includes one or more parallel conductors of information so that information is transferred quickly among devices coupled to the bus 610.
  • One or more processors 602 for processing information are coupled with the bus 610.
  • One or more processors 602 performs a set of operations on information (or data) as specified by computer program code related to providing enhanced safety to drivers using driver distraction detection.
  • the computer program code is a set of instructions or statements providing instructions for the operation of the processor and/or the computer system to perform specified functions.
  • the code for example, may be written in a computer programming language that is compiled into a native instruction set of the processor. The code may also be written directly using the native instruction set (e.g., machine language).
  • the set of operations include bringing information in from the bus 610 and placing information on the bus 610.
  • Each operation of the set of operations that can be performed by the processor is represented to the processor by information called instructions, such as an operation code of one or more digits.
  • a sequence of operations to be executed by the processor 602, such as a sequence of operation codes, constitutes processor instructions, also called computer system instructions or, simply, computer instructions.
  • Computer system 600 also includes a memory 604 coupled to bus 610.
  • the memory 604 such as a random access memory (RAM) or any other dynamic storage device, stores information including processor instructions for providing enhanced safety to drivers using driver distraction detection. Dynamic memory allows information stored therein to be changed by the computer system 600. RAM allows a unit of information stored at a location called a memory address to be stored and retrieved independently of information at neighboring addresses.
  • the memory 604 is also used by the processor 602 to store temporary values during execution of processor instructions.
  • the computer system 600 also includes a read only memory (ROM) 606 or any other static storage device coupled to the bus 610 for storing static information. Also coupled to bus 610 is a non-volatile (persistent) storage device 608, such as a magnetic disk, optical disk or flash card, for storing information, including instructions.
  • information including instructions for providing enhanced safety to tired drivers using information processed by the aforementioned system and embodiments, is provided to the bus 610 for use by the processor from an external input device 612, such as a keyboard operated by a human user, a microphone, an Infrared (IR) remote control, a joystick, a game pad, a stylus pen, a touch screen, head mounted display or a sensor.
  • a sensor detects conditions in its vicinity and transforms those detections into physical expression compatible with the measurable phenomenon used to represent information in computer system 600.
  • Other external devices coupled to bus 610 used primarily for interacting with humans, include a display device 614 for presenting text or images, and a pointing device 616, such as a mouse, a trackball, cursor direction keys, or a motion sensor, for controlling a position of a small cursor image presented on the display 614 and issuing commands associated with graphical elements presented on the display 614, and one or more camera sensors 684 for capturing, recording and causing to store one or more still and/or moving images (e.g., videos, movies, etc.) which also may comprise audio recordings.
  • special purpose hardware, such as an application specific integrated circuit (ASIC) 620, is coupled to bus 610.
  • the special purpose hardware is configured to perform operations not performed by processor 602 quickly enough for special purposes.
  • Computer system 600 also includes a communications interface 670 coupled to bus 610.
  • The communications interface 670 provides a one-way or two-way communication coupling to a variety of external devices that operate with their own processors. In general, the coupling is with a network link 678 that is connected to a local network 680 to which a variety of external devices, such as a server or database, may be connected. Alternatively, link 678 may connect directly to an Internet service provider (ISP) 684 or to network 690, such as the Internet.
  • the network link 678 may be wired or wireless.
  • communication interface 670 may be a parallel port or a serial port or a universal serial bus (USB) port on a personal computer.
  • communications interface 670 is an integrated services digital network (ISDN) card or a digital subscriber line (DSL) card or a telephone modem that provides an information communication connection to a corresponding type of telephone line.
  • a communication interface 670 is a cable modem that converts signals on bus 610 into signals for a communication connection over a coaxial cable or into optical signals for a communication connection over a fiber optic cable.
  • communications interface 670 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN, such as Ethernet. Wireless links may also be implemented.
  • the communications interface 670 sends and/or receives electrical, acoustic or electromagnetic signals, including infrared and optical signals, which carry information streams, such as digital data.
  • the communications interface 670 includes a radio band electromagnetic transmitter and receiver called a radio transceiver.
  • the communications interface 670 enables connection to a communication network for providing enhanced safety to tired drivers using mobile devices, such as mobile phones or tablets.
  • Network link 678 typically provides information using transmission media through one or more networks to other devices that use or process the information.
  • network link 678 may provide a connection through local network 680 to a host computer 682 or to equipment 684 operated by an ISP.
  • ISP equipment 684 in turn provides data communication services through the public, world-wide packet-switching communication network of networks now commonly referred to as the Internet 690.
  • a computer called a server host 682 connected to the Internet hosts a process that provides a service in response to information received over the Internet.
  • server host 682 hosts a process that provides information representing video data for presentation at display 614.
  • the components of system 600 can be deployed in various configurations within other computer systems, e.g., host 682 and server 682.
  • At least some embodiments of the disclosure are related to the use of computer system 600 for implementing some or all of the techniques described herein. According to one embodiment of the disclosure, those techniques are performed by computer system 600 in response to processor 602 executing one or more sequences of one or more processor instructions contained in memory 604.
  • Such instructions, also called computer instructions, software and program code, may be read into memory 604 from another computer-readable medium such as storage device 608 or network link 678. Execution of the sequences of instructions contained in memory 604 causes processor 602 to perform one or more of the method steps described herein, as illustrated by the sketch following this list.
  • the computer-readable non-transitory media includes all types of computer readable media, including magnetic storage media, optical storage media, and solid state storage media and specifically excludes signals.
  • the software can be installed in and sold with the device. Alternatively, the software can be obtained and loaded into the device, including obtaining the software via a disc medium or from any manner of network or distribution system, including, for example, from a server owned by the software creator or from a server not owned but used by the software creator.
  • the software can be stored on a server for distribution over the Internet, for example.
  • Computer-readable storage media exclude propagated signals per se, can be accessed by a computer and/or processor(s), and include volatile and non-volatile internal and/or external media that is removable and/or non-removable.
  • the various types of storage media accommodate the storage of data in any suitable digital format. It should be appreciated by those skilled in the art that other types of computer readable media can be employed, such as zip drives, solid state drives, magnetic tape, flash memory cards, flash drives, cartridges, and the like, for storing computer executable instructions for performing the novel methods (acts) of the disclosed architecture.
  • each process associated with the disclosed technology may be performed continuously and by one or more computing devices.
  • Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.
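As a purely illustrative aid, the following minimal Python sketch shows one way that instructions read from a storage device or network link into memory could be executed by a processor to apply a trained fatigue classifier to frames captured by a camera sensor. The libraries used (OpenCV and PyTorch), the weights file name fatigue_classifier.pt, the label set, and the preprocessing steps are assumptions made for the sketch and are not specified in the disclosure.

```python
# Hypothetical sketch only: it assumes a fatigue classifier saved with
# torch.save(model, ...) and a camera reachable through OpenCV; neither the
# file name nor the label set comes from the disclosure.
import cv2
import torch

FATIGUE_LABELS = ["alert", "mildly tired", "drowsy"]  # assumed label set


def load_model(weights_path: str) -> torch.nn.Module:
    """Read a trained classifier from non-volatile storage into memory."""
    model = torch.load(weights_path, map_location="cpu", weights_only=False)
    model.eval()
    return model


def classify_frame(model: torch.nn.Module, frame) -> str:
    """Run one captured frame through the model and return a fatigue label."""
    resized = cv2.resize(frame, (224, 224))
    tensor = torch.from_numpy(resized).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        logits = model(tensor)
    return FATIGUE_LABELS[int(logits.argmax(dim=1))]


if __name__ == "__main__":
    model = load_model("fatigue_classifier.pt")  # hypothetical weights file
    capture = cv2.VideoCapture(0)                # camera sensor feed
    ok, frame = capture.read()
    if ok:
        print("Estimated driver state:", classify_frame(model, frame))
    capture.release()
```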

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
EP19828454.9A 2019-12-05 2019-12-05 System und verfahren zur erzeugung eines videodatensatzes mit variierenden ermüdungsstufen durch übertragungslernen Pending EP4042318A1 (de)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2019/064694 WO2020226696A1 (en) 2019-12-05 2019-12-05 System and method of generating a video dataset with varying fatigue levels by transfer learning

Publications (1)

Publication Number Publication Date
EP4042318A1 true EP4042318A1 (de) 2022-08-17

Family

ID=69024680

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19828454.9A Pending EP4042318A1 (de) 2019-12-05 2019-12-05 System und verfahren zur erzeugung eines videodatensatzes mit variierenden ermüdungsstufen durch übertragungslernen

Country Status (3)

Country Link
EP (1) EP4042318A1 (de)
CN (1) CN114303177A (de)
WO (1) WO2020226696A1 (de)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686103B (zh) * 2020-12-17 2024-04-26 浙江省交通投资集团有限公司智慧交通研究分公司 A vehicle-road cooperative fatigue driving monitoring system
CN112617835B (zh) * 2020-12-17 2022-12-13 南京邮电大学 A multi-feature fusion fatigue detection method based on transfer learning
CN112258428A (zh) * 2020-12-21 2021-01-22 四川圣点世纪科技有限公司 A CycleGAN-based finger vein enhancement method and apparatus
CN112884030B (zh) * 2021-02-04 2022-05-06 重庆邮电大学 A multi-view classification system and method based on cross reconstruction
EP4334884A1 (de) * 2021-05-05 2024-03-13 Seeing Machines Limited Systems and methods for detecting use of a mobile device by a vehicle driver
CN113239834B (zh) * 2021-05-20 2022-07-15 中国科学技术大学 A sign language recognition system with pre-trainable hand model perceptual representation
US11922320B2 (en) 2021-06-09 2024-03-05 Ford Global Technologies, Llc Neural network for object detection and tracking
CN113542271B (zh) * 2021-07-14 2022-07-26 西安电子科技大学 Network background traffic generation method based on generative adversarial network (GAN)
CN114403878B (zh) * 2022-01-20 2023-05-02 南通理工学院 A deep-learning-based method for detecting fatigue from speech
CN115439836B (zh) * 2022-11-09 2023-02-07 成都工业职业技术学院 A computer-based healthy driving assistance method and system
CN117975543A (zh) * 2024-04-01 2024-05-03 武汉烽火信息集成技术有限公司 A blockchain zero-knowledge identity authentication credential interaction method based on optical-flow facial expressions

Also Published As

Publication number Publication date
WO2020226696A1 (en) 2020-11-12
CN114303177A (zh) 2022-04-08

Similar Documents

Publication Publication Date Title
EP4042318A1 (de) System und verfahren zur erzeugung eines videodatensatzes mit variierenden ermüdungsstufen durch übertragungslernen
JP7011578B2 (ja) 運転行動を監視する方法及びシステム
KR102470680B1 (ko) 동작 인식, 운전 동작 분석 방법 및 장치, 전자 기기
CN108725440B (zh) 前向碰撞控制方法和装置、电子设备、程序和介质
US10387725B2 (en) System and methodologies for occupant monitoring utilizing digital neuromorphic (NM) data and fovea tracking
Kopuklu et al. Driver anomaly detection: A dataset and contrastive learning approach
US11816585B2 (en) Machine learning models operating at different frequencies for autonomous vehicles
US20180121733A1 (en) Reducing computational overhead via predictions of subjective quality of automated image sequence processing
US11249557B2 (en) Methods and systems for controlling a device using hand gestures in multi-user environment
CN114026611A (zh) 使用热图检测驾驶员注意力
US9881221B2 (en) Method and system for estimating gaze direction of vehicle drivers
CN111566612A (zh) 基于姿势和视线的视觉数据采集系统
Rangesh et al. Driver gaze estimation in the real world: Overcoming the eyeglass challenge
US20220058407A1 (en) Neural Network For Head Pose And Gaze Estimation Using Photorealistic Synthetic Data
US11099396B2 (en) Depth map re-projection based on image and pose changes
US11367355B2 (en) Contextual event awareness via risk analysis and notification delivery system
US20230334907A1 (en) Emotion Detection
US20230098829A1 (en) Image Processing System for Extending a Range for Image Analytics
Qu et al. Comprehensive study of driver behavior monitoring systems using computer vision and machine learning techniques
US20220153278A1 (en) Cognitive Heat Map: A Model for Driver Situational Awareness
US20210279506A1 (en) Systems, methods, and devices for head pose determination
Shariff et al. Event Cameras in Automotive Sensing: A Review
US11176368B2 (en) Visually focused first-person neural network interpretation
US20240104686A1 (en) Low-Latency Video Matting
Narwal et al. Image Systems and Visualizations

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220513

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS