CA2986160A1 - Training of vehicles to improve autonomous capabilities - Google Patents

Training of vehicles to improve autonomous capabilities

Info

Publication number
CA2986160A1
Authority
CA
Canada
Prior art keywords
driver
eye
human
road
event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA2986160A
Other languages
French (fr)
Inventor
Ashok Krishnan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CA2986160A priority Critical patent/CA2986160A1/en
Publication of CA2986160A1 publication Critical patent/CA2986160A1/en
Abandoned legal-status Critical Current

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B3/00 Apparatus for testing the eyes; Instruments for examining the eyes
    • A61B3/10 Objective types, i.e. instruments for examining the eyes independent of the patients' perceptions or reactions
    • A61B3/113 Objective types, i.e. instruments for examining the eyes independent of the patients' perceptions or reactions for determining or recording eye movement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/59 Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/19 Sensors therefor
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 Program-control systems
    • G05B2219/30 Nc systems
    • G05B2219/35 Nc in input of data, input till input file format
    • G05B2219/35503 Eye tracking associated with head mounted display to detect eye position
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Ophthalmology & Optometry (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Eye Examination Apparatus (AREA)

Abstract

Systems and methods to improve performance, reliability, learning and safety, and thereby enhance the autonomy of vehicles. Human sensors are used to capture human eye movement, hearing, hand grip and contact area, and foot positions. Event signatures corresponding to human actions, reactions and responses are extracted from these sensor values and correlated to events, status and situations acquired using vehicle and outside-environment sensors. These event signatures are then used to train vehicles to improve their autonomous capabilities. Human sensors are vehicle mounted or frame mounted. Signatures obtained from events are classified and stored.

Description

TRAINING OF VEHICLES TO IMPROVE AUTONOMOUS CAPABILITIES
BACKGROUND
[01] Autonomous vehicles (AV) are expected to eventually replace much of the traditional human operation of vehicles. The task of automation is greatly aided by the exponential growth of computing capabilities, including hardware and software. Lidar, radar, infrared and ultrasound sources are being deployed in test vehicles to improve their autonomy.
However, vehicles that are truly fully autonomous have not yet been developed.
BRIEF SUMMARY
[02] With the progression of time, and as computing power and artificial intelligence make inroads into automation, what was once a task performed using simple sensors has grown much more efficient and versatile. However, mimicking human driving behavior and improving upon existing systems would go a long way toward large-scale adoption of fully autonomous vehicles.
Current systems are limited in their ability to operate in real-world scenarios. To perform more human-like tasks, it is essential to understand how humans drive, and to do this it is essential to capture events during a variety of scenarios while a human is operating a vehicle. Operating a vehicle is a complicated process, and vision is one of the key human senses used for it. Using vision sensing data helps automation become more human-like. Vision will become the primary differentiator from the previous generation of vehicles with autonomous functionality.
[03] Learning from human behavior by incorporating human sensors to capture human actions during driving is advantageous. In a first embodiment, human eye movements during driving are used to train vehicles to be more autonomous. Other human sensors gather data from binaural microphones and capture hand data (grip and contact area on the steering wheel) and foot position, which are used to train vehicles to improve their autonomous functionality.
Eye movement is captured through cameras and smartphones located on human-mounted frames or on the dashpad of the vehicle. Illumination sources include IR illuminators, light from phone screens, and ambient light. Hand grip and contact area on the steering wheel are captured using a sensor array that reads contact points and their positions as well as the grip force on each sensor.
[04] When examples and embodiments are described as relating to vehicles, these vehicles can include land, air, space and water vehicles, including wheeled, tracked, railed or skied vehicles.
Vehicles can be human-occupied or not, powered by any means, and can be used for conveyance, leisure, entertainment, exploration, mapping, recreation, rescue, delivery, fetching, and provision of services, messenger services, communication, transportation, mining, safety, or armed forces.
[05] Vehicles can be operated in a range of modes. They can be fully under the control of a human, which is the case for non-autonomous vehicles; they can be fully autonomous, without the need for human intervention, assistance or oversight; or they can fall anywhere in between, a range broadly termed semi-autonomous. Non-autonomous vehicles require a human operator, whereas fully autonomous versions do not. All examples of vehicles appearing in this disclosure are automatic drive, that is, they have no clutch pedal, just accelerator and brake pedals that are both operated by the same foot.
BRIEF DESCRIPTION OF DRAWINGS
[06] Fig 1 is an example of a prior art non-autonomous vehicle.
[07] Fig 2 is an example of a semi-autonomous vehicle.
[08] Fig. 3 is an example of a fully autonomous vehicle.
[09] Figs 4a, 4b show different views of a trial autonomous vehicle with traditional sensors.
[010] Figs 5a, 5a1 show parts of a human eye.
[011] Fig 5b shows the axes of the eye.
[012] Fig 5c shows different types of reflections from an eye.
[013] Fig 5d shows the visual field of an eye.
[014] Fig 6 shows the image of an eye illuminated by an IR source.
[015] Figs 7a, 7b show a binaural recording dummy head.
[016] Fig 7c shows the placement of microphones inside the dummy head.
[017] Fig 7d shows a variation of the binaural device that is truncated above and below the ears.
[018] Figs 8a-8c shows examples of placement of binaural microphone devices.
[019] Figs 9a-9e show hand contact area and grip sensing devices and their details.
[020] Figs 10a-10h show foot position sensing concepts.
[021] Figs 11a-c show the inside of a car with various arrangements of cameras and IR illumination sources.
[022] Fig 12a shows the inside of a car with a single phone camera.
[023] Fig 12b shows the image acquired by the setup of fig 12a.
[024] Fig 12c shows the inside of a car with two phone cameras.
[025] Fig 12d shows the inside of a car with a single phone camera aimed at a windshield-mounted patch.
[026] Fig 12e shows the inside of a car with a single phone camera having illuminator patterns on its screen.
[027] Fig 12f shows the inside of a car with two phone cameras having illuminator patterns on their screens.
[028] Figs 13a-d show details of an embodiment of a phone camera imaging adapter.
[029] Figs 14a, 14a1, 14a2, 14b, 14b1, 14c, 14c1 show various arrangements of frame mounted eye and sound imaging systems.
[030] Fig 15a shows a scenario of eye movement data being used to train a vehicle to improve its autonomy. Fig 15b shows eye movement data of fig 15a.
[031] Fig 16 shows a scenario on a road with an approaching ambulance.
[032] Fig 17 shows a scenario on a road with a child at the edge of the road.
[033] Figs 18a-b show a scenario on a road with a long truck that is not slowing down.
[034] Fig 19a shows a scenario on a road with a maintenance vehicle on the road.
[035] Figs 19b, 19b1, 19b2, 19b3 show a scenario on a road with a child on a bicycle at the edge of the road.
[036] Fig 19c shows a scenario on a road with a ball rolling onto the road.
[037] Figs 19d1-19d3 show a scenario on a road with a kangaroo crossing the road.
[038] Fig 19e shows a scenario on a road with a dog on a leash by the edge of the road.
[039] Fig 19f shows a scenario on a road with a dog on a stretched leash by the edge of the road.
[040] Figs 19e1, 19f1 show corresponding eye movements overlaid on figs 19e and 19f.
[041] Figs 19e1a, 19f1a show the eye movement overlays separately, with added fixation details.
[042] Fig 20 shows a prior art autonomous vehicle control system.
[043] Fig 21 shows an autonomous vehicle control system incorporating human sensors.
[044] Fig 22 shows different types of sensors used in a data capturing vehicle.
[045] Fig 23 shows events recorded by human sensors.
[046] Fig 24 shows the event identifying module.
[047] Fig 25 shows a signature categorization scheme.
[048] Fig 26a shows a human event occurrence detection scheme.
[049] Fig 26b shows an event extraction, categorization, map update and training software update scheme.
[050] Fig 1 is an example of a non-autonomous vehicle. Here, a human operator is driving a car. The steering wheel is on the right hand side (right hand drive - RHD), and the traffic pattern is left hand traffic (LHT). The human driver has control of all functions, including steering, braking, acceleration, signaling (turn indicators, emergency indicator), lights (high and low beams), windshield wipers, and vehicle atmospheric control (ventilation, heating, cooling, humidity control, defrosting). The car can have features like cruise control and an anti-lock braking system (ABS), but these are not considered to contribute to vehicle autonomy.
[051] Fig 2 is an example of a semi-autonomous vehicle. Here, a human occupant is sitting in a RHD car in front of the steering wheel. The car's Autonomous Control System (ACS) has control of most functions, including steering, braking, acceleration, signaling (turn indicators, emergency indicator), low beam headlight, windshield wipers. The occupant's intervention, assistance, or oversight is only required in certain situations, for example, in unfamiliar, unexpected, unmapped, abnormal, emergency or malfunctioning situations, or when a potentially dangerous or illegal situation might arise.
[052] Fig 3 is an example of a fully autonomous vehicle. Here, human occupants are sitting in a car. There is no visible steering wheel. The car's Autonomous Control System (ACS) has control of all functions, including steering, braking, acceleration, signaling (turn indicators, emergency indicator), low/high beam headlights, windshield wipers, and defroster. The occupants' intervention is limited to emergency situations, wherein an emergency alert can be sent, or the car can be made to perform a subroutine like slowing down and stopping at the nearest safe location. Such situations can include, for example, abnormal, emergency or malfunctioning situations. Such emergency maneuvers can be performed automatically, for example by pressing a button or choosing from a list in a menu. A normally stowed steering wheel can optionally be accessible. Driving skills are not required for most of these procedures or maneuvers.
[053] Figs 4a, 4b show different views of a trial autonomous vehicle with traditional sensors.
Lidar 401 uses pulsed laser light (of infrared wavelengths) to illuminate a scene and measures the reflected light pulses to create a 3D representation of the scene. The front camera array 402 can have one or more cameras. In the example shown, there are three cameras in the front camera array, each camera (operating in the visible wavelengths) having an approximately 60 degree horizontal field of view (fov), for a total of 180 degrees of coverage. The array can optionally have an IR camera with a wideangle lens. The side camera arrays can each have a single visible wavelength camera with a 60 degree horizontal fov, and can additionally have an IR camera. The side cameras can be rotated about 30 degrees to avoid overlap with the front cameras. The back camera array can have a single visible wavelength camera with a 60 degree horizontal fov, and can additionally have an IR camera. In essence, the vehicle can have 360 degree horizontal coverage in the visible wavelengths using 6 cameras. However, this arrangement can be varied. For example, the front array can be made to have 3 cameras, with the middle one having a 30 degree fov and a 60 degree fov camera on either side, and wideangle cameras on the sides and back so that, together, all the cameras provide a 360 degree fov. For acquiring stereo images, the camera counts can be doubled, with the cameras placed appropriately. The vehicle can include long range (405), medium range (406a, 406b - not visible) and short range (407a, 407b - not visible, 408) radar systems. They map information from nearby and far objects (for example, up to 200 meters away) related to the objects' velocity, size and distance. Ultra wideband radar systems can also be used. Ultrasonic sensors (404a and 404b - the latter not visible, but on the rear left wheel) sense the position of nearby objects.
[054] Since the human eye is one of the most used, useful, versatile and powerful sensors, a discussion of the eye relevant to this disclosure is provided. Figures 5a, 5a1 show details of a human eye. The outer part of the eye includes three concentric portions:
the cornea (501), the iris (502), and the sclera (503). The border of the cornea with the sclera is the limbus. The iris controls the diameter and size of the pupil (504) and determines the color of the eyes.
Pupil diameter is adjustable and controls the amount of light passing through it into the lens (504a). Pupillary constriction is thrice as fast as dilation. The retina (505) is the light sensing part of the eye, and has photoreceptor cells, of which cones comprise 6% and rods 94%. Rods and cones in the retina convert light falling on them into electrical signals, which are then sent through the optic nerve to the visual cortex in the brain for processing. The blind spot is the retinal area to which the optic nerve attaches, and has no photoreceptors.
[055] Unlike rods, cones provide color vision. Rods have low spatial acuity but are better at scotopic vision (imaging at low light levels), while cones provide photopic vision with high spatial acuity. The macula (506) is an oval-shaped pigmented yellow spot near the retinal center and contains the fovea (507). The fovea is a small 1.5 mm diameter pit that contains the largest concentration of cone cells and is responsible for central, high resolution vision. Eye movements help images of objects we want to see fall on the fovea. About 25% of the visual cortex processes the central 2.5 degrees of the visual scene, and this representation falls off with eccentricity away from the fovea centralis. The fovea is rod-free, with a very high density of cones, which falls off rapidly away from the fovea and then levels off. At about 15-20 degrees from the fovea, the density of rods reaches a maximum.
[056] The medial commissure (508) and lateral commissure (509) are the inner and outer corners where the eyelids join. The palpebral fissure is the opening between the eyelids. Canthal or commissural tilts are the angles between the lateral and medial commissures, with positive angles associated with the lateral aspect being above the medial. The lacrimal caruncle (510) appears lateral to the medial commissure.
[057] Eye movements alter the three-dimensional orientation of the eye inside the head and are controlled by three pairs of muscles to cause horizontal (yaw), vertical (pitch), and torsional (roll) eye movements. Eye orientation uniquely decides gaze direction. Figure 5a1 shows two such muscles: the superior oblique muscle (511) and the inferior rectus muscle (512).
[058] Fig 5b shows the axes of the eye. Illumination along the optical axis (on-axis illumination) will cause light to be retroreflected from the retina, causing the pupil to appear brighter than the surrounding iris, similar to red-eye in flash photography; this is called the bright-pupil effect.
[059] Fig 5c shows different types of reflections from an eye which is illuminated by a light source. Light entering the eye is refracted and partially reflected at various layers. Reflection occurs at the outer corneal surface (called the first Purkinje: Pl, this is the brightest), inner corneal surface (second Purkinje: P2), outer lens surface (third Purkinje: P3) and inner lens surface (fourth Purkinje: P4).
[060] When looking at a person's eyes, the reflection we see on the eye is from the cornea (P1).
When imaging with a camera, infrared light can be used to illuminate the eye so that this IR light returning from the eye is selectively imaged, while the visible spectrum is muted or discarded.
Corneal reflection P1 of the illumination source appears as a spot. Iris reflection is dark (but has color information). The pupil commonly appears dark in the eye image when using off-axis illumination. In this case, light reflected from the retina is not imaged by the camera and therefore the pupil appears as a dark circle against the surrounding iris.
This arrangement is more tolerant of pupil diameter variation than bright-pupil imaging.
[061] However, retinal retroreflection has strong direction dependence and can be bright at angles closer to normal causing the pupil to be bright. In this disclosure, unless otherwise specified, both the first Purkinje P1 (corneal reflection) and the pupil are detected and used for analysis of eye movements, and dark pupil imaging is used.
[062] When using pupil-corneal reflection systems, calculation of the pupil center can be skewed by descending eyelids, downward-pointing eyelashes, and the use of mascara. To alleviate these issues, algorithms can work with the following assumptions: both the iris and pupil are roughly ellipsoidal; the pupil is centered inside the iris; and the pupil is darker than the iris, which, in turn, is darker than the sclera.
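By way of illustration only (this routine is not recited in the disclosure), the following minimal sketch segments a dark pupil and the corneal reflection under the assumptions just listed, using OpenCV (version 4 return convention assumed); the threshold values are hypothetical placeholders.

```python
# Illustrative sketch, not the disclosure's algorithm: dark-pupil segmentation
# assuming the pupil is darker than the iris, the iris darker than the sclera,
# and both roughly elliptical. Thresholds are placeholders.
import cv2
import numpy as np

def find_pupil_and_glint(eye_gray, pupil_thresh=40, glint_thresh=220):
    """Return (pupil_center, glint_center) in image coordinates; None if not found."""
    # Dark pupil: pixels well below typical iris intensity.
    _, dark = cv2.threshold(eye_gray, pupil_thresh, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(dark, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    pupil = None
    if contours:
        largest = max(contours, key=cv2.contourArea)
        if len(largest) >= 5:                        # cv2.fitEllipse needs 5+ points
            (cx, cy), _axes, _angle = cv2.fitEllipse(largest)
            pupil = (cx, cy)

    # Corneal reflection (first Purkinje image, P1): a small, very bright spot.
    _, bright = cv2.threshold(eye_gray, glint_thresh, 255, cv2.THRESH_BINARY)
    ys, xs = np.nonzero(bright)
    glint = (float(xs.mean()), float(ys.mean())) if xs.size else None
    return pupil, glint
```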
[063] Fig 5d shows a diagram of the visual field including the fovea, parafovea and peripheral vision regions with an exemplary degree of the visual field that the regions can see. The fovea occupies 1.5 degrees of visual field, and provides the sharpest vision; the parafovea previews foveal information, and the peripheral vision reacts to flashing objects and sudden movements.
Peripheral vision has approximately 15-50% of the acuity of the fovea and is also less color-sensitive. In the human eye, the three vision field regions are asymmetric. For example, in reading, the perceptual span (size of effective vision) is 3-4 letter spaces to the left of fixation and 14-15 letter spaces to the right. One degree of visual angle is roughly equal to 3-4 letter spaces at normal reading distances. When fixated on a scene, the eyes are oriented so that the center of the image of the scene falls on the center of the fovea; the corresponding point in the scene is called the point of gaze (POG).
The intersection of the visual axis with the scene can also be the POG.
[064] Eyes move during a majority of the time when awake. The different types of eye movements related to this disclosure are smooth pursuit, tremor, drift, saccades, glissades, and microsaccades. When looking at a scene, human eyes move around rather than remaining fixed in one position. This movement locates regions of interest (ROI) in the scene to help the brain create a multi-dimensional map. For example, when looking at a (two-dimensional) photograph, the eyes make jerky but fast movements called saccades, and momentarily stop at several points called fixations. When looking at a scene on the path of travel, for example a crowded city road, a three-dimensional map is created. Monocular eye movements are called ductions.
Movement nasally is adduction, temporal movement is abduction, elevation is sursumduction (or supraduction), depression is deorsumduction (infraduction), incycloduction (intorsion) is nasal rotation of the vertical meridian, excycloduction (extorsion) is temporal rotation of the vertical meridian.
[065] Binocular eye movements, wherein the two eyes move in the same direction, are called conjugate movements or versions. Dextroversion is movement of both eyes to the right, levoversion is movement of both eyes to the left, sursumversion (supraversion) is elevation of both eyes, deorsumversion (infraversion) is depression of both eyes.
[066] Depth perception (stereopsis) is extracted from binocular disparity (disjugacy), wherein the difference in image location of an object seen by the right and left eyes is caused by the horizontal separation (parallax) between the eyes. Vergences are simultaneous movements of both the left and right eyes in opposite directions (which can be converging or diverging) to provide single binocular vision. These disconjugate movements prevent double vision (diplopia) when a foveated object moves in space, for example, from a far distance to closer to the eyes.
When moving left to right, a temporal non-synchrony can occur, wherein the abducting eye moves faster and longer than the adducting eye, with this misalignment being corrected at the end of a saccade through glissades and drift. Most humans have a dominant eye, which may be directed in a different direction than the passive eye.
EYE MOVEMENT EVENTS
[067] Fixation is when the eye temporarily stops at a location while scanning a scene. Fixations allow re-positioning of the fovea over ROIs to acquire and compose higher resolution information in conjunction with the nervous visual processing system. The range for fixation durations is 0.1 to 1 second, typically 200-600 ms. The typical fixation frequency is less than 3 Hz. Fixations are not complete stillness, but can include three micro-movements: tremors, microsaccades (to quickly bring the eye back to its original position), and drifts (slow movements taking the eye away from the fixation center), or very low gaze velocities (below 10-50 degrees per second).
[068] Saccades are rapid movements of the eye between fixation points; they are events where the eyes move fast and ballistically, with durations in the range of 20-100 milliseconds, during which period we are effectively blind. Saccadic velocities can be in the 20-500 degrees per second range, with some peak velocities of up to 900 degrees/second. Saccades are rarely a straight line between two points; they take several shapes and curvatures. The end of a saccade is not abrupt: the eye wobbles before stopping. This post-saccadic movement is called a glissade; glissades do not appear at the beginning of a saccade and are used to realign the eyes before a steady fixation. This settling is similar to a precision motion-controlled closed-loop stage settling at its destination, leading to positional "ringing".
[069] The time between a stimulus and the start of a saccade is the saccadic latency, which varies depending on the saccadic amplitude that follows, and is usually in the range of 100-350 ms.
For a 5-10 degree saccade, the latency can be 200 milliseconds. Refractory periods between saccades can be built into saccadic latencies or identified as distinct periods in cases of a very short or absent inter-saccadic fixation, for example, when another saccade is required to be performed immediately following a saccade. Additional requirements can be set in the software interface, for example, clear peaks or a maximum velocity.
[070] Smooth pursuits are slow motions of the eye as it follows a moving target, for example, an airplane in the sky. During smooth pursuit, the gaze position can lag the target, and the eye makes catch-up saccades to re-foveate the target. Overshoots are corrected using back-up saccades, while leading saccades are anticipatory saccades. Velocities of smooth pursuit increase with straighter paths.
[071] Square-wave jerks are conjugate saccadic intrusions in the eye movement while tracking a target; they cause the eye to lose tracking position and then restore it.
They consist of pairs of small saccades in opposite directions which are separated by saccadic latency.
[072] Dwell has a specific meaning in this disclosure: it is the time spent in one region of interest (ROI). Dwells have starting and ending points, durations and dispersions, but are different from fixations because their temporal and spatial extents are larger than those of fixations.
Transitions are gaze shifts used to move from one ROI to another. In one-way transitions, and unlike two-way transitions, gaze does not return to the same ROI right away.
Gaze revisits occur when gaze returns to the same ROI, but after other transitions have occurred in between.
Attention maps show the spatial distribution of data. An example is a dwell map, which is a pictorial illustration of all ROIs with a dwell time over a threshold. While viewing a dynamically changing scene, for example when driving along a crowded city street, the ROIs are dynamically changing. The attention map of the traversed path will therefore be dynamic, indicating changing heat and dwell times.
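As an illustrative aside (not part of the disclosure), a dwell map of the kind described above could be accumulated from a gaze stream that has already been mapped to ROI labels; the ROI names, the threshold value and the sample timings below are hypothetical.

```python
# Minimal sketch: accumulate total dwell time per ROI from time-stamped gaze
# samples, then keep only ROIs whose dwell exceeds a display threshold.
from collections import defaultdict

def dwell_map(samples, threshold_s=0.3):
    """samples: list of (timestamp_seconds, roi_label or None), in time order."""
    totals = defaultdict(float)
    for (t0, roi), (t1, _next_roi) in zip(samples, samples[1:]):
        if roi is not None:
            totals[roi] += t1 - t0            # attribute the interval to the current ROI
    return {roi: t for roi, t in totals.items() if t >= threshold_s}

# Hypothetical example: gaze alternating between a child near the kerb and the road ahead.
stream = [(0.00, "road_ahead"), (0.40, "child_kerb"), (0.90, "road_ahead"),
          (1.20, "child_kerb"), (1.80, "road_ahead"), (2.50, None)]
print(dwell_map(stream))   # approximately {'road_ahead': 1.4, 'child_kerb': 1.1}
```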
[073] Tremor has a specific meaning in this disclosure: it is a fixational eye movement with an amplitude of less than 1 degree and peak velocities of around 20 seconds of arc per second.
[074] Blinks are events surrounding the time period when the eyelid is closed, and can be algorithmically defined as loss of positional signal for a threshold duration in combination with eye movement distance data loss, for example, 50-100 ms over 10-20 degrees.
During blinks, amongst much of the population, the descending upper eyelid covers most of the cornea. Blink durations increase with drowsiness, alcohol levels, schizophrenia and similar disorders. In this disclosure, blinks are not considered a part of eye movements, unlike saccades, glissades, microsaccades, tremors, dwells, smooth pursuit, and square-wave jerks.
Although blinks are recorded, they are used to determine the cause of discontinuity or anomalies in data that are not explainable by eye movements. To reiterate, blinks do not form a part of eye movements in this disclosure.
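For illustration, blink candidates of the kind described in this paragraph could be flagged as runs of lost gaze signal exceeding a duration threshold. This is a minimal sketch, assuming fixed-rate samples with NaN marking tracking loss; the 75 ms default is an arbitrary choice from the 50-100 ms range given above.

```python
# Illustrative sketch: flag runs of lost gaze signal as blink candidates.
import math

def detect_blinks(gaze_x, sample_rate_hz, min_loss_ms=75):
    """gaze_x: list of horizontal gaze samples, NaN where tracking was lost."""
    min_samples = max(1, int(min_loss_ms * sample_rate_hz / 1000))
    blinks, run_start = [], None
    for i, x in enumerate(gaze_x + [0.0]):        # 0.0 sentinel closes a trailing run
        if math.isnan(x):
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_samples:
                blinks.append((run_start, i))     # [start, end) sample indices
            run_start = None
    return blinks
```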
[075] Eye-in-head fixations occur when the eye is not moving relative to its socket, for example when the head is moving along with the stimulus. Eye-on-stimulus fixations occur when the eye is fixated on the stimulus, but moves inside its socket to track it as well as compensate for head motion. In normal driving situations, the head is free to move, and therefore both the eye and head move when tracking objects at a high angle away from the median plane of the subject. The median plane is considered to be the same as the central plane of the steering wheel.
[076] Fig 6 shows a screenshot of an eye movement tracker. The eye is illuminated by an IR
source, and the acquired image has a dark pupil. The cornea (601) and pupil (602) have been identified, along with the corneal reflection (601a). Crosshairs through the corneal reflection center (601b) and the pupillary center (602a) are overlaid by the image analysis system.
[077] Eye trackers are made by several companies, including SMI, Gazpet, iMotions, Tobii, ASL, SR Research, SmartEye and Seeing Machines. Binaural microphones are made by 3Dio, Roland and others.
[078] Mounting of eye trackers can be on the subject's head, on a tower on which the subject's head is resting, or remote from the subject. A combination of mounting schemes can also be used when required. For example, a configuration can have remote cameras and illumination, but head-mounted inertial measurement units (IMUs) to detect head position in space.
Or, another configuration can have dashboard/dashpad mounted illumination combined with head-mounted cameras. Apart from cameras used to image the eyes, eye tracker units can be combined with scene tracker cameras that capture the scene being viewed along or parallel to the line of sight.
These trackers can be mounted on the frame, dashpad or on the outside of the vehicle. Images from the eye and scene trackers can be combined to produce gaze-overlaid images. Furthermore, using head/facial feature detection, head tracking cameras can also be added to these systems.
[079] Most commercial eye trackers have options to adjust camera and illuminator positioning (linear and angular). Cameras can be automatic or manual focusing or require no focus adjustments. Automatic adjustment of linear and angular camera positions can additionally be carried out using feedback from the eye tracker's image analysis system. Eye movement measures can include direction, amplitude, time duration, velocity, acceleration, and time differential of acceleration. Tracing of a subject's eye movements spatially and temporally provides the scanpath events and representations, including saccades and fixations.
[080] In non head-mounted eye tracking systems, extreme gaze angles will cause precision and accuracy deterioration in eye trackers, particularly when combined with head rotation. Multiple cameras and illuminators positioned appropriately can overcome such issues.
CALIBRATION
[081] Eyes vary widely within the population, and also from the ideal model, because of non-uniform shapes of the eye's components (like cornea and lens) between individuals. Variation between the two eyes of the same individual is also common. Saccadic amplitudes vary within a population for the same scene or task, and also vary between the two eyes of the same subject.
All of these variations can occur within the "normal" population, or can be caused by abnormalities.
[082] Identifying and accounting for these variations will help deliver better eye-movement data. A discussion of variations and abnormalities follows, which can be used for calibration purposes whenever needed. Calibration can be carried out before the start of a path before the vehicle starts moving, or partway along a path, or at the end of it. Calibration can be instructive or interactive. For example, the driver can be prompted to look straight ahead, then look at the side view mirrors, then the rearview mirror, then look ahead but into the sky (for infinity, the least focusing power of the eye's lens). Calibration can provide examples of specific pupil and corneal reflection relations to the tracker. Initial calibration of each individual's left and/or right eye can provide offset factors or equations for compensation when using a global equation based on the assumption of an ideal eye. For those wearing glasses, calibration can be made with and without glasses. Drivers can be instructed to close one eye at a time while performing calibrations. This can detect abnormalities as well as the dominant eye. Calibrations using four gaze positions can account for several non-ideal eye conditions. Lateral and medial commissures, the lacrimal caruncle, and canthal tilts can also be identified during calibration, some of which can be used as landmarks or to account for head/camera tilts. Visible light sources like red laser LEDs can be used to calibrate eye movement as well as autorefractors.
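One common calibration approach, offered here only as a hedged sketch and not as the method of this disclosure, is to fit a mapping from the pupil-minus-corneal-reflection vector to known gaze targets (for example, the four gaze positions mentioned above) by least squares; the function names below are hypothetical.

```python
# Illustrative sketch: bilinear least-squares mapping from pupil-CR vectors to
# known target positions, solvable from four or more calibration targets.
import numpy as np

def _features(v):
    x, y = v
    return [1.0, x, y, x * y]

def fit_calibration(pupil_cr_vectors, target_points):
    """Both arguments: lists of (x, y) pairs collected while the driver fixates
    known targets (side mirrors, rearview mirror, straight ahead, etc.)."""
    A = np.array([_features(v) for v in pupil_cr_vectors])
    B = np.array(target_points, dtype=float)
    coeffs, *_ = np.linalg.lstsq(A, B, rcond=None)   # least-squares fit, shape (4, 2)
    return coeffs

def gaze_from_vector(coeffs, pupil_cr_vector):
    # Apply the fitted mapping to a new pupil-CR vector.
    return np.array(_features(pupil_cr_vector)) @ coeffs
```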
[083] Drugs, alcohol, mental and physical disorders, and age (very young children and very old people) all affect eye movement. Data acquired from such subjects can be adjusted or eliminated by identifying them as outliers. A similar situation arises with contact lenses, thick or complex lensed spectacles, heavy mascara, drooping eyelids (ptosis), squinting (due to glare or laughter, for example), teary eyes and subjects with prominent epicanthic folds. If such subjects are a large subset of the general population, eliminating them can provide data that is not truly representative. When such subgroups become a significant proportion of the data population, hardware and/or software settings can be altered to utilize their data without discarding them as outliers. Pupil size changes with the amount of ambient light, drugs, cognitive load, emotional state, fatigue and age. In subjects with anisocoria, pupillary sizes, including during dilation and constriction (mydriasis and miosis), can be different between the two eyes.
The consensual response is the normal phenomenon wherein both pupils constrict or dilate even when one eye is closed. Pupil size is sensitive to the angular displacement between the camera and the eye being imaged.
[084] Crossed-eye (strabismus) is present in varying degrees in about 4% of the population, and can be esotropic (nasally convergent) or exotropic (divergent). Strabismus can be comitant (present in all directions of gaze) or incomitant (varies with varying directions of gaze), or hypertropic (vertically misaligned).
[085] Eye trackers can have biases, noise and other statistical anomalies that are inherent to their software, hardware and optical system. Using eye trackers in moving vehicles can compound this due to vibration, thermal cycling and other non-laboratory environments/non-ideal settings. Using artificial eyes fixed in position can help detect and account for these issues when analyzing acquired data (for example, by using filters and offsets), and thereby improve accuracy, precision and confidence. Averaging data from the two eyes of the same subject can substantially improve precision. However, this comes at a cost, for example, in terms of loss of information related to vergences. Filtering and de-noising functions can be used to overcome these issues.
[086] If the head is constrained from moving and only the eyes move within their sockets, a single camera and a single infrared source can be used for detecting eye movements.
Multiple cameras and IR sources can be used for better results if head movement is allowed.
Sampling frequencies (frames per second) of currently available lower-end cameras start at 50 Hz, and the higher the sampling rate, the better the quality of results, particularly when analyzing saccades, glissades, tremors and microsaccades.
[087] In commercially available software, parameter settings are used to identify specific events and separate them. These parameters include sensitivity settings for each category of events, saccade-onset, steady-state and end-velocity thresholds, and acceleration thresholds. Since different manufacturers use different algorithms, hardware and software settings, these parameters are not universally applicable. In many instances, the user interface is simplified to provide a few descriptive settings like low, medium, and high.
[088] Algorithms used to extract events from eye movement data can detect gaze position, velocity, acceleration and jerk, each of them being the time derivative of its predecessor.
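As an illustrative sketch in the spirit of velocity-threshold classification (not the specific algorithm of this disclosure), gaze position can be differentiated to obtain velocity and acceleration, and samples can be labeled as saccade or fixation against a velocity threshold; the default threshold is a placeholder consistent with the table given later in this description.

```python
# Illustrative sketch: derive velocity/acceleration from gaze position and label
# samples by a velocity threshold (I-VT style classification).
import numpy as np

def classify_samples(gaze_deg, sample_rate_hz, saccade_vel=30.0):
    """gaze_deg: (N, 2) array of gaze angles in degrees. Returns per-sample labels."""
    dt = 1.0 / sample_rate_hz
    velocity = np.linalg.norm(np.gradient(gaze_deg, dt, axis=0), axis=1)   # deg/s
    acceleration = np.gradient(velocity, dt)                                # deg/s^2
    labels = np.where(velocity >= saccade_vel, "saccade", "fixation")
    return labels, velocity, acceleration
```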
[089] In an embodiment, dispersion algorithms are used to detect fixations, extracting fixation onsets and offsets without using velocity and acceleration data. In an embodiment, probabilistic modeling of saccades and fixations is carried out using Hidden Markov Models. In an embodiment, detection of events relating to gaze, fixation, or saccades to near objects, like control buttons on a vehicle's steering wheel, is carried out by identifying events where glissadic movements are different for each eye of the subject, but where microsaccades occur in both eyes at almost the same time.
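A minimal dispersion-threshold sketch of the first idea in the preceding paragraph follows: fixation onsets and offsets are found by growing a window while the gaze points stay within a dispersion limit, with no velocity or acceleration data. The window length and dispersion limit are illustrative values, not values prescribed by the disclosure.

```python
# Illustrative I-DT style sketch: detect fixations from gaze dispersion alone.
import numpy as np

def _dispersion(window):
    # Horizontal extent plus vertical extent of the gaze points in the window.
    return (window[:, 0].max() - window[:, 0].min()) + \
           (window[:, 1].max() - window[:, 1].min())

def detect_fixations(gaze_deg, sample_rate_hz, max_disp_deg=1.0, min_dur_ms=100):
    """gaze_deg: (N, 2) array of gaze angles in degrees."""
    min_len = max(1, int(min_dur_ms * sample_rate_hz / 1000))
    fixations, i, n = [], 0, len(gaze_deg)
    while i + min_len <= n:
        j = i + min_len
        if _dispersion(gaze_deg[i:j]) <= max_disp_deg:
            # Grow the window while the points stay tightly clustered.
            while j < n and _dispersion(gaze_deg[i:j + 1]) <= max_disp_deg:
                j += 1
            fixations.append((i, j))     # fixation onset/offset as sample indices
            i = j
        else:
            i += 1
    return fixations
```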
[090] During a backtrack, a saccade following a previous saccade occurs in the opposite direction. Look-ahead saccades allow gaze to shift and fixate upon objects that will soon need to be used in some way. This is contrasted with saccading to other objects that may be used in a future planned action, for example, saccading to a radio knob on the dashboard of a vehicle to increase its volume. Ambient processing, involving longer saccades and shorter fixations, is used to scan the important or critical features of a scene first, followed by focal processing for detailed inspection using shorter saccades and longer fixations within regions. Target selection along a scanpath is guided by past experiences and memories of driving under similar conditions, or similar paths, or the same path, which also avoids revisits of earlier targets that are inconsequential. For example, consider driver "A" driving home at 6 pm from his place of work, which he has been doing for the last 5 years as a matter of routine. He will ignore most traffic signs: although he sees them in his peripheral vision, he will not foveate/saccade to them. However, he will pay attention to traffic lights, saccading more slowly to the lights because of their expected presence. Saccades will be reduced in number and velocity (when compared to driving through an unfamiliar path), while fixations and their durations will increase.
[091] Saccadic Directions and Amplitudes: In an embodiment, when there is negligible or no inter-saccadic dwell or fixation between two saccades, and the first saccade's travel was greater than 15 degrees, the two saccades are considered to be purposed for targeting the same object but broken down into a first saccade and a second corrective-saccade. As an example, this can occur when entering intersections or roundabouts, where there is a requirement to scan extreme angles to inspect traffic entering the roads ahead. A similar situation arises when entering a highway from a minor road, wherein the driver is required to check the lane ahead as well as the traffic behind. In routine driving, viewing objects away from the fovea using peripheral vision does not allow for fine-detail cognition. However, details like traffic on adjacent lanes far ahead is relatively unimportant. It is usual to search for details within close proximity to the current ROI
using foveal vision, for example, close-by vehicles in adjacent lanes. When viewing a road, saccades to nearby locations can be more common after a fixation (for example, a child on a bicycle, and checking if there are adults accompanying the child), rather than large amplitude saccades to distant locations. In an embodiment, when the distances between objects are very small (on the order of 10 arcminutes), for example, a multitude of pedestrians on the sidewalk, an absence of saccades between the pedestrians is not taken as a lack of cognition of all these pedestrians by the driver, but rather as the driver advantageously devoting extra cognitive resources to the available retinal resolution in peripheral vision, perceiving these pedestrians at lower resolution, all the while using foveal vision to perceive other, more important objects on the road.
When a subject is searching intently (as opposed to performing general overviews), or when concurrently performing other unrelated tasks, saccadic amplitudes tend to drop. Saccadic velocities decrease with drowsiness, predictable targets, older age, neurological disorders, and drug and alcohol use.
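The 15 degree merge rule described at the start of this paragraph could be applied to a list of detected saccades as sketched below; the record layout and the 20 ms gap tolerance are hypothetical choices, not values given in the disclosure.

```python
# Illustrative sketch: merge a large saccade with an immediately following
# corrective saccade when there is (almost) no inter-saccadic fixation and the
# first saccade travelled more than 15 degrees.

def merge_corrective_saccades(saccades, max_gap_ms=20, min_first_amp_deg=15.0):
    """saccades: list of dicts with 'start_ms', 'end_ms', 'amplitude_deg'."""
    merged = []
    for sac in saccades:
        if merged:
            prev = merged[-1]
            gap = sac["start_ms"] - prev["end_ms"]
            if gap <= max_gap_ms and prev["amplitude_deg"] > min_first_amp_deg:
                # Fold the corrective saccade into the preceding large saccade.
                prev["end_ms"] = sac["end_ms"]
                prev["amplitude_deg"] += sac["amplitude_deg"]
                continue
        merged.append(dict(sac))
    return merged
```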
[092] In an embodiment, when tracking objects using smooth pursuits, for example, a bird taking off from the middle of a road and flying vertically, signature detection algorithms are programmed to accommodate jumpy vertical smooth pursuits. In contrast, this is not the case for horizontal smooth pursuits, for example, when a ball rolls across the road.
[093] In an embodiment, a specific instance of a table is given below, listing threshold and cutoff settings for a program having a set of subroutines suited to a particular scenario and a particular imaging and sensor hardware and software setup. These settings can change from instance to instance.
Type             Duration (ms)    Amplitude        Velocity
Fixation         100-700
Saccade          30-80            4-20 degrees     30-500 degrees/sec
Glissade         10-40            0.5-2 degrees    20-140 degrees/sec
Smooth pursuit                                     10-30 degrees/sec
Microsaccade     10-30            10-40 seconds    15-50 degrees/sec
Tremor                            <1 degree        20 seconds/sec peak
Drift            200-1000         1-60 seconds     6-25 seconds/sec
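For illustration only, the settings in the table above could be held in a configuration structure consumed by the event-identification subroutines. The field names below are hypothetical, and amplitudes listed as "seconds" are read here as seconds of arc.

```python
# Hypothetical settings dictionary mirroring the threshold table above.
EVENT_THRESHOLDS = {
    "fixation":       {"duration_ms": (100, 700)},
    "saccade":        {"duration_ms": (30, 80),   "amplitude_deg": (4, 20),
                       "velocity_deg_s": (30, 500)},
    "glissade":       {"duration_ms": (10, 40),   "amplitude_deg": (0.5, 2),
                       "velocity_deg_s": (20, 140)},
    "smooth_pursuit": {"velocity_deg_s": (10, 30)},
    "microsaccade":   {"duration_ms": (10, 30),   "amplitude_arcsec": (10, 40),
                       "velocity_deg_s": (15, 50)},
    "tremor":         {"max_amplitude_deg": 1,    "peak_velocity_arcsec_s": 20},
    "drift":          {"duration_ms": (200, 1000), "amplitude_arcsec": (1, 60),
                       "velocity_arcsec_s": (6, 25)},
}
```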
BINAURAL SENSING
[094] Figs 7a, 7b show front and side views of a binaural-recording mannequin-head having a microphone (not shown) in each ear at the end of its ear canals (701). The head can be made of plastics or composites, while the pair of life-like ear replicas is made from silicone. Fig 7c shows the placement of microphones (702) inside the mannequin. The mannequin's head is similar in shape and size to a regular human head, but lacks many features like lips and eyes. It has ears that resemble the size and geometry of a human ear. The nose is a straight-edge representation of a human nose and casts a sound shadow. Sound wraps around the mannequin-head, and is shaped by the geometry and material of the outer and middle ear. Some of the sound is transmitted through the head. The two microphones record sound in a way that, when played back, creates a 3-D 'in-head' acoustic experience. The mannequin mimics natural ear spacing, and the "head shadow" of its head, nose and ears produces interaural time differences and interaural level differences. Such an arrangement captures audio frequency adjustments like head-related transfer functions. Fig 7d shows a variation of the binaural device that is truncated above and below the ears of the mannequin.
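As an illustrative aside, the interaural time difference captured by such a binaural arrangement could later be estimated from the two recorded channels by cross-correlation; this sketch is not part of the disclosed processing chain, and the 0.8 ms lag limit is an assumed head-width bound.

```python
# Illustrative sketch: estimate the interaural time difference between the two
# binaural channels by finding the best-aligning lag within a plausible range.
import numpy as np

def interaural_time_difference(left, right, sample_rate_hz, max_itd_s=0.0008):
    """left, right: 1-D arrays of equal length. Returns the lag (in seconds) at
    which the two channels align best, limited to plausible interaural delays."""
    max_lag = int(max_itd_s * sample_rate_hz)
    corr = np.correlate(left, right, mode="full")
    lags = np.arange(-len(right) + 1, len(left))
    window = (lags >= -max_lag) & (lags <= max_lag)    # ignore implausible lags
    best_lag = lags[window][np.argmax(corr[window])]
    return best_lag / sample_rate_hz
```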
[095] Fig 8a shows the placement of such a truncated mannequin-head in a car above the driver-side headrest, with its nose facing the driver. A bluetooth device (not shown) within the mannequin (801) transmits the sound stream from the microphones to a recording device in the car. This recording device can be integrated with the human sensor recording system or be a standalone unit that timestamps the sounds as it records them. Fig 8b shows a whole mannequin head (802) placed on the back side of the backrest, aligned with the driver's head. The nose is above the headrest and facing the driver. Fig 8c shows a full mannequin head (803) just as in fig 8a, but anchored to the roof of the car. Other possible configurations include placing the complete or truncated head on the dashpad, above the rearview mirror, on the passenger-side seat's headrest, and on top of the driver's head (using a head-strap). A frame-mounted variation of the binaural recording device, using a set of smaller ears and without the intervening mannequin-head, is shown in figs 14a-14d. The outer ears and ear-canals of this frame-mounted binaural device are made of silicone, with a microphone at the end of each ear canal.
[096] Fig 9a shows a steering wheel with a hand sensing mat (901) wrapped around the outer wheel. Fig 9b shows a segment (901a) of the hand sensing mat (901). The mat has eight sensors (902) in a row (902a) along its width. The length of the mat is chosen to fit completely around the steering wheel. In the example of fig 9a, there are 64 sensor rows (902a) arranged circumferentially, with each row having 8 sensors. Each sensor (902) has a sensing pad that detects both contact and pressure of the palms and fingers. Fig 9c shows an enlarged section of the mat of fig 9b, with a cross-section through a row of sensors appearing in fig 9d.
Each of the sensors (902) is connected to a bus (serial or parallel) (903) that is connected to a processor (not shown). All the rows (902a) are connected to this bus. Each sensor (902) has a unique address.
When a sensor (902) is touched or pressed, the touch event and pressure value are sent via this bus (903) to the processor. In an example operating scheme, the steering wheel is programmatically divided into left and right halves. In fig 9a, the left side has 32 rows of 8 sensors (for a total of 256 sensors), and the right side the same. Therefore, there are a total of about 1.158 x 10^77 unique combinations. To derive a simpler correlation, the rows can be numbered 1 to 32. An example of hand sensing information obtained during casual driving of a right hand drive (RHD) car at constant speed along a particular segment of a path on a highway with wide roads and very little traffic, where the highway is fenced, has a central divider and 3 lanes in each direction, is as follows.
Left: [-34.82088284469496, 149.38979801139794]; [t10:32:28]; [y(12), x(2, 3, 4, 5, 6, 7), p(3, 3, 3, 2, 1, 0)]; [y(13), x(3, 4, 5, 6), p(3, 2, 2, 1)]; [y(14), x(4, 5), p(1, 0)]; [y(15), x(4), p(0)]; R: [].
This data point indicates that, at the recorded latitude, longitude and time (10 hrs, 32 min, 28 sec), the left side of the steering wheel was contacted at row 12, sensors 2, 3, 4, 5, 6, 7, with pressures on these sensors of 3, 3, 3, 2, 1, 0, respectively. A
similar interpretation applies for the remaining y, x, p values. A zero in the pressure data indicates that no pressure is being applied, but there is contact. Pressure values are dimensionless, with a range of 0-7, the highest value indicating extreme squeezing of the steering wheel. R: [] is a blank data set indicating that the right side of the steering wheel has no contact with a hand (the right side of the steering wheel is not being held). For a very simplified understanding of this dataset, the pressure values can be added: [(3+3+3+2+1+0) + (3+2+2+1) + (1+0) + (0)] = 21, to indicate that the left hand is engaging the steering wheel at pressure 21 at the particular location and/or time, whereas the right hand is not holding the steering wheel. This can indicate a relaxed, simple and familiar driving situation where the driver is not required to be very alert. This situation can be contrasted with driving on a crowded city road that is un-fenced and undivided, with a lot of pedestrians on the sidewalks, bicycle lanes, traffic lights, intersections, and frequently stopping vehicles like buses. The driver in such a situation is much more alert and cautious, with both hands on the steering wheel, gripped tighter than usual. If the driver is new to this city and this traffic pattern, the level of alertness and caution will be even greater, and the grip on the steering wheel tighter. Calibration can be performed by asking the driver to perform different operations, for example, holding the steering wheel with both hands without squeezing, then fully squeezing with both hands.
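A minimal sketch of how such a record might be handled follows, assuming the logged text has already been parsed into the structure shown; it reproduces the simplified sum-of-pressures reading (21 for the left hand, 0 for the right) described above. The field names are hypothetical.

```python
# Illustrative sketch: a parsed grip record and the simplified per-hand
# "sum of pressures" engagement measure described in the text.

sample_record = {
    "gps": (-34.82088284469496, 149.38979801139794),
    "time": "10:32:28",
    "left": [
        {"row": 12, "cols": [2, 3, 4, 5, 6, 7], "pressure": [3, 3, 3, 2, 1, 0]},
        {"row": 13, "cols": [3, 4, 5, 6],       "pressure": [3, 2, 2, 1]},
        {"row": 14, "cols": [4, 5],             "pressure": [1, 0]},
        {"row": 15, "cols": [4],                "pressure": [0]},
    ],
    "right": [],   # empty: the right half of the wheel is not being held
}

def grip_engagement(record):
    """Summed pressure per hand; a crude indicator of how hard the wheel is held."""
    return {side: sum(sum(r["pressure"]) for r in record[side])
            for side in ("left", "right")}

print(grip_engagement(sample_record))   # {'left': 21, 'right': 0}
```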
[097] Fig 9e shows a variation of the hand sensing device. Many drivers hold the steering wheel with their palms and fingers on the steering wheel's ring portion and their thumbs resting on the spoke portions. To sense fingers/thumbs on the spoke portion, additional sensor mats (901a, 901b) are wrapped around each of the spoke portions adjacent to the ring. In fig 9e, each side of the ring-spoke intersection has 8 rows of 4 sensors each.
[098] When the vehicle is being turned steeply using the steering wheel, for example, during a left-hand turn at a 4-way intersection in LHT, the turning action by the driver will cause the hands to grip opposite sides of the steering wheel as it is rotated (as the turn progresses).
This pattern can be detected by algorithms, and used appropriately (for example, to detect sharp turns), or the dataset can be discarded if not appropriate for the present computation.
[099] Figs 10a-10h show foot position sensing concepts. Fig 10a shows the inside of an LHD car with accelerator (1001) and brake (1002) pedals. Figure 10b shows a close-up of modified accelerator and brake pedals, each having a proximity sensor (1001a, 1002b) mounted on them.
Proximity sensors can be any one of those known in the art, including capacitive, light, ultrasonic or time-of-flight (TOF) types. For example, the VL6180 TOF sensor made by STMicroelectronics can be employed. Fig 10c shows another example of the foot position sensing concept. Here, there are two sensors (1003a, 1003b) on the brake pedal and two sensors on the accelerator pedal (not shown). Only the brake pedal is illustrated in this figure. The distances measured by the brake pedal sensors (1003a, 1003b) are dbp-foot (1004a) and dbp-wall (1004b), respectively. Similarly, the distances measured by the accelerator pedal sensors are dap-foot and dap-wall, respectively (these are not indicated in the figures, but are similar to the brake pedal).
It is to be noted that dbp-wall and dap-wall are set to a value of zero when not being depressed, which is not apparent from fig 10c (which shows the actual distance between the pedal and wall).
This can be done during calibration of the pedals at startup. When the pedals are depressed, dap-wall and dbp-wall will actually return the values of how much they are depressed, not their distance from the wall.
[0100] Fig 10d shows an arrangement with three TOF sensors on each of the brake and accelerator pedals, two on the front face and one on the backside (not shown) facing away from the driver. Having two sensors on the front surface allows enhanced mapping of the foot by performing one or a combination of mathematical operations on the measurements performed by these front sensors. These operations can include: averaging, using data from the sensor that is currently providing the highest value, using data from the sensor that is currently providing the lowest value. Furthermore, a calibration step can be incorporated during startup of the vehicle, where the driver is prompted to perform various operations to obtain baseline values. These operations can include: asking the driver to rest the foot on, but not depress, the accelerator pedal, then the same operation for the brake pedal, then depressing each pedal (while the vehicle is in park mode).
[0101] Figs 10e-h show the principle of operation of the foot sensor, with both pedals having two sensors each as described for fig 10b. In figs 10e and 10f, the foot is on the accelerator and brake pedals, respectively. The proximity sensor on the pedal on which the foot rests will now record its lowest value. When an anticipation of braking arises, for example, when driving a car and an unaccompanied child is seen 200 meters ahead, standing by the edge of the road and facing the road, the foot goes off the accelerator and moves over the brake pedal, hovering about 8 cm over it as in fig 10g (only brake pedal shown). As the car approaches closer and is 100 meters from the child, the foot gets closer to the brake pedal, and is now 4 cm over it.
At 75 meters, the foot is on the brake pedal, but not depressing it, as in fig 10h (only brake pedal shown). At 50 meters from the child, the brake pedal is slightly depressed to slow the car. The foot remains on the pedal until after crossing the child, and is then immediately removed from the brake pedal and the accelerator pedal is depressed.
[0102] As an example of a foot and pedal dataset for a short segment of a path, consider a driver driving a car through a suburban area having residential houses and schools, during a school day and school hours. When either the brake pedal or the accelerator pedal is not depressed, dbp-wall=0 mm and dap-wall=0 mm. Assume sample data capture starts at time t=0 seconds. The foot is on the accelerator pedal to keep the car at a constant speed of 50 km/hour, and this continues for 30 seconds. During this period, dap-foot=0 mm and dap-wall=7 mm, which means that the foot is on the accelerator pedal and depressing it by 7 mm. As the car approaches a school zone, the foot goes off the accelerator and depresses the brake pedal to reduce the speed to the legal speed limit of 40 km/hour, which occurs for 4 seconds. During this period, dap-foot=0, dap-wall=0, dbp-foot=0 mm, dbp-wall=5 mm. This is an expected pattern of driving, and can be compared with the map, which will similarly indicate a school zone with a reduced speed requirement. However, after entering the school zone (at t=35 seconds), there is always a possibility that children will dart across the road. The foot is therefore mostly hovering over the brake pedal, particularly when getting close to the entrance of the school, in anticipation of needing to brake to avoid children darting across the road. At t=37 seconds, dap-foot=0, dap-wall=0, dbp-foot=0 mm, dbp-wall=5 mm, which means that the accelerator pedal is not being hovered over or depressed, and the foot is on the brake pedal, pushing it down by 5 mm.
Just after t=37 seconds, the driver notices a small child on a bicycle exiting the gates of the school and riding towards the road. There is a possibility that the child is not paying attention to traffic, and may enter the road ahead. At t=39 sec, the foot depresses the brake pedal for 2 seconds to slow the car down to 20 km/hour. The corresponding values for these 2 seconds are: dap-foot=0, dap-wall=0, dbp-foot=0 mm, and dbp-wall=5 mm to 12 mm. This sequence of values from t=0 to t=41 seconds can be stored in a file along with GPS and timestamps. The average of such data sequences collected by several different drivers over several days can be used as a training file for an AV.
Data from the training file will indicate school days and hours because of the behavior of drivers, as well as the average speed to be followed and the speed profile for that section of the segment of the path.
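For illustration, the pedal readings above could be reduced to a coarse foot-state label before being written to the training file. This is a hedged sketch: the function name, state labels and hover threshold are hypothetical, and the field names simply mirror the dap/dbp notation used in this description.

```python
# Illustrative sketch: label the foot state from the four pedal readings.
def foot_state(dap_foot_mm, dap_wall_mm, dbp_foot_mm, dbp_wall_mm, hover_mm=100):
    if dap_wall_mm > 0:
        return "accelerating"                  # accelerator depressed
    if dbp_wall_mm > 0:
        return "braking"                       # brake depressed
    if dbp_foot_mm == 0:
        return "foot_resting_on_brake"         # touching brake but not depressing it
    if dbp_foot_mm <= hover_mm:
        return "hovering_over_brake"           # anticipation of braking
    if dap_foot_mm == 0:
        return "foot_resting_on_accelerator"
    return "foot_away_from_pedals"

# Example from the school-zone sequence above (t = 37 s):
print(foot_state(dap_foot_mm=0, dap_wall_mm=0, dbp_foot_mm=0, dbp_wall_mm=5))
# -> 'braking' (brake pedal depressed by 5 mm)
```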
[0103] Figs 11a-c show the inside of a car with various arrangements of cameras and IR illumination sources for eye movement tracking. Fig 11a shows the inside of a non-autonomous car being driven through a path by a driver. The car is equipped with GPS, an inertial measurement unit (IMU), LIDAR, radar, outside cameras, inside cameras and the other outside-environment and vehicle sensors shown in fig 24. The inside cameras track the subject's eye movements and head movements. The video data stream is saved with time and GPS/IMU stamps. The saved stream is then analyzed by an image processing system to extract event information, including saccades, microsaccades, glissades, tremors, fixations and drift. A map is created incorporating the path's roads, timestamps, GPS/IMU coordinates, speed profiles, driver behaviors (lane changes, turn indicators, braking, accelerating, the foot going off the accelerator pedal and moving/hovering over the brake pedal) and vehicle behaviors (turning radius, etc.). In fig 11a, the car has one imaging device (1101) on its dashpad. The device is closer to the windshield than the edge of the dashpad. The device has one camera (1101a) and two IR illumination sources (1101b, 1101c).
The center of the steering wheel is on the same plane as the sagittal plane of the driver, and the center of the camera is coincident with the plane that connects the sagittal plane with the central axis of the steering wheel. Figure 11b shows an arrangement with an imaging device (1121) placed on the dashpad and having two cameras (1121a, 1121b) and two IR illumination sources (1121c, 1121d). The two cameras are each offset from the central axis of the steering wheel by 4 cm.
Fig 11c shows an arrangement with an imaging device (1131) having two cameras (1131a, 1131b) and two IR illumination sources (1131c, 1131d), and two additional cameras (1132a, 1132b), each of these additional cameras having two IR illuminators (not labeled in the figure). The two central cameras are each offset from the central axis of the steering wheel by 4 cm, while one of the additional cameras is placed below the central axis of the rearview mirror, and the other is placed along the central horizontal axis of the driver-side sideview mirror, at an elevation lower than the other cameras.
[0104] Figs 12a-f show the inside of a car with various arrangements of phone cameras. The cameras of figs 12a-12f can have a clip-on filter (an example appears in fig 12c) whose transmission wavelength matches the illumination source wavelength. For example, if the phone's screen were programmed to display a narrowband blue wavelength, then the filter's transmission characteristics would match the same color. The filter can be of any type, including absorption and reflectance. Examples of a phone's screen color characteristics are described with reference to fig 12e below. In addition to, or as a replacement for, the filter, a snap-fit or clip-on lens system (single lens, multiple lens or zoom-lens) can also be added to reduce the field of view so that a much larger proportion of the captured image is the driver's head, thus giving a higher resolution of the eyes. Such a zoom-lens can be connected to a system that processes the image acquired by the camera so as to make the zoom-lens auto-focus on the driver's eyes, giving better focus as well as a higher resolution image of the eyes (by filling more of the camera's sensor with relevant portions rather than unnecessary background).
These filter and lens/zoom-lens arrangements can be adapted for use with front-facing as well as rear-facing cameras.
[0105] Fig 12a shows a mobile phone (1201) with its back facing camera (1201a) facing the driver and located along the driver's sagittal plane. The phone is secured on a stand (1201b) which is on the dashpad. The phone stands/holders in all embodiments in this disclosure can have tip-tilt adjustment. Such tip-tilt adjustment helps align the camera to account for the inclination/irregularities of the dashpad, angle of the windshield, driver height and placement, steering wheel height, and camera body irregularities. In the embodiment of fig 12a, illumination can be provided by ambient light, light from the phone's screen, an external illuminator (an embodiment of an external illuminator is shown in fig 12d), or a combination. The quality of data obtained from ambient light is much lower than when using pure IR illumination, and not all eye movement events can be computed from this data. Fig 12b shows the image obtained by the phone's back-facing camera in a non-zoomed mode. The camera can be zoomed in to capture more of the face (and eyes) and less of the background, which will also improve accuracy during event extraction and provide higher quality results.
[0106] Fig 12c shows an imaging arrangement with two mobile phones (1202, 1203) with their front facing cameras (1202a, 1203a) facing the driver. The phones are secured on individual stands which sit on the dashpad of the car. There are no illuminators present except ambient light and/or light from the phone's screen. A clip-on filter (as described earlier) (1202b, 1203b) is also shown.
[0107] Fig 12d shows an imaging arrangement with a mobile phone. The phone (1204) is lying on its back (screen facing dashpad), with its back facing camera (1204a) facing the windscreen and located along the driver's sagittal plane. The phone is secured on a base stand (1204b). Two IR illuminators (1204c, 1204d) are aimed towards the eyes of the driver. A
patch (1204e) of width 15 cm and height 10 cm is affixed to the windscreen. The center of this patch is aligned with the center of the camera. The base stand (1204b) has tip-tilt adjustment.
The base is adjusted such that the camera's center images the driver's forehead at a center point between the eyes. The size of the patch will be smaller the closer it is to the camera. It is wide enough to image an area that is three times the width of the driver's head, with a proportional height. The patch is optimized for full reflectance of IR wavelengths (the specific wavelength band being the band at which the IR illuminator emits light) at angles of incidence of 40-50 degrees, preferably 43 to 48 degrees. In this imaging arrangement, the placement position of the camera (on the dashpad) and its tip-tilt setting, the placement position of the patch on the windscreen, and the height of the driver are interrelated. The goal here is to place the patch as low as possible on the windscreen, without its line of sight being obscured by the steering wheel, while at the same time centering the driver's eyes on the camera. Light from the IR illuminators is reflected from the eyes of the driver and is then reflected off the patch into the camera. The patch does not obstruct visible wavelengths, and therefore the driver is able to see through the patch without the view of the road being obstructed. The patch can be custom made for a specific vehicle model, with its reflectance optimized for the required angle of incidence, taking into account the angle of the windshield.
[0108] Fig 12e shows a mobile phone (1205) with its front facing ('selfie') camera (1205a) facing the driver and aligned approximately with the center of the steering wheel. The back facing camera (not shown) faces the road ahead. The phone is secured on a stand (1206) which is on the dashpad. Illumination is provided by the phone's screen (1205b). The screen has four squares (1205c, 1205d, 1205e, 1205f) of a particular narrowband color, while the rest of the screen is blank (black). These four squares act as illuminators. The particular color can be, for example, a wavelength of 450 nm (blue), with a narrow bandwidth of +/- 15 nm. The color can be chosen to be that of the particular phone model's peaks of display illumination intensity. In another example, this color can be 540 nm +/- 10 nm. Generally, a narrower bandwidth is chosen when the intensity curve around a peak is flattened, and a broader bandwidth when the intensity curve around the peak is steep. The imaging software (of the camera) is programmed to discard (from the acquired images) wavelengths above and below the narrowband wavelengths. The advantage of this imaging setup is that eye movement tracking becomes much more sensitive because the reflections of the four squares from the eye can be captured while rejecting ambient light, including reflected ambient light. The four squares can also each have a different narrowband color, or two of the same color, or any such combination.
The phone's software is programmed to cause the screen to display these specific narrowband colored squares, and the phone's imaging software is set to reject other colors from the images captured by the camera. Instead of a square, the shape of the illumination areas can also be another shape, like a circle, triangle, line or grid pattern, or other patterns similar to those appearing in fig 12f. These squares and other patterns can be sized so that they work well with the particular zoom level. The color of the illuminator patterns can be set to change periodically, for example, changing the color of each square every 0.1 seconds.
The color of the illuminator patterns can be set to change automatically depending on the dominating ambient colors. For example, when driving through roads surrounded by greenery, the phone detects this dominance and automatically changes the color of the patterns to another color. If greenery and blue-skies are dominant, the color of the pattern is automatically changed to another color like red.
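By way of illustration only, a minimal sketch of such automatic pattern-color selection is shown below, assuming the road-facing camera frame is available as an RGB array; the candidate colors and the "farthest from the ambient mean" rule are illustrative assumptions.

# Sketch: pick an illuminator pattern color that avoids the dominant
# ambient color seen by the road-facing camera (e.g. avoid green when
# driving through greenery). Candidate colors are illustrative.
import numpy as np

CANDIDATE_COLORS = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}

def pick_pattern_color(frame_rgb: np.ndarray) -> tuple:
    """frame_rgb: HxWx3 image from the road-facing camera."""
    mean_rgb = frame_rgb.reshape(-1, 3).mean(axis=0)
    # Choose the candidate color farthest (Euclidean) from the ambient mean.
    return max(CANDIDATE_COLORS.values(),
               key=lambda c: float(np.linalg.norm(np.array(c) - mean_rgb)))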
[0109] Fig 12f shows a variation of fig 12e. This arrangement shows two mobile phones (1206, 1207) with front facing cameras (1206a, 1207a). Each of the mobile phones has, on its screen, different patterns. The first pattern (1206b) is a circle with a crosshair through it, the second is a solid circle (1206c), the third (1207b) is two concentric circles, and the fourth (1207c) is a solid square. As with fig 12e, each of the patterns can have the same color or different colors, or the crosshair can be of one color while its circle is solid and of a different color, or the two concentric circles can be of different colors. The patterns are generated by the phone, and the imaging software can be programmed to identify these specific patterns as reflected from the eye.
[0110] Figs 13a-d show details of an embodiment of a mobile phone camera imaging arrangement with an adapter. Fig 13a shows the placement of the phone (1310) on the dashpad (1311a), with a portion of it overhanging into the dashboard (1311b). The camera is aligned to the center of the steering wheel. The adapter (1312) has, built into it, two mirrors (1312a, 1312b) and two filters (1312c, 1312d). The two mirrors are used to direct light (that has been reflected from the eye) into the back facing camera (1312f). Both the back facing camera (1312f) as well as the front facing camera (1312e) capture images. As with most mobile phone cameras, the front facing camera captures a much larger area (but at a lower resolution) compared to the rear facing camera (which has a much higher resolution). The front facing camera is used as a coarse indicator of the eye position in the scene being captured, while the rear facing camera captures the finer details that are useful for eye movement event extraction. The rear facing camera can also be made capable of optical zoom (as opposed to software zoom) to get close-up images of the driver's eyes. The filters (1312c, 1312d) cut off all wavelengths above and below the illuminator's narrowband wavelength. The illuminator can be an external source, for example, IR sources, or patterns on the phone's display. In an alternative embodiment, the optical filters can be dispensed with, and a scheme for software implemented filtering as described for fig 12e can be used. The mirrors can be chosen to be fully reflective for all wavelengths, or in an alternate embodiment, selected for reflection only in the narrowband illumination wavelength.
[0111] Figs 14a-c show various arrangements of frame mounted eye and sound imaging systems.
Fig 14a shows a frame mounted eye movement and ambient sound imaging system as worn by the driver, with only a truncated head of the driver shown in the figure. Fig 14a1 shows the top view of fig 14a, while fig 14a2 shows the bottom view of the frame worn in fig 14a.
The frame is symmetrical, including the components mounted on it, and therefore components on only one side of the frame are numbered. The frame (1401) has an inertial measurement unit (IMU) (1401a) with a master timing clock. The IMU allows absolute position tracking of the head of the driver. On each side of the frame, there are: a binaural recording device (1401b), two cameras (1401c, 1401d), and three IR illuminators (1401e, 1401f, 1401g). It should be noted that imaging of just one eye (for example, the dominant eye) can be carried out instead of binocular imaging in both head mounted as well as remotely mounted (dashpad) systems. Of course, details like vergences that are related to binocular vision will be lost. Frame-mounted eye movement imaging systems, unlike dashpad mounted systems, are not aware of when the head is moving. IMUs help extract eye movement information if and when there is associated head movement, for example, in eye-in-head fixations. Both the eyes and the head move when tracking objects at a high angle away from the steering wheel. In this disclosure, all references to eye movement data assume that head movement has been taken into consideration. It should be obvious that dashpad or other remotely mounted cameras (IR or visible wavelength) can be used to detect head movement instead of using IMUs.
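By way of illustration only, the head-movement compensation described above could look like the following minimal sketch, assuming the IMU supplies a head-to-world rotation matrix and the eye-facing cameras supply a gaze direction in head coordinates; the function and variable names are illustrative.

# Sketch: convert an eye-in-head gaze direction to a gaze-in-world
# direction using the head orientation reported by the frame's IMU.
import numpy as np

def gaze_in_world(r_head_to_world: np.ndarray,
                  gaze_in_head: np.ndarray) -> np.ndarray:
    """r_head_to_world: 3x3 rotation matrix from the IMU.
    gaze_in_head: unit gaze vector measured by the eye-facing cameras,
    expressed in the head/frame coordinate system."""
    g = r_head_to_world @ gaze_in_head
    return g / np.linalg.norm(g)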
[0112] Fig 14b shows an embodiment of a frame mounted eye movement and ambient sound imaging system (upper figure), and the bottom view (lower figure) of this frame. The frame (1402) has symmetrical components, and an extra IMU (1402d) in the center portion. Only one side of the symmetrically placed components is identified in the figure.
Prescription eyeglass (1402a) is clamped on to the frame using a hard polymer clamp (1402b). Other components are:
IMU (1402c), binaural recording device (1402e). In the space between the eyeglass and the eye, each side of the frame has two eye movement capturing cameras (1402f, 1402g), two IR
illuminators (1402h, 1402i), a rear facing (road facing) camera (1402j) that captures images of the scene in front of the driver, and an autorefractor (1402k) that is used to record in near real-time where the eye is focused. The autorefractor faces the pupil, has its own built-in IR source, and projects a pattern on the eye. The cornea and the phakic lens of the eye together focus the pattern onto the fundus. The wavefront reflected from the fundus is sensed by a lenslet array in the autorefractor, and the wavefront is analyzed. The focal length of the eye's lens can then be deduced from this measurement since the cornea has a fixed focal length in an individual. The line of sight of the eye can be derived from the eye position data extracted when recording eye movement data. Combining this line of sight with the focal length of the lens provides information on the point in space the eye was fixated on. The road-facing camera on the frame captures video in real-time, and this can be combined with the eye fixation point to determine what object was being fixated on.
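By way of illustration only, combining the line of sight with the autorefractor reading to estimate the point of fixation could be sketched as follows, assuming the autorefractor output has been converted to an accommodation value in diopters; the clamping value and the names are illustrative.

# Sketch: estimate the 3D fixation point as the eye position plus the
# gaze direction scaled by the distance implied by accommodation.
import numpy as np

def fixation_point(eye_pos: np.ndarray, gaze_dir: np.ndarray,
                   accommodation_diopters: float) -> np.ndarray:
    """eye_pos: eye center in vehicle coordinates (meters).
    gaze_dir: unit gaze direction from the eye movement data.
    accommodation_diopters: derived from the autorefractor wavefront."""
    # Clamp very small accommodation values so distant fixations are
    # treated as roughly 10 m rather than infinity.
    distance_m = 1.0 / max(accommodation_diopters, 0.1)
    return eye_pos + gaze_dir * distance_m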
[0113] Fig 14c shows an embodiment of a frame mounted eye movement and ambient sound imaging system (upper figure), and the bottom view (lower figure) of this frame. Only one side of the symmetrically placed components is identified in the figure.
Prescription eyeglasses (1403a) are mounted on the frame (1403) using a hard polymer clamp (1403b); the frame also carries IMUs (1403c, 1403d) and a binaural recording device (1403e). In the space between these eyeglasses and the eyes, each side of the frame has two eye movement recording cameras (1403f, 1403g), two IR illuminators (1403h, 1403i), and an autorefractor (1403k) that is used to record in near real-time where the eye is focused. Outside this space (that is, outside the eyeglasses), a rear facing (road facing) camera (1403j) captures images of the scene in front of the driver.
These road facing cameras are connected to the main frame by a transparent hard polymer U-shaped member (1403l), the U-shaped member going around the eyeglasses. If prescription eyeglasses are not required, then the U-shaped member is not required, and instead the road-facing camera can be attached directly to the frame, for example, just behind one of the eye-facing cameras.
The autorefractors in this embodiment do not face the pupil, but instead face the eyeglasses. The eyeglasses have an IR reflective coating applied on their inner surface (the surface closer to the eyes). This coating can have 100% IR reflectivity (at the wavelength specific to the light source used by the autorefractor) at around a 30-60 degree angle of incidence. In effect, the eyeglasses act as mirrors at this wavelength. In another embodiment, the autorefractor and the eye imaging cameras can share the same IR illumination source, with the source having a pattern also suitable for the autorefractor. The autorefractor records in almost real-time the focal length of the eye's lens. As in the previous embodiment, this data can be combined with the eye fixation point to determine what object was being fixated on. In another embodiment (not shown), the system of fig 14c can be used without the autorefractor.
[0114] Any of the previously discussed frame mounted eye and sound imaging systems can be used with a reduced or increased number of components. For example, the frame could have one or more eye-facing cameras for each eye, with one or more IR illuminators. If needed, the frame can be made for imaging only one eye (for example, the dominant eye), with the other portion of the frame being empty. The binaural recorders can be dispensed with if not required, as can the road-facing cameras and/or IMU sensors. Any of the previously disclosed binaural sensors can be incorporated into any of the frames of figs 14a-14c as long as they are of a size that can be mounted on the frame without causing inconvenience to the driver. Furthermore, the binaural sensors can be incorporated into other parts of the car or placed elsewhere, including the headrest, a roof mounting, or the driver's head.
[0115] Fig 15a shows a scenario of eye movement data being used to train a vehicle to improve its autonomy. It shows the first image of the video, with 2.5 seconds (starting with the time of the first image) worth of saccades and fixations overlaid on this image. A car is being driven by a human driver. The car being driven by the driver is not shown. What is visible to one of the outside facing cameras is shown in the figure. The driver's view is vignetted by the frame of the car, and what is visible to the driver are the parts that are covered by glass (like the windshield and windows). As the driver is driving the car, the eye movement imaging system (dashpad mounted or head mounted or a combination) captures the eye movements of the driver. An image analysis system extracts data related to at least saccades and fixations, and optionally also data related to glissades, smooth pursuits, microsaccades, square wave jerks, drifts and tremors. Figure 15b shows saccades and fixations isolated from fig 15a for clarity. Saccades are shown by lines with arrows, and fixations are shown as circles, the largest fixation circle being 600 ms, the smallest 200 ms. Time and geolocation stamps are gathered along with outside video, driver's eye movement video, and LIDAR. It should be appreciated that not all data may be available at all times, for example, during blinks, driving through tunnels, and poor weather conditions, but available data is recorded at all times. This data is saved in an AV's software. The saved data indicates to the AV
which parts of the segment of this path need to be analyzed carefully or with higher priority and appropriate actions taken whenever necessary. Much higher computational efficiencies can be attained if foveated portions of an image are analyzed instead of the entire image. Also, foveated regions can be processed for color information, while the peripheral vision can be analyzed for moving objects, flashing objects, and sudden movement, lending itself to much faster, accurate and efficient computation.
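By way of illustration only, foveated analysis of this kind could be sketched as follows, assuming fixation points are available in image pixel coordinates; the window size and the placeholder detector are illustrative.

# Sketch: analyze only patches around the driver's fixation points
# instead of the entire frame.
import numpy as np

def detect_objects(patch: np.ndarray):
    """Placeholder: a real system would run its vision model here."""
    return []

def analyze_foveated(frame: np.ndarray, fixations_px, window: int = 128):
    """frame: HxW(x3) image; fixations_px: list of (x, y) pixel coordinates."""
    h, w = frame.shape[:2]
    results = []
    for x, y in fixations_px:
        x0, y0 = max(0, x - window // 2), max(0, y - window // 2)
        patch = frame[y0:min(h, y0 + window), x0:min(w, x0 + window)]
        results.append(((x0, y0), detect_objects(patch)))
    return results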
[0116] In the scenarios of figs 16-20, time and geolocation stamps are gathered along with all sensor data (as in fig 20) including outside video, driver's head/dashpad mounted outside facing camera video, driver's eye movement video, binaural audio, foot and hand sensor data, speedometer, RPM, wheel turning angle, weather (temperature, precipitation, visibility, humidity), LIDAR, radar, and ultrasound. Signatures are relative to each frame in a video, or to a series of consecutive frames of the video. The values of these events are recorded within each video frame (as metadata) or in a separate file (but with synchronized timestamps and/or position data) as multi-dimensional vectors that include timestamps, physical location (GPS/IMU), vehicle, outside environment and human sensor (shown in fig 22) data. It should be appreciated that not all sensor data may be available at all times; for example, when using a mobile phone to record sound and eye movement, binaural data will not be available, just single-microphone data. Or, the driver may be driving a vehicle that does not have all the sensor systems installed. The absence of some of these sensor systems does not take away from the fact that event signatures can still be extracted, although with a loss of robustness and a possible increase in latency.
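By way of illustration only, one such multi-dimensional record could be sketched as follows, with missing sensors stored as None; the field set shown is an illustrative subset of the full sensor list.

# Sketch: one event record combining timestamp, position and the
# available human / vehicle / outside-environment sensor values.
# Missing sensors (e.g. no binaural recorder) are simply left as None.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EventRecord:
    timestamp: float                        # synchronized clock, seconds
    gps: Optional[tuple] = None             # (lat, lon); None in tunnels
    speed_kmh: Optional[float] = None
    eye_events: list = field(default_factory=list)   # saccades, fixations...
    foot: Optional[dict] = None             # dap/dbp distances in mm
    hand: Optional[dict] = None             # grip force, contact area
    audio_signature: Optional[dict] = None  # siren, honking, etc.
    lidar_frame_id: Optional[int] = None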
[0117] Fig 16 depicts a scenario of a human driver driving a car on a city road with several intersections and light traffic. In this figure and its accompanying description, the car being driven by the driver is not shown, and all references to a driver relate to the driver of this car.
However other cars (1601, 1602) on the road are shown. The figure shows roads, buildings, and other features from the perspective of the driver. The A-beams of the car are not shown in the figure, only the area within the windshield. The car has human, outside environment and vehicle sensors (as listed in fig 20).
[0118] An active ambulance is nearby, but not yet visible to the driver because it is hidden by a building (1603). The ambulance's sirens can be heard, but its flashing lights are not yet visible to the driver because buildings are blocking the view of the perpendicular roads ahead. Sounds, unlike light, are not completely blocked by buildings and trees. It appears to the driver that the ambulance is on one of the cross-roads since the road ahead and behind are clear.
[0119] When the ambulance's sirens become audible and discernible, the driver takes his foot off the accelerator and moves it over the brake pedal, while saccading to the rearview mirror, the driver-side sideview mirror, and the left and right sides in front to find out where the ambulance is.
This saccading pattern is repeated until the driver is able to aurally establish the origin of the sound as coming from the front. After this, the driver's saccades are directed towards that region in the front. As soon as the reflections of the flashing lights (1604) of the ambulance are seen by the subject (reflections bouncing off buildings, road and trees), the brake pedal is depressed slightly (by an amount inversely proportional to how far ahead the ambulance's lights are).
The brake pedal is then continuously depressed to slow the vehicle to bring it to a rolling stop if and when the need arises. As soon as the ambulance exits the intersection (1605), the accelerator pedal is depressed to speed up the car if there are no other emergency vehicles following the ambulance. The binaural recording provides an extractable signature for the ambulance's siren. The human event occurrence detection scheme of fig 26a is used to detect that a human event has occurred in fig 16 since there is a foot release from the accelerator and movement over the brake pedal and also an associated aural event (ambulance siren) detected. Once a human event has been detected, the next step is to find the associated outside event that caused the human event to occur. The associated eye movement data is used to analyze the video images of the road ahead and behind (from road facing cameras) for detectable events. The image analysis is efficient because only the small parts of the images where the eyes are saccading and fixating are analyzed. The initial faint lights of the ambulance are detected in the video images. Critical features include flashing lights and specific colors of the light. This forms the process of event signature extraction of fig 26b. Extracted components include aural (siren sound), video (flashing light), and foot (slowing down of car). This is followed by the categorization, map update and training software update as shown in fig 26b. Several such instances under different conditions and from different drivers and geographical regions are similarly extracted and stored in the database.
The "ambulance"
event (in the form of a subroutine for emergency vehicle identification and reaction) can first be implemented in test vehicles. These test vehicles can be semi or fully autonomous. A variation of this scenario is when there is no light, just sound, which can be the case in crowded cities, for example. In such instances, only the binaural signal is captured. When using non-binaural recording (a mobile phone with a single microphone, for example), directionality will be lost, but a sound signature can still be extracted and combined with other human and vehicle (outside and inside) sensor data.
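By way of illustration only, the kind of rule that flags the human event in this scenario (foot released from the accelerator and hovering over the brake, together with a siren-like aural event) could be sketched as follows; the hover threshold and the siren detector stub are illustrative assumptions.

# Sketch: flag a human event when the foot leaves the accelerator and
# hovers over the brake while a siren-like sound is detected.
def siren_detected(audio_window) -> bool:
    """Placeholder: a real system would match the stored siren signature."""
    return False

def human_event_detected(foot_samples, audio_window,
                         hover_mm: float = 30.0) -> bool:
    """foot_samples: dicts with 'dap_foot' and 'dbp_foot' distances in mm."""
    released = any(s["dap_foot"] > hover_mm for s in foot_samples)
    over_brake = any(s["dbp_foot"] < hover_mm for s in foot_samples)
    return released and over_brake and siren_detected(audio_window)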
[0120] Signatures from multiple instances of such ambulance appearances from different subject drivers can be used to form an averaged scenario signature (including sound and light signatures) and an AV response to an approaching ambulance. This can be for a group of drivers in a region having similar flashing light schemes and sounds for ambulances, and also similar traffic rules regarding ambulances. Although one instance can be used for training, for improved accuracy, several such events from several different paths driven by several different drivers can be acquired and used to train AVs. This specific subroutine is then fine-tuned by software self-learning (artificial intelligence), by a (human) programmer, or by a combination. After several cycles of fine-tuning and testing, the subroutine can be implemented in non-trial AVs. Without this updated software, the AV would have continued without reducing speed significantly until an ambulance actually appeared.
[0121] Fig 17 shows a scenario in which a small child is ambling towards the edge of the road in the opposite lane. A human driver is driving a car at around 36 km/hour on this narrow undivided road with buildings on either side. The driver sees the child (1701) emerging from behind a pillar (1702) without an accompanying adult, 100 meters ahead.
The edge of the pillar is 2 meters from the edge of the road. In reality, an adult is holding the hand of the child, but is behind the pillar and therefore hidden from the driver's view. The driver's eyes saccade to the child and form an ROI around the child (ROI-child), which includes checking for adults holding the child's hand, and tracking the child moving closer to the road, interspersed with saccades to the road ahead. The driver has now become alert, and increases hand grip and contact area on the steering wheel. The foot goes off the accelerator and over the brake pedal immediately. With the eyes unable to find an accompanying adult, and the child being about 70 meters ahead, the brakes are applied to lower the speed from the initial 36 km/hour to 18 km/hour in a span of 1.5 seconds. As the eyes saccade to and fro between the ROI-child (which is now about 60 meters ahead of the driver) and the road as the child inches closer to the road, the driver is still unable to spot an adult. The car is slowed from 20 km/hour to 5 km/hour in 2 seconds. The child is 1.5 meters from the edge of the road, and the car is about 50 meters from the child. The brake pedal is kept depressed in preparation for a complete stop to take place about 10 meters from the child. However, 30 meters from the child, the driver is able to see the adult holding the child's hand. The foot goes off the brake now. The adult (who has apparently seen the approaching car) restrains the child from moving forward. The driver presses on the accelerator pedal to slowly bring the speed back to 36 km/hour. This signature is captured and processed, and then filed in an "unaccompanied child approaching road" sub-category under the main category "Child" (as listed in fig 25); this process is described later. From the foregoing, it can be seen that the driver was being over-cautious. He reduced the speed to 5 km/hr at 50 meters from the child, even though the child was 1.5 meters from the road. However, when data is gathered from a large population of drivers, the average speed at 50 meters from the child would be 20 km/hr, and can be used by an actual AV.
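By way of illustration only, averaging an over-cautious individual profile with profiles from many other drivers could be sketched as follows, with speeds binned by distance to the child; the bin width and record format are illustrative assumptions.

# Sketch: average many drivers' speed-versus-distance-to-child profiles
# so an AV can use a population average (e.g. ~20 km/h at 50 m) rather
# than one over-cautious driver's 5 km/h.
from collections import defaultdict
from statistics import mean

def average_speed_profile(runs, bin_m: float = 10.0):
    """runs: one list of (distance_m, speed_kmh) samples per driver."""
    buckets = defaultdict(list)
    for run in runs:
        for distance_m, speed_kmh in run:
            buckets[round(distance_m / bin_m) * bin_m].append(speed_kmh)
    return {d: mean(v) for d, v in sorted(buckets.items())}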
[0122] The human event occurrence detection scheme of fig 26a is used to detect that a human event has occurred in fig 17, since there is a sudden foot release from the accelerator and movement over the brake pedal and an increase in hand grip and contact area on the steering wheel, both with associated eye movement to the side of the road and formation of ROIs and saccades/fixations around the child. Once a human event has been detected, the next step is to find the associated outside event that caused the human event to occur. Video images from cameras facing the road are analyzed using image processing, and the child is identified as corresponding to the eye movement data, as is the edge of the road. Critical features like the lack of an adult accompanying the child, and the spacing between the child and the road, are stored as part of the signature. This forms the process of event signature extraction of fig 26b. This is followed by the categorization, map update and training software update as shown in fig 26b.
Several such instances under different conditions and from different drivers and geographical regions are similarly extracted and stored in the database. When the updated training software is used by an AV, and the AV encounters a similar "unaccompanied child approaching road" scenario, it analyzes the ROI around the child at a higher priority, while reducing speed to about 20 km/hour by the time it gets to 50 meters from the child. Once the adult is detected, the speed is resumed to 36 km/hour. Without this updated averaged software, the AV would have continued without reducing speed, and an accident could probably have occurred if the child was actually unaccompanied and entered the road. The additional benefit of using the updated software is that higher speeds can be maintained without being overly cautious, and rational speed decisions can be made depending on how the situation evolves.
[0123] Fig 18a shows an aerial view of a scenario in which a human driver is driving a car (1801a) that is approaching a roundabout having three roads entering it. All references to a driver in this scenario relate to the driver of the car. A heavy truck (1801b) is also approaching the roundabout (1803). Both the car and the truck are traveling in the directions shown, and are 200 meters from the roundabout. The car is traveling at 80 km/hour and the truck slightly slower at 75 km/hour. The distance (1804) between the car's lane's entry point into the roundabout and the truck's lane's entry point is about 75 meters. Fig 18a shows this starting scenario. The truck is not slowing down as it gets closer to the roundabout. The car has the right of way, but the driver is not sure if the truck will eventually stop. The truck appears in the driver's peripheral vision, and the driver makes a saccade towards the truck, and then slow-tracks it (for about 3 seconds) as it approaches the roundabout. During this period, the driver's grip on the steering wheel and the contact area increase slightly. The foot goes off the accelerator, but does not move over the brake pedal. The driver then makes a saccade towards the roundabout to check if there are other vehicles in or about to enter the roundabout (vehicles inside the roundabout have the right-of-way), and observes that the roundabout is clear. The driver's eyes then quickly saccade to the truck to slow-track it for another 3 seconds. Since the truck is not slowing down, but continuing towards the roundabout, the driver's foot goes over the brake pedal and depresses it to halve the speed from 80 km/hour to 40 km/hour in 4 seconds. Fig 18b shows the perspective view of the scenario at this time, wherein the truck is about 40 meters from the roundabout and starts slowing down rapidly and the car has already entered the roundabout.
The car driver has been slow-tracking the truck, and notices it is slowing down. The driver's foot goes off the brakes for 1.5 seconds, while the eyes saccade to the roundabout to check for entering traffic and then saccade back to the truck (which has almost come to a complete stop); the foot then goes over the accelerator pedal and depresses it to rapidly speed up to 60 km/hour and enter the roundabout. The eyes fixate on the truck while crossing the roundabout. The scenario beginning at fig 18a, proceeding through fig 18b, and ending after the car has exited the roundabout, is captured and a signature extracted and categorized under "Danger" (see signature categorization in fig 25 and related text).
[0124] Fig 19a shows a scenario where a driver driving a car encounters a maintenance truck in the same lane replacing street lights. In fig 19a, the car is not shown; only the maintenance truck (1901) is shown. The truck has a flashing yellow light (1902), and an extended boom (1903) with a platform (1904) having a person on it. The car is 60 meters from the truck and traveling at 40 km/hour, and the truck is 40 meters from the intersection (1905). The car is in a 'right turn only' lane, and intends to turn right at the intersection. The driver sees the truck in the lane. The driver's eyes saccade to the truck body, then to the boom and the platform above, and then to the person on the platform. The eyes establish an ROI around the truck, boom and person, saccading around it, while the hand grip and contact surface area on the steering wheel increase.
The foot simultaneously goes off the accelerator and on to the brake pedal, slightly depressing it.
The eyes then saccade to the rearview mirror and sideview mirror, end of the road (which the driver notices is about 40 meters from the intersection), and then back to the truck. The car is slowed down to 15 km/hour over 3 seconds. The car is now 30 meters from the truck. The human driver instinctively decides to drive around the truck by switching to the other lane on the same side without expecting the truck to start driving away. After this, the driver switches back into the original lane. If the truck were parked at the intersection, then the human driver would have switched lanes and taken an alternate route, for example, going straight through the intersection.
The decision to switch lanes to get around the truck involved the eyes establishing an ROI
around the truck-boom-platform, and saccading and fixating within this region, and also to the rear/sideview mirrors and the intersection, deciding it is safe to switch to another lane and back again (while mentally noting that there is no traffic in rear/side view mirrors, and there is enough distance between truck and intersection). The signature of this event is captured (as described in the previous scenarios), and categorized under "Unexpected Objects" (see signature categorization in fig 25 and related text), under a sub-category of "Maintenance".
[0125] Fig 19b shows a scenario of a child on a bicycle on the pavement on the same side of the road on which a human driver is driving a car. Fig 19b1 shows eye movement data for the first 4 seconds of this scenario superimposed on a still image. Fig 19b2 shows just the eye movement data, while fig 19b3 shows an enlarged version of fig 19b2. The circles represent fixation points and times, the largest circle corresponding to a fixation time of 500 ms, while a majority of them are 150 ms. The straight lines represent saccades, with directions indicated by arrows. Over the course of this scenario, there is no other distraction in the foveal or peripheral vision, including no traffic lights or other traffic. The car is in the rightmost lane and 100 meters away from the child (1910), driving at 50 km/hour. There is no traffic on the road. The driver's eyes saccade to the child and the bike (1911), forming an ROI around them. The eyes-brain combination concludes that the bicycle is stationary, with both feet of the child on the ground, and that the front wheel is close to the edge of the road. There are no adults accompanying the child (and therefore the child's actions can be more risky). The child appears very young in age, perhaps 4-8 years old, and therefore can perform unexpected moves, including riding the bike into the road without waiting for the car to pass, or stumbling and falling onto the road. Expecting this, the driver's grip and contact area on the steering wheel increase slightly, while the foot goes off the accelerator, moves over the brake pedal and depresses it to bring the speed down to 25 km/hour over 4 seconds, all the while saccading within the same ROI to detect unexpected actions of the child, except for one saccade to the end of the road and one slightly to the right of this point. The car is now 60 meters from the child. The child is closer, and the driver is able to confirm that the child is indeed very young, probably 4-6 years old. With no change in the child's pose (i.e. the child is well-balanced and stable, and not rocking the bicycle back and forth), the driver's apprehension level drops, but he is still very cautious because of the age of the child, and drops the speed down to 15 km/hour in 4 seconds. The car is now 35 meters from the child. The driver halves the speed down to about 8 km/hour over 4 seconds, and is about 20 meters from the child.
The car proceeds at this very low speed until it is 5 meters from the child.
The driver then removes the foot from the brake and depresses the accelerator pedal to bring the speed to 40 km/hour in 3 seconds. The signature of this event is extracted and categorized under "Child" (see signature categorization in fig 25 and related text), under sub-category:
"Child on bicycle", sub-sub-category "unaccompanied child on bicycle" and a further, sub-category:
"unaccompanied child on bicycle at edge of the road".
[0126] The learning here is that the driver's reaction is proportionally related to the child's age, distance from the edge of the road (inverse relationship), absence of accompanying adults, and present speed of travel. These reactions include saccades around the ROI, grip and contact area on the steering wheel, reduction in speed (including the quantum of reduction, latency to starting the reduction process, distance from the child before the reduction is applied). In the AV
software, the image processing system processes these factors to form a response, including speed reduction. Without training, traditional AV software will not prepare for evasive actions or reduce speed to account for the unexpected. The overall speeds of AVs are on the lower end compared to humans because they are cautious all the time. Training AVs can make them faster, while helping build more logic and rationale into such situations. If a very small child on a small bicycle is being closely accompanied by an adult, then the image processing will identify the adult following the child's bike and become less cautious. There are variations of such a scenario: for example, there is an adult, but the adult is 5 meters away from the child. Caution and speed reduction will become greater now. Automatic identification of such an "unaccompanied child on bicycle at edge of the road" scenario will become easier, more efficient, and more comprehensive when data from a swarm of drivers is used. The collection of such scenarios will grow with time and become well-defined algorithms in the training software. Over time, variations of "kid on a bike" (like "kid on skateboard") can be added to the set of algorithms, particularly as the test-base grows. New but unidentifiable variants can be manually processed for scenario detection and response.
[0127] Fig 19c shows a scenario where a soccer ball rolls into a suburban road on which a human driver is driving a car. The car is traveling at 50 km/hour. The driver notices the ball (1921) entering the road 50 meters ahead from a point (1920a) behind a tree.
The driver's eyes saccade to the ball and slow-track it for about a second. The direction of the ball is indicated by the broken line arrow (1920b). After confirming that it is a ball rolling into the road, and anticipating the possibility of a child following the ball into the road without watching out for traffic, the driver's grip and contact area on the steering wheel increase slightly. The foot comes off the accelerator pedal and moves onto the brake pedal without depressing it. The eyes stop tracking the ball, but instead saccade to the point from where the ball came, and establish an ROI around that area. After the car gets to within 20 meters of point 1920a, the area around it becomes clearer (not hidden by trees or shrubs). The eyes saccade to a point that is on a backward extension of the arrow and is 5 meters from the road, and establish an ROI there. The car has meanwhile slowed to 45 km/hour (because the accelerator pedal was not depressed). Seeing no person present there, the driver assumes no one is following the ball, and returns the foot to the accelerator pedal 5 meters from point 1920a to return to a speed of 50 km/hour. The signature of this event is then extracted and categorized under "Child" (see signature categorization in fig 25 and related text) rather than "Unexpected Objects". A non-human navigating a vehicle will notice the ball rolling across the road, but will continue if the ball has exited the lane. A human would expect a child to appear unexpectedly, following the ball. The eye movement pattern will be saccading to the ball, smooth pursuit for a short time, and saccading to the region from which the ball might have originated. Depending on the vehicle speed and distance to the ball, the foot might move away from the accelerator pedal and move over to the brake pedal at different speeds, and might depress it very little (or not at all) or a lot. However, the basic signature underlying such variations will have similar patterns.
[0128] Figs 19d1-19d3 show the scenario of a kangaroo that is about to enter a single-lane rural highway on which a human driver is driving a car at 100 km/hour. The sun set an hour earlier. The car has its high-beam lights on. The driver has been driving in a relaxed manner, with just two fingers and a thumb lightly touching (and not pressing down hard on) the steering wheel. One hundred and fifty meters ahead, the driver sees an object moving in his peripheral vision. His eyes saccade to the object, and he notices it is a 1.5 meter tall kangaroo (1931). Both of the driver's hands grab the steering wheel, gripping it (with medium force) with all fingers. The foot simultaneously releases the accelerator pedal and moves over the brake pedal, depressing it with medium firmness. The car is now 100 meters from the kangaroo and moving at 70 km/hour. The driver's eyes are slow-tracking the kangaroo's eyes (which are glowing due to the high beam of the car) as it hops into the driver's lane. An experienced driver, he knows that kangaroos move in mobs, and there might be more of them following the one that just got on the road. He also knows that kangaroos often stop and stare at a car's blinding lights, sometimes even turning around from the middle of the road or right after just crossing it. He continues pressing down on the brake pedal to slow the car down to 50 km/hour, while forming an ROI
around the kangaroo (but fixated on its glowing eyes whenever it looks at the car), slow tracking it whenever it hops.
The kangaroo hops away into the far side of the road just as the car passes it. The signature of this event is extracted and categorized under "Danger" (see signature categorization in fig 25 and related text), under sub-category: "Animals", sub-sub-category "Kangaroo".
Incidents of kangaroos on (or by the side of) the road are recorded and signatures extracted. There will be numerous variations of this signature. For example, the kangaroo stopped in the middle of the road and would not budge, or it turned around and hopped back into the car's lane after reaching the divider line, or there were more kangaroos following the original one.
However, common aspects will include slow-tracking of hopping, or fixation on the kangaroo, all of which can be extracted from eye movement video, road facing camera video (IR and/or ambient light), and long range radar data, and combined with hand and foot sensor data. Pattern analysis can be used to identify both the kangaroo as well as bright spots (eyes) on the roads and shoulders at night on rural or kangaroo-prone roads. Smooth pursuit when looking far away from the side of the road indicates the kangaroos are not close to the road, and therefore there is no danger. The gait of kangaroos varies with their speed. When they are just ambling or feeding, they can use all their limbs. While running at low speeds, they are on their hind limbs, but not hopping very high.
When running fast, they are on their hind limbs and hopping much higher. The gait of kangaroos is also distinguished from that of other animals like cows because of the preference of kangaroos to use their hind limbs. This aspect (of kangaroos preferring hind legs for locomotion) can be exploited by the outside facing video image analysis to distinguish kangaroos from other animals. With numerous such events being captured under different conditions, a robust automated kangaroo detector and countermeasure subroutine can be formed. Capturing the appearance (size, shape, color) and gait of different animals under different conditions allows the extraction of signatures unique to each animal, and categorization under appropriate animal sub-categories. It will be appreciated that the signature extraction schemes in the various scenarios in this disclosure not only capture human actions and reactions to specific events, but also indirectly capture the memories and experience of the human drivers, along with human logic, deduction, rationality and risk-mitigation, since these are the factors that cause drivers to act and react a certain way.
For example, the driver just discussed above knows from experience and memory that kangaroos move in mobs, that there might be many crossing the road, and that kangaroos have a tendency to linger on the road or hop back into the road after seeming to try to cross it. Using such signatures will reduce or negate the need for these actions and reactions of the driver to be programmed into AV software by a human programmer. Such signatures carry a wealth of human knowledge, experience and logic accumulated over years and spread among a wide variety of geographies and populations, and their trade-offs with rationalization and risk management, allowing safe, fast, efficient and pleasant transportation. As societies transition towards non-human vehicle operators, all of this is saved into signatures for use by AVs.
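By way of illustration only, the gait-based distinction mentioned above could be sketched as follows, assuming a tracker has already produced a per-frame bounding box for the animal; the amplitude threshold is illustrative and would in practice be replaced by a classifier trained on many captured signatures.

# Sketch: distinguish a hopping (hind-limb) gait from a walking gait
# using the vertical oscillation of a tracked bounding box.
import numpy as np

def looks_like_hopping(box_centers_y, min_amplitude_px: float = 15.0) -> bool:
    """box_centers_y: vertical center of the animal's bounding box per frame."""
    y = np.asarray(box_centers_y, dtype=float)
    amplitude = (y.max() - y.min()) / 2.0
    # A hopping gait shows a large, periodic vertical displacement.
    return bool(amplitude > min_amplitude_px)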
[0129] Figs 19e, 19f show two scenarios of a dog on a leash by the side of the road, walking towards the road on which a human driver is driving a car. In the first instance (fig 19e), the dog (1941) is close to its human (1940), and the leash (1942) is sagging. In the other case (fig 19f), the dog (1941a) and its human (1940a) are 4 meters apart, with the leash (1942a) taut and the dog appearing to be tugging on the leash. In the first instance, the driver will not observe a possible danger, and will continue driving normally. In the second case, the driver will slow down, expecting the possibility that the leash would give way or that the dog would pull the human along with it as it runs into the road. The driver's eyes will saccade to the dog, form an ROI around it (noticing its body to check its size and whether its pose indicates tugging), then trace the leash and form a simple ROI around the human (checking whether it is an adult, and the body pose to see how much control the human has). Depending on the outcome, the driver slows down or continues at the same speed, with corresponding hand grip/contact area and foot positions.
[0130] Figs 19e1, 19f1 show corresponding eye movements for figs 19e and 19f.
The eye movement overlay is shown separately (for the sake of clarity) in fig 19e1a and fig 19f1a, which also show added fixation details. The eye movement overlay in figs 19e1, 19e1a starts from when the driver notices the dog in his peripheral vision and saccades to it, and ends 2 seconds after this. It should be appreciated that most eye movements are not conscious. Saccade directions are indicated by arrows, fixations are indicated by circles, with the smallest circle being about 100 ms, and the largest one 350 ms. The eye movement overlay in figs 19f1, 19f1a starts from when the driver notices the dog in his peripheral vision and saccades to it, and ends 3 seconds after this. The dog and the human form separate ROIs in fig 19f1, but are a single ROI
in fig 19e1. Signatures are extracted and categorized under "Danger", sub-category "Animal", sub-sub category "Dog", which can have sub-sub-sub categories "large dog", "small dog", "seeing dog". The human event occurrence detection scheme of fig 26a is used to detect that a human event has occurred in fig 19f1a, since there is a sudden foot release from the accelerator and movement over the brake pedal and an increase in hand grip and contact area on the steering wheel, both with associated eye movement to the side of the road and formation of ROIs and saccades/fixations around the dog-human combination. Once a human event has been detected, the next step is to find the associated outside event that caused the human event to occur. Video images from cameras facing the road are analyzed using image processing, and the dog-human pair is identified as corresponding to the eye movement data, as is the edge of the road.
Critical features like the spacing between the dog and the human, the size of the dog, the leash curvature (or lack of it), the human's pose, and the distance to the edge of the road are stored as part of the signature. This forms the process of event signature extraction of fig 26b. This is followed by the categorization, map update and training software update as shown in fig 26b. Several such instances under different conditions and from different drivers and geographical regions are similarly extracted and stored in the database. When the updated training software is used by an AV, and the AV encounters a similar "big dog at edge of road tugging on leash held by human" scenario, it reduces speed and becomes cautious (analyzes the ROIs at a higher priority). Without this updated software, the AV
would have continued without reducing speed significantly, and an accident could probably have occurred if the leash were to break or slip out of the human's hand, or if the dog, dragging its human, entered the road.
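By way of illustration only, the critical features of the dog-on-leash scenario could be encoded as a small signature record with a simple taut-leash check, as sketched below; the field names and the slack ratio are illustrative assumptions.

# Sketch: critical features of the dog-on-leash scenario stored as a
# small signature record, with a simple rule for a taut (tugged) leash.
from dataclasses import dataclass

@dataclass
class DogLeashFeatures:
    dog_human_distance_m: float
    leash_length_m: float           # estimated from the image
    dog_size: str                   # "large" / "small"
    distance_to_road_edge_m: float

def leash_is_taut(f: DogLeashFeatures, slack_ratio: float = 1.1) -> bool:
    # Treat the leash as taut (tugging likely) when the dog-human
    # separation is close to the estimated leash length.
    return f.dog_human_distance_m >= f.leash_length_m / slack_ratio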
[0131] Fig 20 shows prior art vehicles with at least some level of autonomy.
The autonomy conferring elements can be divided into two components: hardware and software.
The hardware module has two layers: sensor and vehicle interface layers. The sensor layer has environment sensors and vehicle sensors. The vehicle interface layer has steering, braking and acceleration systems. The software component has four layers: perception, reaction, planning, and vehicle control. The perception layer has four components: localization, road detection, obstacle avoidance, and pose estimation. The sensor layer gathers information about the environment around the vehicle, for example, using its cameras, lidar, radar and ultrasonic sensors. The data typically relates to obstacles, surrounding traffic, pedestrians, bicycles, traffic signs and lights, roads and lanes, paths and crossings, and GPS coordinates. Vehicle sensors gather data on vehicle speed, acceleration, turning, and direction vectors. Data from the environment and vehicle sensors is fed to the software component. The software component then analyses this data to determine the vehicle's present position relative to a map, what actions need to be performed by the vehicle, planning for future actions, and how the vehicle has to be controlled. The software component then sends control instructions to the vehicle interface layer.
[0132] Fig 21 shows an embodiment of a vehicle with improved autonomous abilities and functionalities. The vehicle has an enhanced set of environment and vehicle sensors, the details of which are shown in fig 22. In addition, the hardware component has human sensors. The hardware module (2101) has two layers: sensor (2105) and vehicle interface (2106) layers. The sensor layer has environment sensors (2105a), vehicle sensors (2105b) and human sensors (2105c). The vehicle interface layer has steering (2106a), braking (2106b), acceleration (2106c), signaling (2106d) and communication (2106e) systems. The software component (2102) has four layers: perception (2107), reaction (2108), planning (2109), and vehicle control (2110). The perception layer has four components: localization (2107a), road detection (2107b), obstacle avoidance (2107c), and pose estimation (2107d). The sensor layer gathers information about the environment around the vehicle, for example, using its cameras, lidar, radar and ultrasonic sensors. The sensor layer additionally gathers data about the vehicle operator's eye movements, foot position, and the contact area and grip of the hands on the steering wheel. Data from the environment, vehicle and human sensors is fed to the software component. The software component then analyses this data to determine the vehicle's present position relative to a map, what actions need to be performed by the vehicle, planning for future actions, and how the vehicle has to be controlled. The software component then sends control instructions to the vehicle interface layer.
[0133] Fig 22 shows details of an enhanced set of environmental sensors that include human sensors. Environmental sensors (2200) include sensors to sense the environment outside the vehicle (2210), sensors to sense vehicle functioning (2230), and human sensors (2250). Outside environment sensors (2210) include: visible cameras (2211) to capture visible wavelength images outside the vehicle, including front, rear and side facing cameras, and infrared cameras (2212) to capture 360 degree images in the infrared wavelength. Lidars (2213) are time-of-flight distance measurement sensors with intensity measurement, using pulsed lasers in the 0.8-2 micron (infrared) wavelength range. Lidars provide a 3D map of the world around the vehicle, including distances to objects. Radars (2214) map the position of close-by objects, while sonar (ultrasonic) sensors (2215) detect nearby objects. A ferromagnetic sensor (2216) detects ferromagnetic objects, particularly those on the road, including buried strips. GPS (2217) uses global positioning satellites to determine the vehicle's position. Other environment sensors include fog (2218), snow (2219) and rain (2220) sensors. Blinding (2221) sensors detect light that is blinding the driver, including the sun low on the horizon and high-beam headlights from vehicles coming from the opposite direction. Vehicle sensors (2230) sense the vehicle's actions, performance and instantaneous position, and include sensors for measuring current brake force (2231) and steering angle (2232), detection of turn signals (2233), status of lights (whether headlights are turned on/off, and high beam) (2234), RPM (2235), odometer (2236), speed (2237), handbrake position (2238), cruise control settings (2239), ABS activation (2240), readings of the vehicle's inertial measurement units (IMU) (2241), and vibration sensors (2242) that detect unusual vibration of the vehicle, for example, from rumble strips, alert strips, speed bumps, gravel, and potholes.
Human sensors (2250) include eye movement sensors (2251), foot position sensors (2252) and hand grip and contact area on steering wheel sensors (2253), and aural (2254) sensors.
[0134] Fig 23 shows the different kinds of human sensors (2250) used, and the events they contribute to recording. Eye movement sensors (2251) detect the following events: saccades (2251a), glissades (2251b), fixations (2251c), smooth pursuits (2251d), microsaccades (2251e), square wave jerks (2251f), drifts (2251g) and tremors (2251h). Foot movement sensors detect 3 aspects: the xyz position of the brake pedal (2252a), the xyz position of the acceleration pedal (2252b), and the xyz position of the foot (2252c) of the driver. See fig 10b and fig 10c (and associated text) for details on the quantities measured. The combination of 2252a, 2252b and 2252c helps make a determination of where the foot is with respect to the brake and accelerator pedals, whether either one of them is being depressed, and to what extent they are being depressed. Hand contact area and grip sensors detect the hand contact area and grip on the steering wheel. The left hand contact area (2253a) and its grip force (2253b), and the right hand contact area (2253c) and its grip force (2253d) on the steering wheel are sensed and measured as discussed under figs 9a-9e (and associated text). Aural sensors (2254) help detect various events having associated sounds like:
emergencies (2254a) (police, ambulance and other emergency vehicle sirens), dangers (2254b) (sounds of wheels screeching, honking by other vehicles, etc.), alerting sounds (2254c), warning sounds (2254d) (for example, police using handheld loudspeakers for warning), Doppler detection (2254e) (for example, to detect whether a police siren is approaching the vehicle or receding from it), and accidents (2254f) (sounds of crashes, fender benders, thuds). Aural events also include normal ambient sounds outside the vehicle (2254g) and inside the vehicle (2254h) (which in essence means no abnormal events are occurring) and directionality (2254i) (the direction from which a particular sound is coming).
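For illustration, the human sensor events of fig 23 could be represented with simple enumerations and containers such as the following Python sketch; all class and member names are hypothetical, not terms from the disclosure.

```python
# Illustrative sketch only: types mirroring the human sensor events of fig 23.
from enum import Enum, auto
from dataclasses import dataclass

class EyeEvent(Enum):          # 2251a-2251h
    SACCADE = auto()
    GLISSADE = auto()
    FIXATION = auto()
    SMOOTH_PURSUIT = auto()
    MICROSACCADE = auto()
    SQUARE_WAVE_JERK = auto()
    DRIFT = auto()
    TREMOR = auto()

class AuralEvent(Enum):        # 2254a-2254i
    EMERGENCY = auto()
    DANGER = auto()
    ALERT = auto()
    WARNING = auto()
    DOPPLER = auto()
    ACCIDENT = auto()
    AMBIENT_OUTSIDE = auto()
    AMBIENT_INSIDE = auto()
    DIRECTIONALITY = auto()

# Foot (2252) and hand (2253) sensors yield continuous readings rather than
# discrete events, so simple containers are used here.
@dataclass
class FootReading:
    brake_pedal_xyz: tuple      # 2252a
    accel_pedal_xyz: tuple      # 2252b
    foot_xyz: tuple             # 2252c

@dataclass
class HandReading:
    left_contact_area: float    # 2253a
    left_grip_force: float      # 2253b
    right_contact_area: float   # 2253c
    right_grip_force: float     # 2253d
```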
[0135] Fig 24 shows an event identifying module. This module uses the data from the sensors in fig 23 to identify and extract events. It should be noted that the normal outside ambient (2254g) and normal inside ambient (2254h) have no events associated with them, and so are not used by the event identifying module except as reference data (since there is always road, engine and passing traffic noise, which either has to be subtracted from event data or remains unused if there is no event). Although all the sensors are continuously sensing, events do not occur all the time, but occur according to the scheme of fig 26a. The event identifying module (2360) includes three sub-modules: the outside environment event identifying module (2310), the vehicle sensor logger and identifying module (2330), and the human event identifying module (2350). Each of these has its own sub-modules. The outside environment event identifying module (2310) has the following sub-modules: visible cameras feature identifying module (2311), IR cameras feature identifying module (2312), lidar feature identifying module (2313), long (2314a), short (2314b) and medium (2314c) range radars feature identifying module, ultrasonic feature identifying module (2315), ferromagnetic object feature identifying module (2316), GPS logger (2317), fog density logger (2318), snow visibility logger (2319), rain visibility logger (2320), and blinding light (sun, high beam) logger (2321). The vehicle sensor logger and identifying module (2330) has the following sub-modules: brake force logger (2331), steering angle logger (2332), turn signal logger (2333), headlight logger (2334), RPM logger (2335), distance logger (2336), speed logger (2337), handbrake logger (2338), cruise control logger (2339), ABS logger (2340), inertial measurement unit (IMU) logger (2341), and vibration logger and feature identifier (2342).
The human event identifying module (2350) has the following sub-modules: eye movement event identifying module (2351), foot event identifying module (2352), hand event identifying module (2353), and aural event identifying module (2354).
[0136] Fig 25 shows the categorization of event signatures (and their priorities) so that they can be stored, recalled and used appropriately. It will be appreciated that the priorities are not in any particular order. For example, priority B can be made the highest priority in an AV's software.
The categorization process can use several variants. For example, it can be based on eye movements correlated with other human, vehicle, and outside sensors. For example, saccades to a point, fixation, and return saccades to that point followed by cautious slowing down indicate a possible unsafe situation. However, a saccade to a point followed by immediate slowing down indicates a more immediate danger. Such a scenario can be accompanied by rapid checking of the side-view and/or rear-view mirrors in anticipation of performing a cautionary action like a lane change or a complete stop. When analyzing this scenario to extract training information, if there is any confusion as to what feature the eye had saccaded to because multiple objects were present in the line of sight but at different depths, autorefractor information (if available) about the focal length of the eye's lens can be used to determine what was fixated on. From this scenario, several concepts can be extracted, including which features, relative to the lane on the road, require caution when they appear, the judged distance to the feature, the slow-down and braking profile depending on what the feature is, and the cautionary, defensive and evasive actions to be performed.
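As a hedged illustration of this kind of correlation, the sketch below distinguishes the two eye-movement-plus-braking patterns just described; the function name, deceleration threshold and timing threshold are assumptions chosen only to make the example concrete.

```python
# A minimal heuristic sketch of the correlation described above. All names and
# numeric thresholds are illustrative assumptions, not values from the patent.
def classify_threat(eye_events, decel_mps2, decel_onset_s):
    """eye_events: ordered list of strings such as
    ["saccade", "fixation", "saccade_return"];
    decel_mps2: peak deceleration during the episode;
    decel_onset_s: seconds between the first saccade and braking onset."""
    revisited = "saccade_return" in eye_events and "fixation" in eye_events
    if "saccade" in eye_events and decel_mps2 > 3.0 and decel_onset_s < 0.5:
        return "immediate_danger"        # saccade followed by immediate slow-down
    if revisited and 0.0 < decel_mps2 <= 3.0:
        return "possible_unsafe"         # repeated checks plus cautious slowing
    return "no_threat"

# Example: repeated glances at a point combined with gentle braking
print(classify_threat(["saccade", "fixation", "saccade_return"], 1.5, 2.0))
# -> "possible_unsafe"
```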
[0137] The event signatures include: Danger (priority A) 2501, Child (priority B) 2502, Efficiency (priority C) 2503, Courtesy (priority D) 2504, Special occasions (priority E) 2505, Weather related (priority F) 2506, New traffic situation (priority G) 2507, Unclear situation (priority H) 2508, Startled (priority I) 2509, Unexpected objects (priority J) 2510, Unexpected actions of others (priority K) 2511, Sudden actions of others (priority L) 2512, Comfort levels-speed, distance (priority M) 2513, Environment (low-light, sun-in-eyes, high-beam) (priority N) 2514, and Legal (priority O) 2515.
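For reference, the categories and priorities of fig 25 can be captured in a simple mapping such as the sketch below (the dictionary name is an assumption); as noted in paragraph [0136], the priority ordering can be rearranged in an AV's software.

```python
# Sketch of the fig 25 categories as a priority-keyed mapping.
EVENT_SIGNATURE_CATEGORIES = {
    "A": "Danger",                                            # 2501
    "B": "Child",                                             # 2502
    "C": "Efficiency",                                        # 2503
    "D": "Courtesy",                                          # 2504
    "E": "Special occasions",                                 # 2505
    "F": "Weather related",                                   # 2506
    "G": "New traffic situation",                             # 2507
    "H": "Unclear situation",                                 # 2508
    "I": "Startled",                                          # 2509
    "J": "Unexpected objects",                                # 2510
    "K": "Unexpected actions of others",                      # 2511
    "L": "Sudden actions of others",                          # 2512
    "M": "Comfort levels (speed, distance)",                  # 2513
    "N": "Environment (low-light, sun-in-eyes, high-beam)",   # 2514
    "O": "Legal",                                             # 2515
}
```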
[0138] Event signature Danger (2501) relates to events that are dangerous, with potential for human injury or property damage. For example, consider a scenario where a potential accident was averted because a heavy truck entered a road without yielding. The event signature can include eye movements (like saccades, fixations, slow tracking), binaural recording, along with hand and foot sensor data, all combined with road-facing video of a situation where a collision with this truck could potentially have occurred, but the driver took evasive action to avert the accident.
[0139] Event signature Child (2502) relates to events associated with a child, either averting an accident, or driving cautiously in expectation of an unpredictable, illegal or unexpected action by a child. For example, consider a scenario in which potential injury to a child was averted. The child, along with a caregiver, is walking straight ahead along the sidewalk of a road. The driver on the road notices the child turning back and looking at a bird on the road's divider. The driver slows down expecting the child to cross the road to pursue the bird. The caregiver is unaware of what is going on. As expected, the child lets go of the caregiver and darts across the road. The driver is already slowing down and completely alert, is prepared to stop, and does stop a meter from the child. Eye movement data, hand and foot sensor data, and forward looking video are all analyzed to extract relevant information and formulate an event signature.
[0140] Event signature Efficiency (2503) relates to events that help in improving the efficiency of transportation. This can be, for example, taking the shortest route, taking the fastest route, or avoiding contributing to traffic congestion on a particular segment of a path. These scenarios are typical in congested portions of large cities. The driver takes side routes which are slightly longer, but which help get to the destination faster, and also help prevent congestion at a particularly notorious segment.
[0141] Event signature Courtesy (2504) relates to actions of the driver that reflect politeness, civility and courtesy. This can be, for example, the driver slowing down to let another car enter the lane. In this situation, there is no other need or indicator for slowing down, whether legal (traffic signs or laws), traffic conditions or other event categories. Eye movement data, aural data, hand and foot sensor data, and forward looking video are all analyzed to extract relevant information and formulate an event signature.
[0142] Event signature Special Occasions (2505) relates to non-normal occasions, and the driver's response to them. For example, traffic diversions are in place for a popular tennis match.
Roads approaching the venue have traffic diversion signs. However, these signs are of the road-side moving/scrolling display type. Such signs are not in the database of regular traffic signs. The driver follows these diversions, although this route is not the optimal one as per the map of the region. In ordinary circumstances, this action by the driver would be deemed inefficient and scored low. However, if the time period for the segment of the path has already been indicated as a Special Occasion, and the driver follows the diversions, then the actions of the driver will be used to extract an event signature. Such a signature can include: saccading to the road-side display, which becomes a new region of interest (ROI), saccades/smooth pursuits following the scrolling/moving letters within this ROI while saccading back and forth to the traffic ahead, slowing down to read the signs (foot movement), and gripping the steering wheel a little tighter.
[0143] Event signature Weather Related (2506) relates to environmental (local weather) characteristics that cause a driver to change driving characteristics. For example, during the first rain, roads become slippery, and an experienced driver will slow down much more than usual when turning. During subsequent rains, the magnitude of slowing down will reduce. As another example, on a rainy day with wet and slippery roads, the driver will maintain a longer following distance and be more vigilant when traffic is merging (the foot is more often hovering over the brake, with a lot more alternating acceleration and braking, while the hands firmly grip the steering wheel, and there are a lot more saccades towards adjacent lanes).
[0144] Event signature New Traffic Situation (2507) relates to a driver's behavior during changed traffic situations. These can include accidents ahead, lane closures, or certain segments being converted to one-way roads. These situations will generally be a surprise to drivers. Their response to these situations will deviate from the normal, and the routes they take will vary from what is required by a map. Hand and foot sensors will detect some indecisiveness (unusual slowing down, foot off the accelerator and hovering over the brake, with intermittent pressing of the brake pedal, both hands on the steering wheel), while the eyes will register regions of unusually slow traffic (saccades to various portions of oncoming as well as on-road traffic), which is confirmed by forward looking camera video.
[0145] Event signature Unclear Situation (2508) relates to situations in which the driver is not sure what to do next. For example, when lane markers on roads are faded, drivers traversing that segment of the path after a long time will be confused as to the lane boundaries. This can translate into the foot getting off the accelerator and hovering over the brake pedal without depressing it, even though the speed limit for that segment is much higher.
Other examples are: a situation when traffic lights are malfunctioning, or when another car has turned on its turn indicator but is not entering the lane the driver is on. Lack of clarity in these situations can be inferred from saccades to and from different ROIs, hand grip pressure and foot position.
Aural sensors may not detect any abnormality in ambient sounds.
[0146] Event signature Startled (2509) relates to an event in which the driver is startled. In such a situation, the driver becomes alert instantly. The hand-grip tightens instantly, with more fingers and more surface area of the palms making contact with the steering wheel.
The foot instantly gets off the accelerator and moves over the brakes, usually depressing the brakes at least slightly. Eye movements will indicate rapid saccades between very few ROIs. An example is when a truck behind sounds its air-horn unexpectedly. Another example is a very small bird flying across the road right in front of a car (for example, the bird enters the road 5 meters ahead when the car is traveling at 80 km/hour), startling the driver.
The bird has no potential to damage the car. There will be a sudden foot movement from the accelerator to the brake, an instantaneous increase in grip and contact area on the steering wheel, a saccade to the bird and then a very rapid smooth pursuit tracking the bird as it flies away, with the steering wheel grip force relaxing almost instantly (though more slowly than it tightened at the beginning of this event, when the bird was first sighted) and the foot going back to the accelerator. This example event lasts around a second or two.
[0147] Event signature Unexpected Objects (2510) relates to an event in which an unexpected object appears to the driver. In such a situation, the driver becomes alert gradually (for example, as the object comes closer and its visual clarity increases). The hand-grip tightens gradually, with more fingers and more surface area of the palms making contact with the steering wheel as the object comes closer. The foot gets off the accelerator and moves over the brakes gradually. Eye movements will indicate a rapid saccade to the object, then fixations and saccades within this region, then a saccade to the rear-view or side-view mirror, and then a saccade back to and within the object ROI. An example is a large bird hopping across the road 100 meters ahead while the vehicle is traveling at 60 km/hour.
The bird has no potential to cause major damage to the car. There will be a slow foot movement from the accelerator to the brake (which is not depressed), along with a saccade to and within the ROI that defines the bird. This is followed by a slow smooth pursuit as the bird hops away from the road, with the steering wheel grip force relaxing and the foot going back to the accelerator. This example event lasts over 3 seconds. Another example is pieces of shredded tire on a highway appearing 200 meters ahead while traveling at 100 km/hour.
[0148] Event signature Unexpected Actions of Others (2511) relates to events that are dictated by the actions of other vehicles. For example, when a car in front travels at 60 km/hour on a highway marked 100 km/hour, the driver is forced to slow down. Such an event is usually accompanied by saccades to the object in front, then to the rear-view mirror and then the side-view mirror, all the while the foot has moved from the accelerator to the brake and the steering wheel grip has tightened slightly along with a greater contact area. The driver is not startled, nor is the car in front an unexpected object.
[0149] Event signature Sudden Actions of Others (2512) relates to events where the actions of other vehicles on the road lead to a driver performing a reflexive or conscious action. For example, when a vehicle in an adjacent lane swerves very slightly towards the lane of a driver (but stays within its own lane), the driver swerves away instantaneously, and then slows down slightly. Eye movements will indicate a sudden saccade as the swerving vehicle enters peripheral vision.
The saccade is followed by almost instantaneous foot movement away from the accelerator and onto the brake, which is depressed (there is no hovering over the brake, the foot depresses it immediately), while hand-grip and contact-area values increase instantly. The time period for this example event is about 0.5 seconds.
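As a hedged illustration, the three fast-reaction signatures described in paragraphs [0146], [0147] and [0149] could be told apart mainly by their time course; the sketch below uses thresholds loosely based on the example durations given in those paragraphs, and the function name and the exact numbers are assumptions rather than values from the disclosure.

```python
# Illustrative heuristic for separating Startled (2509), Unexpected Objects
# (2510) and Sudden Actions of Others (2512) by reaction time course.
def classify_fast_reaction(grip_rise_s, brake_depressed, event_duration_s):
    """grip_rise_s: time for the steering-wheel grip force to rise;
    brake_depressed: whether the brake was actually pressed;
    event_duration_s: total duration of the episode."""
    if event_duration_s <= 0.5 and brake_depressed:
        return "sudden_actions_of_others"   # 2512: reflexive swerve and brake
    if grip_rise_s < 0.5 and event_duration_s <= 2.0:
        return "startled"                   # 2509: instant grip, brief event
    if grip_rise_s >= 1.0 and event_duration_s > 3.0 and not brake_depressed:
        return "unexpected_object"          # 2510: gradual alertness, hovering foot
    return "unclassified"

print(classify_fast_reaction(0.2, True, 1.5))   # -> "startled"
print(classify_fast_reaction(2.0, False, 4.0))  # -> "unexpected_object"
```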
[0150] Event signature Comfort Levels (2513) covers event signatures surrounding drivers' attempts to adjust driving parameters to suit their comfort levels. This can be, for example, adjusting following distance, speed, or lane position, preferring a longer route over a much shorter but crowded route, or avoiding driving close to very large vehicles.
These events are typically much longer in time frame, with most sensor readings spread over a larger time (slower foot motions, a lighter grip on the steering wheel with less contact area), including slower speeds and higher latencies for saccades, a near-absence of microsaccades, and very low amplitude glissades.
An example is when a driver driving on a divided multi-lane highway with sparse traffic encounters a long segmented trailer (carrying two levels of cars) ahead. The driver is uncomfortable driving behind the trailer, and prepares to get ahead of it by switching lanes and merging back. Slow saccades are directed to the rear-view and side-view mirrors, and a gradual speeding up of the car occurs (the foot stays on the accelerator since there was no braking for several minutes before the start of this event). The steering wheel is gripped a little tighter than before (the previous grip was of very low value, with contact from only a single hand and three fingers; the present grip becomes slightly tighter, with two hands and more fingers engaged).
Saccades to the trailer, rear-view and side-view mirrors can all be, for example, one to two seconds apart during the lane change procedure. After the lane change and getting ahead of the trailer (for example, after 15 seconds), switching back to the original lane involves slow saccades mostly directed to the rear-view and side-view mirrors.
[0151] Event signature Environment (2514) relates to driving behavior events that are affected by the environment; examples are low light levels, the sun straight ahead and low on the horizon, and high-beam lights of oncoming traffic. When any of these events happens rapidly or unexpectedly, the driver slows down, maintains a longer following distance, and is more cautious, all of which mostly translate to the foot hovering over or depressing the brakes and a tighter grip and higher contact area on the steering wheel, without affecting saccades, glissades and microsaccades.
[0152] Event signature Legal (2515) relates to a driver's action while following legal guidelines.
For example, a driver stopping the car at the instruction of a police officer waving the driver to pull over, giving way to a ministerial motorcade, or pulling over for a random roadside breath test. These events are not routine in any segment of a path, and may not happen to every driver on that segment. They can appear slightly similar to stopping at traffic lights, but are distinguishable because there are no traffic lights on the corresponding map. These events can be accompanied by the driver pulling away from the road and onto the side of the road or a non-road area.
They can also involve a general slowing down, with slow tracking of vehicles on the driver's side (faster traffic is on the driver's side).
[0153] Fig 26a shows a human event occurrence detection scheme, while fig 26b shows how this detection scheme feeds data into an analysis scheme to extract signatures and use them to train AVs.
This scheme is used to determine when an outside event has occurred. The sensors are continuously capturing data. However, not all of this data eventually goes towards training an AV. Specific events occurring outside the vehicle are correlated with human sensor data and vehicle sensor data. Eye movement events and aural events are classed as primary human events, while foot events and hand events are classed as secondary human events. When at least one each of the primary and secondary human events has occurred, there is a possibility that this was caused by, or in anticipation of, an outside event. In fig 26b, these human events are compared to the pre-existing map to confirm whether the human events correspond to an outside event; if there is no correlation, no outside event has occurred. If there is a correlation, then there is an expectation that an unusual outside event (outside the car) has occurred to which the driver is responding. For example, on a divided highway with sparse traffic, drivers might increase speed when they realize they are driving below the speed limit, or decrease speed when their speed has crept above the speed limit. However, there was no outside event that caused these actions, and therefore no correlation between the human events and what is happening outside the car. Similarly, when following a routine path home from their workplace, drivers will have the usual patterns of saccades, glissades, microsaccades, fixations, hand and foot sensor readings, and similar aural recordings. In these cases, an unusual outside event has not occurred to cause a change in their normal driving pattern.
[0154] In fig 26a, data relating to eye movement (2351), aural (2354), foot (2352) and hand (2353) events are fed to the eye movement event comparator (2601), aural event comparator (2604), foot event comparator (2602) and hand event comparator (2603), respectively. The comparison is between the respective events at time T and time T+ΔT, where ΔT is a small increment in time.
This comparison helps determine whether a change in a human event has occurred in the time period ΔT. Thresholds can be set to determine what constitutes a change. For example, a 25% increase in hand contact area on the steering wheel and/or a 50% increase in total grip force on the steering wheel can be set as the minimum for triggering a change-determination. Similar threshold settings can be used for other human events. The thresholds can be tailored for individuals, path locations (for example, rural versus urban), male/female drivers, the type of vehicle being driven, the time of day and other factors. If no change has occurred, the comparison continues starting with the next T, where the next T = T+ΔT. If a change has indeed occurred, then a check is made (2610) to see whether at least one each of the primary and secondary human events has changed for this time period. If the answer is in the negative, then the determination is made that no outside event has occurred (2611). If the answer is affirmative, then an outside event has probably occurred (2612), and the human events are compared (2613) with the map corresponding to the path segment for the same time period T to T+ΔT. The results of this comparison are shown in fig 26b. It should be noted that while all these comparisons are going on, each of 2351-2354 is continuously feeding data to each of 2601-2604, respectively. This process continues irrespective of the outcomes at 2601-2604 and 2610.
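The following is a minimal sketch of this change-detection step, assuming hypothetical Python names and a simple dictionary interface; the 25% contact-area and 50% grip-force thresholds are the example values given above, while the foot-movement threshold and the eye/aural change tests are purely illustrative.

```python
# Sketch of the fig 26a change check over the interval T to T+dT.
PRIMARY = ("eye", "aural")      # primary human events
SECONDARY = ("foot", "hand")    # secondary human events
FOOT_MOVE_THRESHOLD_M = 0.05    # illustrative assumption (metres)

def channel_changed(name, prev, curr):
    if name == "hand":
        # example thresholds from the text: +25% contact area or +50% grip force
        return (curr["contact_area"] >= 1.25 * prev["contact_area"] or
                curr["grip_force"] >= 1.50 * prev["grip_force"])
    if name == "foot":
        return abs(curr["z"] - prev["z"]) > FOOT_MOVE_THRESHOLD_M
    return curr != prev          # eye / aural: a different event type occurred

def outside_event_probable(prev, curr):
    """Step 2610: require at least one primary and one secondary change (-> 2612)."""
    flags = {name: channel_changed(name, prev[name], curr[name])
             for name in PRIMARY + SECONDARY}
    return any(flags[n] for n in PRIMARY) and any(flags[n] for n in SECONDARY)

# Example (illustrative values): tightened grip plus a new saccade target
prev = {"eye": "fixation", "aural": "ambient",
        "foot": {"z": 0.00}, "hand": {"contact_area": 90.0, "grip_force": 10.0}}
curr = {"eye": "saccade", "aural": "ambient",
        "foot": {"z": 0.01}, "hand": {"contact_area": 140.0, "grip_force": 12.0}}
print(outside_event_probable(prev, curr))   # True: primary (eye) + secondary (hand)
```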
Regarding eye movements, it should be noted that tremors, microsaccades and drifts can be used as alternatives to, or to augment, fixation detection. Similarly, glissades can be used as alternatives to, or to augment, saccade detection, or for detecting the end of a saccade.
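As a related illustration, a common velocity-threshold approach can separate saccade samples from fixation samples in a raw gaze trace; the sketch below is a generic technique with an illustrative 30 deg/s threshold, not an algorithm specified in this disclosure.

```python
# Generic velocity-threshold (I-VT style) labelling of a 1-D gaze trace.
def label_gaze_samples(angles_deg, timestamps_s, velocity_threshold=30.0):
    """angles_deg: gaze direction per sample (degrees, one axis for simplicity);
    timestamps_s: matching sample times. Returns one label per sample interval."""
    labels = []
    for i in range(1, len(angles_deg)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        velocity = abs(angles_deg[i] - angles_deg[i - 1]) / dt
        labels.append("saccade" if velocity > velocity_threshold else "fixation")
    return labels

print(label_gaze_samples([0.0, 0.1, 5.0, 5.1], [0.00, 0.01, 0.02, 0.03]))
# ['fixation', 'saccade', 'fixation']
```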
[0155] Fig 26b shows event signature extraction, categorization, map update and training software update using the human event data from fig 26a after confirmation that an outside event has probably occurred. The human event is compared to the corresponding map segment to see whether it was an expected event. For example, if the map indicates that there is a traffic light for the segment corresponding to where a driver stopped the car (saccades to the traffic light above and ahead of the car, hand contact area increased slightly, foot off the accelerator and over the brake, and slow depressing of the brake to come to a complete stop), then there was probable cause for the car to have stopped on the road. Data from the vehicle sensor logger and identifying module (2330), the outside environment feature identifying module, and the human event identifying module corresponding to this time segment (T to T+ΔT) are captured and stored in a data vector. A signature is extracted from this data vector. This forms the signature of the outside event that occurred and caused the human to act or react in a certain way.
This signature is compared to the existing signature database to see if it is a known signature, i.e., whether a similar event has occurred in the past. If it is a known signature, the signature's count is incremented in a user section of the map (the main map is not altered). If it is an unknown signature, then the signature is categorized under the scheme of fig 25 as belonging to one of 2501-2515. This signature is then added to the appropriate category in the signature database, and also added to the user section of the map. The AV's training software is then updated.
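A hedged end-to-end sketch of this flow follows; the function names are hypothetical, and the hash-based lookup is only a stand-in for whatever similarity comparison the signature database actually performs.

```python
# Illustrative sketch of the fig 26b flow: build a data vector for the
# interval, extract a signature, look it up, and update the user section of
# the map. All function bodies are placeholders/assumptions.
import hashlib
import json

def build_data_vector(vehicle_log, outside_features, human_events):
    """Capture the three data sources for the interval T to T+dT."""
    return {"vehicle": vehicle_log, "outside": outside_features, "human": human_events}

def extract_signature(data_vector):
    """Stand-in for signature extraction: a stable hash of the data vector."""
    encoded = json.dumps(data_vector, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()

def process_event(data_vector, signature_db, user_map, categorize):
    """signature_db: signature -> category; user_map: signature -> count;
    categorize: callable implementing the fig 25 categorization."""
    sig = extract_signature(data_vector)
    if sig in signature_db:                       # known signature: bump its count
        user_map[sig] = user_map.get(sig, 0) + 1
    else:                                         # unknown: categorize and store
        signature_db[sig] = categorize(data_vector)
        user_map[sig] = 1
    return sig                                    # training software update follows
```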

Claims

1. A training system consisting of an imaging device configured to acquire images of an eye of a subject, and an image analysis system configured to extract data from said images, wherein the data includes eye movement events associated with at least one of saccades, glissades, microsaccades, fixations, smooth pursuit, dwells, or square-wave jerks, and wherein the data is used to train vehicles.
CA2986160A 2017-11-20 2017-11-20 Training of vehicles to imporve autonomous capabilities Abandoned CA2986160A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA2986160A CA2986160A1 (en) 2017-11-20 2017-11-20 Training of vehicles to imporve autonomous capabilities

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CA2986160A CA2986160A1 (en) 2017-11-20 2017-11-20 Training of vehicles to imporve autonomous capabilities

Publications (1)

Publication Number Publication Date
CA2986160A1 true CA2986160A1 (en) 2019-05-20

Family

ID=66628961

Family Applications (1)

Application Number Title Priority Date Filing Date
CA2986160A Abandoned CA2986160A1 (en) 2017-11-20 2017-11-20 Training of vehicles to imporve autonomous capabilities

Country Status (1)

Country Link
CA (1) CA2986160A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110488814A (en) * 2019-07-30 2019-11-22 深圳市前海胡桃科技有限公司 A kind of automatic Pilot method, apparatus and terminal device
CN111459159A (en) * 2020-03-16 2020-07-28 江苏理工学院 Path following control system and control method
CN111915898A (en) * 2020-07-24 2020-11-10 杭州金通科技集团股份有限公司 Parking monitoring AI electronic post house
US20220032883A1 (en) * 2020-08-03 2022-02-03 Toyota Jidosha Kabushiki Kaisha Driving assistance device
US11752986B2 (en) * 2020-08-03 2023-09-12 Toyota Jidosha Kabushiki Kaisha Driving assistance device
CN113859253A (en) * 2021-11-24 2021-12-31 吉林大学 Real-time estimation method for quality in vehicle driving process
CN113859253B (en) * 2021-11-24 2023-03-07 吉林大学 Real-time estimation method for mass in vehicle driving process

Similar Documents

Publication Publication Date Title
US10769463B2 (en) Training of vehicles to improve autonomous capabilities
AU2024202759A1 (en) Autonomous Vehicles and Advanced Driver Assistance
CA2986160A1 (en) Training of vehicles to imporve autonomous capabilities
AU2018267553B2 (en) Systems and methods to train vehicles
JP7226479B2 (en) Image processing device, image processing method, and moving object
JP7155122B2 (en) Vehicle control device and vehicle control method
AU2018267541A1 (en) Systems and methods of training vehicles
US7245273B2 (en) Interactive data view and command system
CN106663377B (en) The driving of driver is unable to condition checkout gear
EP1721782B1 (en) Driving support equipment for vehicles
JP7099037B2 (en) Data processing equipment, monitoring system, awakening system, data processing method, and data processing program
US9589469B2 (en) Display control method, display control apparatus, and display apparatus
CN102975718B (en) In order to determine that vehicle driver is to method, system expected from object state and the computer-readable medium including computer program
CN110826369A (en) Driver attention detection method and system during driving
JPWO2020145161A1 (en) Information processing equipment, mobile devices, and methods, and programs
CN113040459A (en) System and method for monitoring cognitive state of vehicle rider
CN113276822B (en) Driver state estimating device
CN113276821B (en) Driver state estimating device
JP7342637B2 (en) Vehicle control device and driver condition determination method
JP7342636B2 (en) Vehicle control device and driver condition determination method
JP4345526B2 (en) Object monitoring device
US11908208B2 (en) Interface sharpness distraction mitigation method and system
Classen et al. 10 Driver Capabilities in the Resumption of Control
Hristov Influence of Road Geometry on Driver’s Gaze Behavior on Motorways
US20240051585A1 (en) Information processing apparatus, information processing method, and information processing program

Legal Events

Date Code Title Description
FZDE Discontinued

Effective date: 20210831
