WO2024074246A1 - System and method for evaluation of the driving of a vehicle - Google Patents

System and method for evaluation of the driving of a vehicle

Info

Publication number
WO2024074246A1
Authority
WO
WIPO (PCT)
Prior art keywords
vehicle
sensors
scene
image
computer
Prior art date
Application number
PCT/EP2023/073106
Other languages
French (fr)
Inventor
Vinod RAJENDRAN
Original Assignee
Continental Automotive Technologies GmbH
Priority date
Filing date
Publication date
Application filed by Continental Automotive Technologies GmbH
Publication of WO2024074246A1

Classifications

    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00: Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/08: Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
    • B60W40/09: Driving style or behaviour
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58: Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00: Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W50/08: Interaction between the driver and the control system
    • B60W50/14: Means for informing the driver, warning the driver or prompting a driver intervention
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00: Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001: Planning or execution of driving tasks
    • B60W60/0027: Planning or execution of driving tasks using trajectory prediction for other traffic participants
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Definitions

  • the invention relates generally to the evaluation of the driving of a vehicle, and more specifically to a system and method for evaluation of the driving of an autonomous or non-autonomous driving vehicle based on the surrounding environment.
  • the methods described tend to analyse and evaluate the driving of a vehicle by considering how the vehicle is manipulated.
  • the driving of a vehicle cannot be considered in isolation as many factors contribute to the driving of a vehicle and associated safety level on the road, including the complexity of the environment surrounding the vehicle driven.
  • evaluation of the driving of a vehicle cannot be considered at a single timepoint based on a specific manoeuvre of the vehicle. An accurate evaluation of the driving of a vehicle is therefore not achieved in existing methods which focus on the manipulation of a vehicle in isolation and based on a single manoeuvre of the vehicle.
  • the present invention provides a computer-implemented method for evaluation of the driving of an autonomous or a non-autonomous driving vehicle, the method comprising: receiving a plurality of images captured by a plurality of sensors, wherein the plurality of sensors is mounted on a vehicle and positioned to capture a scene on a side of the vehicle; selecting a representative sensor pose for the side of the vehicle based at least on the plurality of images; and determining at least one evaluation metric based on at least one image associated with the representative sensor pose.
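  • By way of illustration only, the three claimed steps could be chained as in the following Python sketch; the helper callable `count_participants`, the `SensorImage` container, and the returned metric names are assumptions made for this example and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

@dataclass
class SensorImage:
    pixels: object                  # image data captured by one sensor
    sensor_pose: Tuple[float, ...]  # pose (position/orientation) of the capturing sensor
    timestamp: float                # capture time

def evaluate_side(
    images: Sequence[SensorImage],
    count_participants: Callable[[SensorImage], int],
) -> Dict[str, float]:
    """Chain the three claimed steps for one side of the vehicle.

    `count_participants` stands in for an object detector; it is a
    placeholder for this sketch, not part of the patent."""
    # Step 1: the plurality of images captured on this side has been received.
    # Step 2: select a representative sensor pose, here the pose of the image
    # containing the largest number of traffic participants.
    counts = [count_participants(im) for im in images]
    representative = images[counts.index(max(counts))]
    # Step 3: determine at least one evaluation metric from the image(s)
    # associated with the representative sensor pose.
    return {
        "num_traffic_participants": float(max(counts)),
        "representative_timestamp": representative.timestamp,
    }
```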
  • the computer-implemented method of the present invention is advantageous over known methods as the method provides a better evaluation of the driving of a vehicle by accounting for the surroundings of the ego vehicle, wherein such evaluation may be used to increase overall driving safety.
  • the computer-implemented method may be used to evaluate the driving of any vehicle, whether or not the driving is by a driver or an autonomous driving system.
  • the computer-implemented method of the present invention provides information on the surroundings of the ego vehicle by using sensors to capture information about the scene surrounding the ego vehicle, and using the information captured by the sensors to analyse the scene surrounding the ego vehicle.
  • the computer-implemented method of the present invention is also advantageous as the selection of a representative pose for each plurality of sensors based at least on the plurality of images ensures that the images subsequently used to analyse the scene surrounding the ego vehicle are the most informative and representative of the scene surrounding the ego vehicle.
  • a preferred method of the present invention is a computer-implemented method as described above, wherein the plurality of sensors captures at least a 180° field of view of the side of the vehicle, and/or each sensor of the plurality of sensors has an overlapping field of view with a neighbouring sensor, and/or each sensor of the plurality of sensors has a field of view of at least 60°.
  • the above-described aspect of the present invention has the advantage that as much information is obtained and captured about the surroundings as possible, by having a maximum field of view covered by each plurality of sensors, while ensuring that no portion of the environment surrounding the ego vehicle is missed out and keeping the number of sensors required low.
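  • A minimal sketch of how the field-of-view conditions above could be checked for a given sensor arrangement; the yaw angles, the 60° fields of view, and the function name `covers_side` are illustrative assumptions, not values taken from the patent.

```python
from typing import List, Tuple

def covers_side(sensors: List[Tuple[float, float]], required_span: float = 180.0) -> bool:
    """Check that sensors, given as (centre_yaw_deg, fov_deg) pairs, jointly cover
    the required angular span with each neighbour overlapping the next.
    Angles are measured in the side's local frame."""
    # Convert each sensor to the angular interval it observes.
    intervals = sorted((c - f / 2.0, c + f / 2.0) for c, f in sensors)
    start, end = intervals[0]
    for lo, hi in intervals[1:]:
        if lo > end:          # gap between neighbouring fields of view
            return False
        end = max(end, hi)
    return (end - start) >= required_span

# Example: four 60-degree sensors spaced 40 degrees apart overlap and span 180 degrees.
print(covers_side([(-60.0, 60.0), (-20.0, 60.0), (20.0, 60.0), (60.0, 60.0)]))  # True
```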
  • a preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein selecting a representative sensor pose for the side of the vehicle is further based on at least one novel image of the scene on the side of the vehicle.
  • the above-described aspect of the present invention has the advantage that using novel images of the scene on the side of the ego vehicle allows a more comprehensive evaluation and analysis of the scene surrounding the vehicle without the incorporation of additional sensors.
  • the novel images may also allow the detection of traffic participants which may otherwise be occluded or partially occluded in the individual received plurality of images.
  • a preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the at least one novel image of the scene on the side of the vehicle is generated by: receiving depth data corresponding to the plurality of images captured by the plurality of sensors; generating a representation of the scene on the side of the vehicle based on the received plurality of images and corresponding depth data, wherein the representation accounts for space and time; and generating at least one novel image of the scene on the side of the vehicle based on the representation of the scene on the side of the vehicle.
  • the above-described aspect of the present invention has the advantage that generating a representation of the scene that accounts for space and time allows the generation of novel images with perspectives of unseen poses or dimensions for a more comprehensive overview of the scene at the side of the ego vehicle as compared to only relying on images captured by the at least one plurality of sensors, which may comprise one or more partially or completely occluded objects depending on the sensor pose and the time at which the plurality of images and corresponding data are recorded or captured.
  • a preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein selecting a representative sensor pose for the side of the vehicle comprises: identifying a plurality of traffic participants in the plurality of images and/or at least one novel image; and selecting, from the plurality of images and/or at least one novel image, a representative image, wherein the representative sensor pose is a sensor pose associated with the representative image, and wherein preferably the representative image is the image that comprises the largest number of traffic participants.
  • the above-described aspect of the present invention has the advantage that by selecting an image with the most traffic participants as the representative image, the viewpoint corresponding to such representative image would most likely be the viewpoint from which most information of the scene at the side of the ego vehicle may be obtained.
  • a preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the at least one evaluation metric is based on a complexity of the scene at the side of the vehicle and one or more potential collision events.
  • the above-described aspect of the present invention has the advantage that using at least one evaluation metric based on the complexity of the scene surrounding the ego vehicle and one or more potential collision events allows a better evaluation of the driving of the ego vehicle in context, i.e., evaluating the driving based on how the vehicle is manoeuvred to prevent accidents or collisions while taking into account how complex the surroundings are.
  • a preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the at least one evaluation metric comprises one or more of: a number of potential collision events, a number of traffic participants, or a number of lane changes by other traffic participants.
  • the above-described aspect of the present invention has the advantage that the usage of evaluation metrics such as number of potential collision events, number of traffic participants, or number of lane changes by other traffic participants provide a better evaluation of the driving of the ego vehicle by considering how the vehicle is manoeuvred in response to the actions of other traffic participants.
  • a preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the at least one evaluation metric comprises a number of potential collision events, wherein a potential collision event is determined based on a time to near-collision for each traffic participant, wherein determining a number of potential collision events preferably comprises: receiving consecutive images associated with the representative sensor pose; predicting a time to near-collision for each traffic participant, more preferably using an image-based collision prediction model; and determining a number of potential collision events.
  • the above-described aspect of the present invention has the advantage that the evaluation of the driving of the vehicle is more accurate as the likelihood of a collision between the ego vehicle and another traffic participant is accounted for.
  • Such likelihood of a collision is accounted for based on a time to near-collision, wherein the time to near-collision is the predicted time at which at least one traffic participant is going to come within a predefined distance from the vehicle.
  • the shorter the time to near-collision, the more unsafe the driving of the vehicle, as it indicates that the vehicle is driving in close proximity to other traffic participants.
  • the usage of an image-based collision prediction model is more preferred as it only requires images as input to predict time to near-collision.
  • a preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the at least one evaluation metric comprises a number of lane changes by other traffic participants, wherein determining a number of lane changes by other traffic participants preferably comprises: receiving consecutive images associated with the representative sensor pose; predicting a lane change behaviour for each traffic participant, more preferably using an image-based lane change model; and determining a number of lane changes by other traffic participants.
  • the above-described aspect of the present invention has the advantage that the evaluation of the driving of the vehicle is more accurate as the behaviour of other traffic participants is accounted for. Such behaviour of other traffic participants is accounted for based on the number of lane changes. In general, the higher the number of lane changes by other traffic participants, the more complex the scene and the more difficult it is for the vehicle to navigate. The usage of an image-based lane change model is more preferred as it only requires images as input to predict lane change behaviour.
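  • As a sketch only, counting lane changes from per-frame lane assignments could look as follows; the `lane_per_frame` input format and the track ids are assumptions, and the lane assignments themselves would come from an image-based lane change model that is not shown here.

```python
from typing import Dict, Sequence

def count_lane_changes(lane_per_frame: Dict[int, Sequence[int]]) -> int:
    """Count lane changes across all tracked traffic participants.

    `lane_per_frame` maps a participant track id to its lane index in each of
    the consecutive images associated with the representative sensor pose."""
    changes = 0
    for lanes in lane_per_frame.values():
        # A lane change is counted whenever the lane index differs between
        # two consecutive frames for the same participant.
        changes += sum(1 for a, b in zip(lanes, lanes[1:]) if a != b)
    return changes

# Example: participant 0 changes lane once, participant 1 twice.
print(count_lane_changes({0: [1, 1, 2, 2], 1: [3, 2, 2, 3]}))  # 3
```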
  • the invention also relates to a computer-implemented method for evaluation of the driving of an autonomous or a non-autonomous driving vehicle, the method comprising: carrying out the computer-implemented method of any of the preceding claims on each side of the vehicle to determine at least one evaluation metric for each side of the vehicle; and generating at least one aggregated evaluation metric.
  • a preferred method of the present invention is a computer-implemented method as described above, further comprising: generating a control signal to control the vehicle based on the at least one evaluation metric and/or at least one aggregated evaluation metric, wherein the control signal is preferably configured to: generate an alert for a driver of the vehicle; generate an alert for other traffic participants; stop and/or adjust driving functions of the vehicle; and/or adjust an autonomous driving function controlling the vehicle.
  • the control signal may be configured to control the vehicle to address any unsafe driving behaviour indicated by the at least one evaluation metric and/or at least one aggregated evaluation metric to increase the overall driving safety on the roads.
  • the invention also relates to a system comprising at least one plurality of sensors, at least one processor and a memory that stores executable instructions for execution by the at least one processor, the executable instructions comprising instructions for performing the computer-implemented method of the invention.
  • the invention also relates to a vehicle comprising the system of the invention, wherein the system comprises four pluralities of sensors mounted on the vehicle and positioned to capture a scene surrounding the vehicle.
  • the above-described advantageous aspects of a computer-implemented method, system, or vehicle of the invention also hold for all aspects of a below-described computer program, a machine-readable storage medium, or a data carrier signal of the invention. All below-described advantageous aspects of a computer program, a machine-readable storage medium, or a data carrier signal of the invention also hold for all aspects of an above-described computer-implemented method, system, or vehicle of the invention.
  • the invention also relates to a computer program, a machine-readable storage medium, or a data carrier signal that comprises instructions, that upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of a computer-implemented method according to the invention.
  • the machine-readable medium may include any medium and/or mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device).
  • the machine-readable medium may be any medium, such as for example, read-only memory (ROM); random access memory (RAM); a universal serial bus (USB) stick; a compact disc (CD); a digital video disc (DVD); a data storage device; a hard disk; electrical, acoustical, optical, or other forms of propagated signals (e.g., digital signals, data carrier signal, carrier waves), or any other medium on which a program element as described above can be transmitted and/or stored.
  • the term “sensor” includes any sensor that detects or responds to some type of input from a perceived environment or scene. Examples of sensors include cameras, video cameras, LiDAR sensors, radar sensors, depth sensors, light sensors, colour sensors, or red, green, blue, and distance (RGBD) sensors.
  • sensor data means the output or data of a device, also known as a sensor, that detects and responds to some type of input from the physical environment.
  • the term “scene” refers to a distinct physical environment that may be captured by one or more sensors.
  • a scene may include one or more objects that may be captured by one or more sensors, whether such object is stationary, static, or mobile.
  • the term “vehicle” refers to any mobile agent capable of movement, including cars, trucks, buses, agricultural machines, forklifts, and robots, whether or not such mobile agent is capable of carrying or transporting goods, animals, or humans, and whether or not such mobile agent is driven by a human or is an autonomous mobile agent.
  • the term “ego vehicle” refers to the vehicle which is of primary interest in the evaluation of driving, whether such vehicle is driven by a human or by an autonomous driving system.
  • FIG. 1 is a schematic illustration of a system for evaluation of the driving of an autonomous or non-autonomous driving vehicle, in accordance with embodiments of the present disclosure
  • FIG. 2 is a flowchart of a method for evaluation of the driving of an autonomous or non-autonomous driving vehicle, in accordance with embodiments of the present disclosure
  • FIG. 3 is a schematic illustration of a method of generating at least one novel image of the scene on the side of the vehicle, in accordance with embodiments of the present disclosure
  • FIG. 4 is a schematic illustration of a method of generating a representation of a scene, in accordance with embodiments of the present disclosure
  • FIG. 5 is a schematic illustration of a method of selecting a representative sensor pose for the side of the vehicle, in accordance with embodiments of the present disclosure
  • FIG. 6 is a schematic illustration of a method of determining the evaluation metric of number of potential collision events, in accordance with embodiments of the present disclosure
  • Fig. 7 is a schematic illustration of a method of determining the evaluation metric of number of lane changes by other traffic participants, in accordance with embodiments of the present disclosure
  • FIG. 8 is a schematic illustration of the method for evaluation of the driving of an autonomous or non-autonomous driving vehicle, in accordance with embodiments of the present disclosure
  • FIG. 9 is a schematic illustration of a method of generating a control signal, in accordance with embodiments of the present disclosure.
  • Fig. 10 is a schematic illustration of a method of generating a total score, in accordance with embodiments of the present disclosure.
  • the present disclosure is directed to methods, systems, vehicles, computer programs, machine-readable storage media, and data carrier signals for evaluation of the driving of a vehicle.
  • Evaluation of the driving of the vehicle is carried out by the usage of sensors to analyse the complexity of the scene surrounding the vehicle, wherein such complexity of the scene surrounding the vehicle is accounted for when evaluating the driving of the vehicle.
  • a representation of the scene surrounding the vehicle and novel images of the scene surrounding the vehicle may be generated in order to get a more comprehensive understanding and subsequent evaluation of the scene surrounding the vehicle.
  • the evaluation results may be aggregated over a journey for an even more comprehensive evaluation of the driving of the vehicle.
  • although the terms “first,” “second,” etc. may be used to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
  • a first plurality of sensors could be termed a second plurality of sensors, and, similarly, a third plurality of sensors could be termed a first plurality of sensors, without departing from the scope of the various described embodiments.
  • the first plurality of sensors, the second plurality of sensors, the third plurality of sensors, and the fourth plurality of sensors are all sets of sensors, but they are not the same plurality of sensors.
  • the illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that on-going technological development will change the manner in which particular functions are performed.
  • Fig. 1 is a schematic illustration of a system for evaluation of the driving of an autonomous or non-autonomous driving vehicle, in accordance with embodiments of the present disclosure.
  • System 100 for evaluation of the driving of a vehicle may comprise a plurality of sensors 108 and at least one processor 116.
  • the plurality of sensors 108 may be mounted on the vehicle and positioned to capture a scene on a side of the vehicle.
  • the plurality of sensors 108 may capture at least a 180° field of view of the side of the vehicle.
  • each sensor of the plurality of sensors 108 may have an overlapping field of view with a neighbouring sensor.
  • each sensor of the plurality of sensors 108 may have a field of view of at least 60°.
  • the plurality of sensors 108 may be mounted on a front of the vehicle, a rear of the vehicle, a left of the vehicle, or a right of the vehicle.
  • the plurality of sensors 108 may comprise visible light sensors to generate a plurality of images 124.
  • a visible light sensor captures information relating to the colour of objects in a scene and generates images of a scene.
  • the visible light sensor may be a camera or video camera configured to capture a plurality of images 124.
  • the plurality of sensors 108 may further comprise depth sensors to generate depth data 132 corresponding to the plurality of images 124.
  • a depth sensor captures information relating to the distances of surfaces of objects and the ground in a scene.
  • the depth sensor may be an infrared or lidar camera or video camera.
  • the plurality of sensors 108 may comprise red, green, blue, depth (RGBD) cameras which capture both visible light information and corresponding depth information to generate the plurality of images 124 with corresponding depth data 132.
  • the plurality of sensors 108 may comprise red, green, blue (RGB) cameras which capture visible light information to generate the plurality of images 124, and infrared or lidar cameras which capture depth information to generate depth data 132, and each of the plurality of images 124 may be associated with a corresponding depth data 132.
  • the plurality of sensors 108 may capture the plurality of images 124 and/or corresponding depth data 132 continuously over a period of time or over a journey.
  • the plurality of sensors 108 may capture the plurality of images 124 and/or corresponding depth data 132 at fixed time intervals over a period of time or over a journey.
  • each image of the plurality of images 124 and corresponding depth data 132 may be associated with a timestamp corresponding to the timepoint at which such image of the plurality of images 124 and/or corresponding depth data 132 was captured.
  • the timepoint indicated by the timestamp may be the time elapsed since the vehicle started to move.
  • the timepoint indicated by the timestamp may be a point in time as measured in hours past midnight or noon.
  • each image of the plurality of images 124 and its corresponding depth data 132 may each be associated with a sensor pose, wherein the sensor pose corresponds to the pose (position and orientation) of the specific sensor that captured such image of the plurality of images 124 and corresponding depth data 132.
  • each image of the plurality of images 124 and its corresponding depth data 132 may each be associated with one or more sensor parameters, wherein the one or more sensor parameters correspond to the parameters of the specific sensor that captured such image of the plurality of images 124 and corresponding depth data 132.
  • the plurality of images 124 and/or corresponding depth data 132, with the associated timestamp, sensor pose and/or one or more sensor parameters, may be received and processed by the at least one processor 116.
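  • The associations described above could be represented, purely as an illustration, by a container such as the following; the field names are assumptions made for this sketch and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class CapturedFrame:
    """One captured image and the data associated with it."""
    image: object                   # visible-light image (one of the plurality of images 124)
    depth: Optional[object]         # corresponding depth data 132, if available
    timestamp: float                # e.g. time elapsed since the vehicle started to move
    sensor_pose: Tuple[float, ...]  # position and orientation of the capturing sensor
    intrinsics: Tuple[float, ...]   # sensor parameters of the capturing sensor
    side: str                       # "front", "rear", "left" or "right"
```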
  • the at least one processor 116 may receive the plurality of images 124 and/or corresponding depth data 132 from the plurality of sensors 108.
  • the plurality of images 124 and corresponding depth data 132 from the plurality of sensors 108 may be stored in a database 140 and may be retrieved from the database 140 to be processed by the at least one processor 116.
  • the at least one processor 116 may comprise several modules.
  • the at least one processor 116 may comprise an image processing module 148.
  • the image processing module 148 may be configured to receive the plurality of images 124 and corresponding depth data 132 and generate at least one novel image 156 of the scene on the side of the vehicle.
  • the at least one novel image 156 may comprise one, some, or all possible viewpoints of the scene to get perspectives of unseen poses or dimensions of the scene at the side of the ego vehicle.
  • the at least one processor 116 may comprise a representative sensor pose selection module 164.
  • the representative sensor pose selection module 164 may be coupled to the image processing module 148, for example, by manner of one or both of wired coupling and wireless coupling.
  • the representative sensor pose selection module 164 may receive the plurality of images 124 and/or at least one novel image 156 generated by the image processing module 148 to select a representative sensor pose for the side of the vehicle based at least on the plurality of images 124 and/or at least one novel image 156.
  • the at least one processor 116 may comprise an evaluation module 172.
  • the evaluation module 172 may be coupled to the representative sensor pose selection module 164 for example, by manner of one or both of wired coupling and wireless coupling. In some embodiments, the evaluation module 172 may determine at least one evaluation metric based on at least one image associated with the representative sensor pose selected by the representative sensor pose selection module 164. In some embodiments, the at least one image associated with the representative sensor pose may comprise the plurality of images 124 captured by the plurality of sensors 108. In some embodiments, the at least one image associated with the representative sensor pose may comprise the at least one novel image 156 generated by image processing module 148.
  • the at least one image associated with the representative sensor pose may comprise a combination of the plurality of images 124 captured by the plurality of sensors 108 and the at least one novel image 156 generated by image processing module 148.
  • the evaluation module 172 may generate at least one aggregated evaluation metric based on the at least one evaluation metric determined for all four sides of the vehicle. In some embodiments, the evaluation module 172 may monitor the at least one evaluation metric and/or at least one aggregated evaluation metric over a period of time.
  • the evaluation of driving of the vehicle may be communicated via an interactive dashboard (e.g., an electronic dashboard module) which can, for example, facilitate the provision of a report and/or feedback to a driver or a manufacturer of the vehicle. It is further contemplated that, in accordance with an embodiment of this disclosure, the evaluation of the driving of the vehicle can, for example, be communicated to a third-party service provider and/or other persons of interest.
  • the at least one processor 116 may comprise a control module 180 that is configured to generate a control signal for the vehicle based on the at least one evaluation metric and/or at least one aggregated evaluation metric determined by evaluation module 172.
  • the control module 180 may be coupled to the evaluation module 172 for example, by manner of one or both of wired coupling and wireless coupling.
  • the control signal may be adapted to control the vehicle.
  • the control signal may be generated when the at least one evaluation metric and/or at least one aggregated evaluation metric indicates an unsafe driving behaviour.
  • the control signal may be configured to generate an alert for a driver of the vehicle.
  • the alert for the driver may take any form, including visual, auditory, tactile, or any combination thereof. Examples include an audible alarm or voice notification, a visible notification on a dashboard display, or a vibration.
  • the control signal may be configured to generate an alert for other traffic participants.
  • the alert for the other traffic participants may take any form, including visual, auditory, or any combination thereof. Examples include the activation of the horn, or the flashing of the headlights.
  • the control signal may be configured to stop and/or adjust driving functions of the vehicle.
  • the control signal may be configured to stop driving functions of the vehicle and switch to an autonomous driving function.
  • the control signal may be configured to limit the maximum speed at which the vehicle may travel.
  • the control signal may be configured to adjust an autonomous driving function controlling the vehicle.
  • each of the modules described may be an integrated software-hardware based module (e.g., an electronic part which can carry a software program/algorithm in association with receiving and processing functions/an electronic module programmed to perform the functions of receiving, processing and/or transmitting).
  • the present disclosure yet further contemplates the possibility that each of the modules described can be an integrated hardware module (e.g., a hardware-based transceiver) capable of performing the functions of receiving, processing and/or transmitting.
  • Fig. 2 is a flowchart of a method for evaluation of the driving of an autonomous or non-autonomous driving vehicle, in accordance with embodiments of the present disclosure.
  • Computer-implemented method 200 for evaluation of the driving of a vehicle may be implemented by a data processing device on any architecture and/or computing system.
  • various architectures employing, for example, multiple integrated circuit (IC) chips and/or packages, and/or various computing devices and/or consumer electronic (CE) devices such as multi-function devices, tablets, smart phones, etc., may implement the techniques and/or arrangements described herein.
  • Method 200 may be stored as executable instructions that, upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of method 200.
  • method 200 for evaluation of the driving of a vehicle may be implemented by system 100.
  • method 200 for evaluation of the driving of a vehicle may commence at operation 208, wherein a plurality of images 124 captured by a plurality of sensors 108 mounted on a vehicle and positioned to capture a scene on a side of the vehicle is received.
  • the plurality of images 124 may be received from the plurality of sensors 108, for example, by manner of one or both of wired coupling and wireless coupling.
  • the plurality of images 124 may be stored on and retrieved from database 140, for example, by manner of one or both of wired coupling and wireless coupling.
  • the plurality of sensors 108 may capture at least a 180° field of view of the side of the vehicle which the plurality of sensors 108 is mounted on. In some embodiments, each sensor of the plurality of sensors 108 may have an overlapping field of view with a neighbouring sensor. In some embodiments, each sensor of the plurality of sensors 108 may have a field of view of at least 60°. In some embodiments, the plurality of images 124 may be captured by the plurality of sensors 108 over a period of time. In some embodiments, the plurality of images 124 may be captured by the plurality of sensors 108 at the same timepoint.
  • the plurality of sensors 108 may be mounted on a front of the vehicle, a rear of the vehicle, a left of the vehicle, or a right of the vehicle. In some embodiments, there may be two or more pluralities of sensors 108 mounted on the vehicle. In some embodiments, there may be four pluralities of sensors 108 mounted on the vehicle: a first plurality of sensors 108 mounted on a front of the vehicle, a second plurality of sensors 108 mounted on a right of the vehicle, a third plurality of sensors 108 mounted on a left of the vehicle, and a fourth plurality of sensors 108 mounted on a rear of the vehicle.
  • method 200 may be carried out for each of the first, second, third, and fourth plurality of sensors 108.
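  • A sketch of how method 200 could be applied to each of the four sides and the results aggregated; summing the per-side metrics is an assumption made for this example, since the aggregation rule is not fixed here, and the `evaluate_side` callable is the hypothetical per-side evaluation shown earlier.

```python
from typing import Callable, Dict, List

def evaluate_all_sides(
    images_by_side: Dict[str, List[object]],
    evaluate_side: Callable[[List[object]], Dict[str, float]],
) -> Dict[str, float]:
    """Run the per-side evaluation (method 200) on each side of the vehicle
    and aggregate the resulting metrics by summation."""
    aggregated: Dict[str, float] = {}
    for side in ("front", "right", "left", "rear"):
        metrics = evaluate_side(images_by_side.get(side, []))
        for name, value in metrics.items():
            aggregated[name] = aggregated.get(name, 0.0) + value
    return aggregated
```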
  • method 200 may comprise operation 216 wherein a representative sensor pose is selected for the side of the vehicle based at least on the plurality of images 124.
  • the representative sensor pose for the side of the vehicle is the sensor pose which is likely to provide the most information about the scene at such side of the ego vehicle.
  • operation 216 may be carried out by representative sensor pose selection module 164.
  • the representative sensor pose selected may be the pose of one of the sensors of the plurality of sensors.
  • the representative sensor pose selected may be an unseen sensor pose, which is a sensor pose that is not any of the poses of the sensors of the plurality of sensors.
  • the selection of a representative sensor pose may be based on the plurality of images 124 received in operation 208.
  • the selection of a representative sensor pose may be further based on at least one novel image 156 that may be generated in an operation 224. Operation 224 for the generation of at least one novel image 156 is discussed in further detail in relation to Figs. 3 and 4. Operation 216 for the selection of the representative sensor pose is discussed in further detail in relation to Fig. 5.
  • method 200 may comprise operation 232 wherein at least one evaluation metric is determined based on at least one image associated with the representative sensor pose selected in operation 216.
  • operation 232 may be carried out by evaluation module 172.
  • the at least one image associated with the representative sensor pose may comprise the plurality of images 124 received in operation 208, the at least one novel image 156 generated in operation 224, or some combination thereof.
  • the at least one evaluation metric may be any evaluation metric that may be indicative of the scene at the side of the ego vehicle, and/or any interaction of the ego vehicle with the scene at the side of the ego vehicle.
  • the at least one evaluation metric may be based on a complexity of the scene at the side of the vehicle and one or more potential collision events.
  • the at least one evaluation metric may comprise one or more of: a number of potential collision events, a number of traffic participants, or a number of lane changes by other traffic participants.
  • the determination of the evaluation metric of number of potential collision events is discussed in further detail in relation to Fig. 6, and the determination of the evaluation metric of number of lane changes by other traffic participants is discussed in further detail in relation to Fig. 7.
  • Table 1 below is an example of an output from operation 232, wherein the at least one evaluation metric comprises: a number of potential collision events, a number of traffic participants, and a number of lane changes by other traffic participants.
  • method 200 may comprise operation 240 wherein a control signal to control the vehicle is generated based on the at least one evaluation metric determined in operation 232.
  • operation 240 may be carried out by control module 180.
  • the control signal may be generated when the at least one evaluation metric indicates unsafe driving behaviour.
  • the control signal may be configured to control the vehicle.
  • the control signal may be configured to generate an alert for a driver of the vehicle, generate an alert for other traffic participants, stop and/or adjust driving functions of the vehicle, and/or adjust an autonomous driving function controlling the vehicle.
  • Fig. 3 is a schematic illustration of a method of generating at least one novel image of the scene on the side of the vehicle, in accordance with embodiments of the present disclosure.
  • method 300 of generating at least one novel image of the scene on the side of the vehicle may be carried out at operation 224 of method 200.
  • method 300 may be carried out by image processing module 148.
  • method 300 of generating at least one novel image of the scene on the side of the vehicle may comprise operation 308 wherein depth data 132 corresponding to the plurality of images 124 is received.
  • the corresponding depth data 132 may be received from the plurality of sensors 108, for example, by manner of one or both of wired coupling and wireless coupling.
  • the corresponding depth data 132 may be stored on and retrieved from database 140, for example, by manner of one or both of wired coupling and wireless coupling.
  • method 300 may comprise operation 316 wherein a representation of the scene on the side of the vehicle is generated based on the received plurality of images 124 and corresponding depth data 132, wherein the representation of the scene accounts for space and time.
  • the representation of the scene may be generated based on the received plurality of images 124 captured over a period of time.
  • the representation of the scene on the side of the vehicle may be generated using any known method used to generate scene representations that account for space and time.
  • the representation of the scene on the side of the vehicle may be generated using Neural Scene Flow Fields (NSFF), as disclosed in “Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes” by Li et. al., wherein the static and dynamic scene representations are approximated with multi-layer perceptrons (MLPs).
  • a neural network generally comprises an input layer comprising one or more input nodes, one or more hidden layers each comprising one or more hidden nodes, and an output layer comprising one or more output nodes.
  • An MLP is a type of neural network comprising a series of fully connected layers that connect every neuron in one layer to every neuron in the preceding and subsequent layer. The detailed architectures of the MLPs of the static and dynamic scene representations are illustrated in Figs. 2 and 3 respectively of the supplementary material of the paper by Li et. al.
  • the static scene representation may be a neural radiance field (NeRF) as disclosed and described in detail in “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis” by Mildenhall et. al.
  • the dynamic scene representation may thus be defined using the equation (c_i, σ_i) = F_dy(x, d, i), wherein the subscript i indicates a value at a specific time i.
  • Fig. 4 is a schematic illustration of a method of generating a representation of a scene, in accordance with embodiments of the present disclosure.
  • method 400 of generating a representation of the scene may be carried out at operation 316 of method 300.
  • method 400 may be carried out by image processing module 148.
  • the plurality of sensors comprises 5 sensors, although it is emphasized that any other number of sensors may be employed.
  • a different representation of a scene may be generated for each scene using the Adam optimizer with a learning rate of 0.0005.
  • Each representation of the scene comprises two networks, i.e., a static scene representation and a dynamic scene representation, and both networks are optimised simultaneously. It is emphasised that method 400 is only an example of a method of generating a representation of a scene and other appropriate methods may be employed.
  • method 400 may be carried out with the plurality of images 124 and corresponding depth data 132 as input.
  • the received plurality of images 124 and corresponding depth data 132 may be captured sequentially by the plurality of sensors.
  • the plurality of images and corresponding data may be captured in a sequence from a left-most sensor to a right-most sensor of the plurality of sensors, or vice versa.
  • the images and corresponding data may be captured in sequence with a time difference of between 0.5 and 1 milliseconds.
  • Each of the plurality of images 124 and corresponding depth data 132 is associated with a sensor pose (position and orientation) and intrinsic parameters of the specific sensor that captured such image and corresponding depth data.
  • Each of the plurality of images 124 and corresponding depth data 132 may also be associated with a timestamp corresponding to the time at which such image and corresponding depth data is captured.
  • method 400 may comprise operation 408, wherein a static scene representation 416 is generated.
  • the static scene representation 416 may be a 5D scene representation comprising a multilayer perceptron.
  • the static scene representation 416 may be a 5D scene representation approximated with two multilayer perceptron networks, which may be termed as a coarse network and a fine network, wherein both multilayer perceptron networks may have the same architecture.
  • operation 408 of generating static scene representation 416 may comprise the following steps: a) Sample a batch of camera rays from the set of all pixels in the received plurality of images 124.
  • A hierarchical sampling procedure may be used to query N_c samples from the coarse network and N_c + N_f samples from the fine network.
  • b) Pass the positional encoding of the input 3D location x to 8 fully-connected layers (using ReLU activations and 256 channels per layer), to output the volume density σ and a 256-dimensional feature vector.
  • c) Concatenate the feature vector from the preceding step with the camera ray’s viewing direction and pass the concatenated vector to one additional fully-connected layer (using a ReLU activation and 128 channels) to output the view-dependent emitted colour c.
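  • Steps a) to c) above could be realised, as a non-authoritative sketch, with a PyTorch module along the following lines; the positional-encoding frequencies are illustrative assumptions, and the skip connections of the original NeRF architecture are omitted for brevity.

```python
import torch
import torch.nn as nn

def positional_encoding(x: torch.Tensor, n_freqs: int) -> torch.Tensor:
    """Map each coordinate to [x, sin(2^k x), cos(2^k x)] for k = 0..n_freqs-1."""
    out = [x]
    for k in range(n_freqs):
        out.append(torch.sin((2.0 ** k) * x))
        out.append(torch.cos((2.0 ** k) * x))
    return torch.cat(out, dim=-1)

class StaticNeRF(nn.Module):
    """Sketch of the static scene MLP: 8 fully connected 256-channel ReLU layers
    producing the density and a 256-d feature, then a 128-channel layer that
    takes the viewing direction and a final layer emitting the view-dependent colour."""

    def __init__(self, n_freq_xyz: int = 10, n_freq_dir: int = 4):
        super().__init__()
        in_xyz = 3 * (1 + 2 * n_freq_xyz)
        in_dir = 3 * (1 + 2 * n_freq_dir)
        layers = [nn.Linear(in_xyz, 256), nn.ReLU()]
        for _ in range(7):                      # 8 fully connected layers in total
            layers += [nn.Linear(256, 256), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        self.sigma = nn.Linear(256, 1)          # volume density
        self.feature = nn.Linear(256, 256)      # 256-dimensional feature vector
        self.colour = nn.Sequential(nn.Linear(256 + in_dir, 128), nn.ReLU(),
                                    nn.Linear(128, 3))  # view-dependent RGB
        self.n_freq_xyz, self.n_freq_dir = n_freq_xyz, n_freq_dir

    def forward(self, xyz: torch.Tensor, viewdir: torch.Tensor):
        h = self.trunk(positional_encoding(xyz, self.n_freq_xyz))
        sigma = self.sigma(h)
        feat = self.feature(h)
        d = positional_encoding(viewdir, self.n_freq_dir)
        rgb = self.colour(torch.cat([feat, d], dim=-1))
        return rgb, sigma
```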
  • method 400 may comprise operation 424, wherein a dynamic scene representation 432 is generated.
  • the plurality of images 124 and their corresponding depth data 132 may be ordered in sequence, wherein the ordering sequence represents time i.
  • the plurality of images 124 and corresponding depth data 132 may be ordered based on the sequence in which such images and corresponding depth data were captured.
  • the final dataset used to generate the dynamic representation of the scene thus comprises the images, the corresponding sensor poses, the corresponding depth data, intrinsic parameters, and the time.
  • operation 424 of the generation of dynamic scene representation 432 may comprise the following steps: a) Sample a batch of camera rays from the set of all pixels in the received plurality of images 124.
  • method 400 may comprise operation 440 wherein a combined volume rendered image 448 is generated.
  • operation 440 may comprise combining volume rendered images from the static scene representation 416 and dynamic scene representation 432.
  • operation 440 may comprise performing a linear combination of static scene components from static scene representation 416 and dynamic scene components from dynamic scene representation 432.
  • method 400 may comprise operation 456 wherein a total loss 464 is calculated based on the combined volume rendered image 448.
  • operation 456 may comprise combining, with weights, a combined static and dynamic loss (L_cb), a weighted temporal photometric loss (L_pho), a cycle consistency loss (L_cyc), a geometric consistency loss (L_geo), and a single-view depth loss (L_data).
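  • As a sketch, the weighted combination of these losses could be written as follows; the weight values are placeholders chosen for this example and are not the weights used in the patent or in the paper by Li et. al.

```python
def total_loss(l_cb, l_pho, l_cyc, l_geo, l_data,
               b_pho=1.0, b_cyc=1.0, b_geo=0.1, b_data=0.1):
    """Weighted sum of the combined static/dynamic, photometric, cycle
    consistency, geometric consistency, and single-view depth losses.
    The weights b_* are illustrative placeholders."""
    return l_cb + b_pho * l_pho + b_cyc * l_cyc + b_geo * l_geo + b_data * l_data
```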
  • the geometric consistency loss may be computed using pre-trained networks such as FlowNet 2.0, available at https://github.com/lmb-freiburg/flownet2, or RAFT, available at https://github.com/princeton-vl/RAFT.
  • the single-view depth loss may be computed using a pre-trained single-view depth network.
  • an example of a pre-trained single-view depth network is the MiDAS network as disclosed in “Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer” by Ranftl et. al., pre-trained on the KITTI dataset available at https://www.cvlibs.net/datasets/kitti/.
  • the MiDAS network may be trained on the KITTI dataset which comprises 93,000 images split into a training and a test dataset with the ratio 80:20.
  • the MiDAS network uses a ResNet-based architecture with pretrained ImageNet weights, wherein an example of the ResNet may be found at least in Figure 3 of “Deep Residual Learning for Image Recognition” by He et. al.
  • the MiDAS network uses additional scale- and shift-invariant losses when training the ResNet architecture.
  • the details of the training of the MiDAS network may be found at least in Section 6 of the paper by Ranftl et. al. and additional detail on the scale- and shift-invariant losses may be found at least in Section 5 of the same paper.
  • the MiDAS network may be trained on the KITTI dataset with a depth cap of 80 meters for 100 epochs with a batch size of 24.
  • method 400 may comprise operation 472 wherein the static scene representation 416 and dynamic scene representation 432 are optimized based on the total loss 464 calculated in operation 456.
  • the static scene representation 416 and dynamic scene representation 432 may be refined and adjusted over multiple iterations, wherein each iteration may comprise a cycle comprising operations 440, 456, and 472.
  • the static scene representation 416 and dynamic scene representation 432 may be optimized over at least 1,000,000 iterations, wherein the first 1000N iterations are denoted as an initialization stage to warm up the optimization, wherein N is the number of training views.
  • the scene flow window size may be initialized as 3, wherein j ∈ {i, i ± 1}.
  • method 300 may comprise operation 324 wherein at least one novel image 156 is generated of the scene based on the scene representation generated in operation 316.
  • volume rendering techniques may be carried out on the static scene representation 416 and the dynamic scene representation 432, and the output of such volume rendering may be combined by performing a linear combination of static and dynamic scene components to generate at least one novel image 156.
  • the at least one novel image 156 generated may be based on one, some, or all possible viewpoints of the scene to get perspectives of unseen poses or dimensions to increase the comprehensiveness of the evaluation of the driving of the vehicle.
  • the at least one novel image 156 may be generated based on space-time interpolation based on predefined poses.
  • space-time interpolation may be performed using a splatting-based plane-sweep volume tracing approach.
  • to render a novel image at an intermediate time δt ∈ (0,1) at a specified novel viewpoint, every ray emitted from the novel viewpoint is swept from front to back, and at each sampled step t along the ray, point information is queried through the static scene representation 416 and dynamic scene representation 432 at both times i and i + 1, and the 3D points at time i are displaced by the scaled scene flow x_i + δt · f_i→i+1(x_i), and similarly for time i + 1.
  • the 3D displaced points may then be splatted onto a (c, σ) accumulation buffer at the novel viewpoint, and splats may be blended from time i and i + 1 with linear weights 1 - δt and δt, wherein the final rendered view is obtained by volume rendering the accumulation buffer. Additional information on the splatting-based plane-sweep volume tracing approach may be found at least in Section 3.4 of the paper by Li et. al., and Section 4 of the associated supplementary material.
  • Fig. 5 is a schematic illustration of a method of selecting a representative sensor pose for the side of the vehicle, in accordance with embodiments of the present disclosure.
  • method 500 of selecting a representative sensor pose for the side of the vehicle may be carried out at operation 216 of method 200.
  • method 500 may be carried out by representative sensor pose selection module 164.
  • method 500 may comprise operation 508 wherein a plurality of traffic participants are identified in the plurality of images 124 and/or at least one novel image 156.
  • the identification of traffic participants in operation 508 may be carried out by using an object detection algorithm or model on each image to identify all objects present in each image, and filtering the results obtained.
  • An object detection algorithm or model generally performs image recognition tasks by taking an image as input and then predicting bounding boxes and class probabilities for each object in the image. Any known object detection model or algorithm may be used, including convolutional neural networks (CNN), single-shot detectors or support vector machines. Most object detection models use a CNN to extract features from an input image to predict the probability of learned classes.
  • CNN convolutional neural networks
  • an object detection model that may be employed is the You Only Look Once v7 (YOLOv7) algorithm as disclosed in “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors” by Wang et. al., wherein information on the architecture of YOLOv7 may be found at least in Section 3, and information on the training of YOLOv7 may be found at least in Section 5.
  • the object detection model used to detect or predict one or more objects within each image of the plurality of images 124 and/or at least one novel image 156 may be the pretrained YOLOv7 model available at https://github.com/WongKinYiu/yolov7.
  • method 500 may comprise operation 516 wherein a representative image is selected, wherein the representative sensor pose is a sensor pose associated with the representative image.
  • the representative image may be selected from the plurality of images 124 and/or at least one novel image 156.
  • the representative image may be selected based on the number of traffic participants detected in each image, wherein the representative image is preferably the image with the largest number of traffic participants detected in operation 508.
  • where two or more images comprise the same largest number of traffic participants, the representative image may be selected randomly from such two or more images. Once the representative image is selected, the sensor pose associated with such representative image is selected as the representative sensor pose.
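  • Operations 508 and 516 could be sketched as follows; the `detect` callable stands in for an object detection model such as YOLOv7, and the class filter used to identify traffic participants is an assumption made for this example.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Illustrative filter; the patent does not enumerate traffic-participant classes.
TRAFFIC_CLASSES = {"car", "truck", "bus", "motorcycle", "bicycle", "person"}

def select_representative(
    images: Sequence[object],
    poses: Sequence[Tuple[float, ...]],
    detect: Callable[[object], List[str]],
) -> Tuple[object, Tuple[float, ...]]:
    """Pick the image (and its associated sensor pose) containing the most
    traffic participants; ties are broken randomly as described above.
    `detect` returns the predicted class label of every detected object."""
    counts = [sum(1 for label in detect(im) if label in TRAFFIC_CLASSES) for im in images]
    best = max(counts)
    candidates = [i for i, c in enumerate(counts) if c == best]
    chosen = random.choice(candidates)
    return images[chosen], poses[chosen]
```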
  • Fig. 6 is a schematic illustration of a method of determining the evaluation metric of number of potential collision events, in accordance with embodiments of the present disclosure.
  • the number of potential collision events may be determined using any known collision prediction algorithms.
  • the number of potential collision events may be determined using a collision detection algorithm configured to receive image data and sensor data as input for depth scene analysis to localise and predict the relative movements of traffic participants to the vehicle. For example, when the representative sensor pose corresponds to the pose of one of the plurality of sensors, the image and corresponding depth data captured by the specific sensor may be used as input to such collision detection algorithm.
  • the number of potential collision events may be determined by a model configured to receive image data to identify traffic participants and predict a risk estimator based on temporal cues.
  • Such models use the power of large data to predict potential collisions, without the need of relying on explicit geometric depth estimates or velocity information. It is emphasised that method 600 illustrated in Fig. 6 is an example of a method of determining the evaluation metric of number of potential collision events, and other methods may be used.
  • method 600 of determining the evaluation metric of number of potential collision events may be carried out in operation 232 of method 200. In some embodiments, method 600 of determining the evaluation metric of number of potential collision events may be carried out by evaluation module 172.
  • method 600 may comprise operation 608 wherein consecutive images associated with the representative sensor pose are received. In some embodiments, when analysing the number of potential collision events at a current timepoint, the consecutive images may be images associated with previous or preceding timepoints. In some embodiments, the consecutive images may be within 0.5 to 1 milliseconds of each other. Where the representative sensor pose corresponds to the sensor pose of one of the sensors of the plurality of sensors, the consecutive images may be obtained from such sensor.
  • the consecutive images may be generated using the representation of the scene generated in operation 316 of method 300, in particular, by performing fixed-view time interpolation, wherein the fixed view corresponds to the representative sensor pose. Additional information on performing fixed-view time interpolation may be found at least in Section 3.4 of the paper by Li et. al., and Section 4 of the associated supplementary material.
  • method 600 may comprise operation 616 wherein a time to near-collision is predicted for each traffic participant.
  • the time to near-collision is the predicted time at which at least one traffic participant is going to come within a predefined distance from the vehicle.
  • the predefined distance may be 1 metre.
  • a collision prediction model may be employed in operation 616.
  • an image-based collision prediction model may be employed in operation 616.
  • the image-based collision prediction model may comprise a convolutional neural network (CNN).
  • a convolutional neural network (CNN) is a multi-layered feed-forward neural network, made by stacking many hidden layers on top of each other in sequence. The sequential design may allow CNNs to learn hierarchical features.
  • the hidden layers are typically convolutional layers followed by activation layers, some of them followed by pooling layers.
  • the CNN may be configured to identify patterns in data.
  • the convolutional layer may include convolutional kernels that are used to look for patterns across the input data.
  • the convolutional kernel may return a large positive value for a portion of the input data that matches the kernel’s pattern or may return a smaller value for another portion of the input data that does not match the kernel’s pattern.
  • a CNN is preferred as it may extract informative features from the training data without the need for manual processing of the training data. Also, a CNN is computationally efficient, as it is able to assemble patterns of increasing complexity using the relatively small kernels in each hidden layer.
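  • For illustration only, the following is a minimal sketch of such a stack of convolutional, activation, and pooling layers in PyTorch; the layer sizes, kernel sizes, and class count are placeholder assumptions rather than values prescribed by the present disclosure.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Minimal CNN: convolution -> activation -> pooling blocks, then a classifier."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # small kernels look for local patterns
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling reduces spatial resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layers assemble more complex patterns
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Example: a single 224 x 224 RGB frame.
frame = torch.randn(1, 3, 224, 224)
logits = SmallCNN()(frame)
```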
  • An example of an image-based collision prediction model that may be employed is the multi-stream VGG-16 architecture disclosed in “Forecasting Time-to-Collision from Monocular Video: Feasibility, Dataset, and Challenges” by Manglik et al.
  • each stream takes a 224 x 224 RGB frame as input to extract spatial features, which are then concatenated across all frames preserving the temporal order and then fed into a fully-connected layer to output time to collision. Additional details of the architecture may be found at least in Section IV.B and Fig. 4 of the paper by Manglik et al.
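  • The following is a simplified, non-authoritative sketch of the multi-stream idea described above: each 224 x 224 RGB frame passes through a shared VGG-16 feature extractor, the per-frame features are concatenated in temporal order, and a fully-connected head regresses a time to collision. The pooling and head sizes are simplifications assumed for illustration; the exact architecture is given in the paper by Manglik et al.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class MultiStreamTTC(nn.Module):
    """Shared VGG-16 features per frame, concatenated in temporal order, regressed to a time to collision."""
    def __init__(self, num_frames: int = 6):
        super().__init__()
        backbone = vgg16(weights=None)      # in practice initialised with ImageNet pre-trained weights
        self.features = backbone.features   # convolutional feature extractor shared across all streams
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.regressor = nn.Sequential(
            nn.Linear(num_frames * 512, 256),
            nn.ReLU(),
            nn.Linear(256, 1),              # predicted time to collision in seconds
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, 224, 224), ordered oldest to newest
        feats = [self.pool(self.features(frames[:, t])).flatten(1) for t in range(frames.shape[1])]
        return self.regressor(torch.cat(feats, dim=1)).squeeze(1)

ttc = MultiStreamTTC()(torch.randn(2, 6, 3, 224, 224))   # tensor of shape (2,)
```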
  • the image-based collision prediction model may be trained on a near-collision dataset.
  • the near-collision dataset may comprise labelled sensor data recorded by sensors mounted on a data collection vehicle.
  • the sensor data of the training dataset may be generated by sensors mounted on all four sides of a data collection vehicle driven through multiple road environments and situations (e.g., roads and highways), during multiple time situations (e.g., peak hours, non-peak hours, morning, and night), different traffic participant densities, and different weather conditions (e.g., spring, summer, autumn, winter, rainy, sunny, and cloudy).
  • the sensor data may comprise image data captured by image sensor(s) and depth data captured by depth sensor(s) (e.g., LIDAR) for automatic ground truth label generation.
  • the sensors may be initially calibrated with correspondences. The calibration allows accurate projection of LIDAR points onto images to obtain estimated depth values for the image-based traffic participant detection.
  • YOLOv7 may be used to identify and generate bounding boxes around traffic participants in the images.
  • the 3D position of each detected bounding box may be obtained by computing a median distance of the pixels using the 3D point cloud.
  • Each image may be annotated with a binary label where a positive label indicates the presence of at least one traffic participant within a predefined distance (e.g., one metre) from the sensor.
  • a time to near-collision may be calculated for each traffic participant based on a short temporal history of a few image frames. For example, when using a tuple of N consecutive image frames (I1, I2, ..., IN) recorded at a frame rate of 10 fps as history, estimation of proximity over the next 6 seconds may be carried out by looking at the next 60 binary labels in the future, annotated as {label_n+1, label_n+2, ..., label_n+60}.
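  • As an illustration of how a ground-truth time to near-collision could be derived from the future binary labels described above, a minimal sketch is given below; the function name is illustrative and the 10 fps frame rate follows the example.

```python
from typing import Optional, Sequence

def time_to_near_collision(future_labels: Sequence[int], fps: float = 10.0) -> Optional[float]:
    """Given the next 60 binary labels (1 = a traffic participant within the predefined
    distance, e.g. one metre), return the time in seconds until the first positive label,
    or None if no near-collision occurs within the horizon."""
    for i, label in enumerate(future_labels):
        if label == 1:
            return (i + 1) / fps
    return None

# Example: the first positive label appears at the 26th future frame, i.e. 2.6 s at 10 fps.
labels = [0] * 25 + [1] + [0] * 34
assert time_to_near_collision(labels) == 2.6
```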
  • the dataset may comprise 15,000 video sequences, split into 12,000 for training set and 3,000 for test set.
  • the count of data points for each time-to-collision interval should preferably be balanced.
  • the time-to-collision interval may vary between 0 - T seconds.
  • T may be set to 6, which results in the intervals 0-1, 1-2, 2-3, 3-4, 4-5, and 5-6 seconds.
  • the image-based collision prediction model may be trained on the near-collision dataset with the following steps: a) Initialise a VGG-16 network using weights pre-trained using the ImageNet dataset available at https://image-net.org/. b) Train the multi-stream architecture with shared weights using the above-described near-collision dataset for 100,000 epochs. c) Compute the loss, which is a mean square error between the predicted time, i.e., f(I1, I2, ..., IN), and the ground truth time to near-collision.
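  • A hedged sketch of training steps a) to c) is given below, using a small stand-in regressor and toy data in place of the multi-stream VGG-16 and the near-collision dataset; only the mean square error cost follows the disclosure, while the remaining hyper-parameters are placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in regressor; in practice this would be the multi-stream VGG-16 initialised with
# ImageNet pre-trained weights, as in step a).
model = nn.Sequential(nn.Flatten(), nn.Linear(6 * 3 * 224 * 224, 1))

# Toy stand-in for the near-collision dataset: tuples of 6 consecutive frames paired with a
# ground-truth time to near-collision in seconds.
frames = torch.randn(8, 6, 3, 224, 224)
ttc = torch.rand(8) * 6.0
loader = DataLoader(TensorDataset(frames, ttc), batch_size=4)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()            # step c): mean square error on the predicted time

for epoch in range(2):            # 2 epochs for illustration; the disclosure trains for 100,000
    for x, y in loader:           # step b): train with shared weights over the dataset
        optimizer.zero_grad()
        pred = model(x).squeeze(1)          # f(I1, ..., IN): predicted time to near-collision
        loss = loss_fn(pred, y)
        loss.backward()
        optimizer.step()
```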
  • method 600 may comprise operation 624 wherein a number of potential collision events is determined.
  • the number of potential collision events may be determined based on a number of traffic participants within a predefined distance from the vehicle.
  • the number of potential collision events may correspond to the number of traffic participants within 1 metre from the vehicle.
  • the number of potential collision events may be based on a time to near- collision for each traffic participant, wherein the time to near-collision corresponds to a time at which at least one traffic participant is predicted to come within a predefined distance of the ego vehicle.
  • the predefined distance may be 1 metre.
  • the number of potential collision events may correspond to the number of traffic participants within 1 metre from the ego vehicle with a predicted time to near-collision of between 2.6 and 4 seconds.
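  • A minimal sketch of operation 624 under the example thresholds above (traffic participants predicted to come within one metre, with a time to near-collision between 2.6 and 4 seconds) might look as follows; the function and argument names are illustrative.

```python
from typing import Optional, Sequence

def count_potential_collisions(times_to_near_collision: Sequence[Optional[float]],
                               t_min: float = 2.6, t_max: float = 4.0) -> int:
    """Count traffic participants whose predicted time to near-collision (time until they
    come within the predefined distance, e.g. one metre) falls inside [t_min, t_max].
    None means no near-collision is predicted within the horizon."""
    return sum(1 for t in times_to_near_collision if t is not None and t_min <= t <= t_max)

# Example: three tracked participants; only the second counts as a potential collision event.
assert count_potential_collisions([5.1, 3.2, None]) == 1
```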
  • Fig. 7 is a schematic illustration of a method of determining the evaluation metric of number of lane changes by other traffic participants, in accordance with embodiments of the present disclosure.
  • the number of lane changes by other traffic participants may be determined using any known lane change model configured to receive as input one or more consecutive image frames and to generate as output a lane change behaviour for each traffic participant.
  • the lane change model may be an image-based lane change model configured to determine a lane change behaviour for each traffic participant based on action recognition in image sequences captured by sensors. It is emphasised that method 700 illustrated in Fig. 7 is an example of a method of determining the evaluation metric of number of lane changes by other traffic participants, and other methods may be used.
  • method 700 of determining the evaluation metric of number of lane changes by other traffic participants may be carried out in operation 232 of method 200. In some embodiments, method 700 of determining the evaluation metric of number of lane changes by other traffic participants may be carried out by evaluation module 172. Method 700 of determining the evaluation metric of number of lane changes by other traffic participants may preferably be carried out in embodiments where the plurality of sensors 108 is mounted on the front of the vehicle.
  • method 700 may comprise operation 708 wherein consecutive images associated with the representative sensor pose are received.
  • when analysing the number of lane changes by other traffic participants at a current timepoint, the consecutive images may be images associated with previous or preceding timepoints. In some embodiments, the consecutive images may be within 0.5 to 1 milliseconds of each other. Where the representative sensor pose corresponds to a sensor of the plurality of sensors, the consecutive images may be obtained from such sensor. Where the representative sensor pose corresponds to an unseen pose, the consecutive images may be generated using the representation of the scene generated in operation 316 of method 300, in particular, by performing fixed-view time interpolation, wherein the fixed view corresponds to the representative sensor pose.
  • method 700 may comprise operation 716 wherein a lane change behaviour is determined for each traffic participant.
  • the lane change behaviour of each traffic participant may be classified into a left lane change (LLC), right lane change (RLC), and no lane change (NLC).
  • a lane change model may be employed in operation 716 to predict a lane change behaviour for each traffic participant.
  • the lane change model may be an image-based model that determines whether each traffic participant in an image captured at a certain timepoint is changing lanes by looking at images captured at previous or preceding timepoints, also termed as an observation horizon or time window.
  • image-based lane change model may implicitly include positional, contextual, and symbolic information.
  • An example of an image-based model is the spatio-temporal model named SlowFast network disclosed in “SlowFast Networks for Video Recognition” by Feichtenhofer et al., wherein an example of the architecture may be found at least in Fig. 1 and Section 3.
  • the SlowFast network uses a two-stream approach, wherein a first (slow) stream is designed to capture semantic information given by a few sparse images operating at low frame rates and slow refreshing speed, and a second (fast) stream is responsible for capturing rapidly changing motion by operating at high temporal resolution and fast refreshing speed.
  • Both streams use ResNet50 as a backbone and are fused by lateral connections.
  • the SlowFast network may be defined with one convolutional layer, five residual blocks, and one fully connected layer adapted to three classes, namely left lane change (LLC), right lane change (RLC), and no lane change (NLC).
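  • The following is a simplified two-pathway sketch in the spirit of the SlowFast design, with a slow pathway sampling a few sparse frames and a fast pathway operating on the full temporal resolution, fused into a three-class head for LLC, RLC, and NLC. The torchvision r3d_18 backbones and the late-concatenation fusion are stand-in assumptions; the cited work uses ResNet50 pathways fused by lateral connections.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class TwoPathwayLaneChange(nn.Module):
    """Simplified SlowFast-style classifier for LLC / RLC / NLC."""
    def __init__(self, slow_stride: int = 8, num_classes: int = 3):
        super().__init__()
        self.slow_stride = slow_stride
        self.slow = r3d_18(weights=None)   # stand-in backbones; the cited work uses ResNet50 pathways
        self.fast = r3d_18(weights=None)
        feat_dim = self.slow.fc.in_features
        self.slow.fc = nn.Identity()
        self.fast.fc = nn.Identity()
        self.head = nn.Linear(2 * feat_dim, num_classes)   # lateral connections approximated by late fusion

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3, T, H, W); the slow pathway only sees every slow_stride-th frame,
        # while the fast pathway sees all frames.
        slow_feat = self.slow(clip[:, :, :: self.slow_stride])
        fast_feat = self.fast(clip)
        return self.head(torch.cat([slow_feat, fast_feat], dim=1))

logits = TwoPathwayLaneChange()(torch.randn(1, 3, 32, 112, 112))   # (1, 3) scores for LLC, RLC, NLC
```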
  • the SlowFast network or spatio-temporal model may be trained using a dataset comprising sets of videos for lane change classification classes, such as LLC, RLC, and NLC.
  • An example of a dataset that may be used is the PREVENTION dataset generated in “The PREVENTION dataset: a novel benchmark for PREdiction of VEhicles iNTentlONs” by Izquierdo et al.
  • operation 716 may comprise uniformly sampling a number of clips from the consecutive images received in operation 708, stacking the clips with an image associated with the current timepoint, and cropping the stacked images with the same region of interest (ROI) as used during the training of the lane change model before inputting the images into the trained lane change model.
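  • A sketch of the pre-processing in operation 716 might look as follows; the frame counts, image sizes, and region of interest coordinates are placeholder assumptions.

```python
import numpy as np

def prepare_lane_change_input(history: np.ndarray, current: np.ndarray,
                              num_clips: int, roi: tuple) -> np.ndarray:
    """history: (T, H, W, 3) frames from the observation horizon (oldest first);
    current: (H, W, 3) frame at the current timepoint; roi: (top, bottom, left, right)
    as used during training of the lane change model."""
    idx = np.linspace(0, len(history) - 1, num_clips).astype(int)    # uniform sampling of clips
    stacked = np.concatenate([history[idx], current[None]], axis=0)  # stack with the current frame
    top, bottom, left, right = roi
    return stacked[:, top:bottom, left:right]                        # crop every frame with the same ROI

# Example with placeholder sizes and an illustrative ROI.
frames = np.zeros((20, 480, 640, 3), dtype=np.uint8)
out = prepare_lane_change_input(frames, frames[-1], num_clips=7, roi=(100, 400, 0, 640))
```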
  • method 700 may comprise operation 724 wherein a number of lane changes by other traffic participants is determined.
  • the number of lane changes by other traffic participants may correspond to the number of traffic participants classified as left lane change (LLC) or right lane change (RLC) by the lane change model in operation 716.
  • Fig. 8 is a schematic illustration of the method for evaluation of the driving of an autonomous or non-autonomous driving vehicle, in accordance with embodiments of the present disclosure.
  • Computer-implemented method 800 for evaluation of the driving of a vehicle may be implemented by a data processing device on any architecture and/or computing system.
  • various architectures employing, for example, multiple integrated circuit (IC) chips and/or packages, and/or various computing devices and/or consumer electronic (CE) devices such as multi-function devices, tablets, smart phones, etc., may implement the techniques and/or arrangements described herein.
  • Method 800 may be stored as executable instructions that, upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of method 800.
  • method 800 for evaluation of the driving of a vehicle may be implemented by system 100, wherein system 100 comprises multiple plurality of sensors 108 mounted on the vehicle.
  • method 800 may comprise operation 808 wherein at least one evaluation metric is determined for each side of the vehicle.
  • the at least one evaluation metric may be determined by carrying out method 200 on each side of the vehicle.
  • at least one evaluation metric may be determined for a left side of the vehicle, at least one evaluation metric may be determined for a right side of the vehicle, at least one evaluation metric may be determined for a front side of the vehicle, and at least one evaluation metric may be determined for a rear side of the vehicle.
  • method 800 may comprise operation 816 wherein at least one aggregated evaluation metric is generated.
  • the at least one aggregated evaluation metric may be generated by summing up each of the at least one evaluation metric determined in operation 808.
  • the at least one aggregated evaluation metric may be generated by summing up each of the at least one evaluation metric determined for a left side of the vehicle, at least one evaluation metric determined for a right side of the vehicle, at least one evaluation metric determined for a front side of the vehicle, and at least one evaluation metric determined for a rear side of the vehicle.
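  • A minimal sketch of operations 808 and 816, determining metrics per side and summing them into aggregated metrics, might look as follows; the metric names and values are illustrative.

```python
from collections import Counter
from typing import Dict

def aggregate_metrics(per_side: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Sum each evaluation metric over the front, rear, left, and right sides."""
    total: Counter = Counter()
    for side_metrics in per_side.values():
        total.update(side_metrics)
    return dict(total)

per_side = {
    "front": {"potential_collisions": 1, "traffic_participants": 6, "lane_changes": 2},
    "rear":  {"potential_collisions": 0, "traffic_participants": 3, "lane_changes": 0},
    "left":  {"potential_collisions": 0, "traffic_participants": 2, "lane_changes": 1},
    "right": {"potential_collisions": 1, "traffic_participants": 4, "lane_changes": 0},
}
aggregated = aggregate_metrics(per_side)
# -> {'potential_collisions': 2, 'traffic_participants': 15, 'lane_changes': 3}
```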
  • method 800 may comprise operation 824 wherein a control signal is generated based on the at least one aggregated evaluation metric generated in operation 816.
  • the control signal may be generated when the at least one evaluation metric indicates unsafe driving behaviour.
  • the control signal may be configured to control the vehicle.
  • the control signal may be configured to generate an alert for a driver of the vehicle, stop driving functions of the vehicle, and/or adjust an autonomous driving function controlling the vehicle.
  • Fig. 9 is a schematic illustration of a method of generating a control signal, in accordance with embodiments of the present disclosure. Method 900 of generating a control signal may be carried out by control module 180 of system 100. In some embodiments, method 900 of generating a control signal may be carried out at operation 232 of method 200 and/or at operation 824 of method 800.
  • method 900 may comprise operation 908 wherein a total score is generated based on the at least one evaluation metric determined in operation 232 of method 200 and/or the at least one aggregated evaluation metric generated in operation 816 of method 800.
  • the total score may be generated at various time intervals during a journey. For example, the total score may be generated every 10 minutes in a journey.
  • the total score may be generated using a single machine learning model configured to receive the at least one evaluation metric determined in operation 232 of method 200 and/or the at least one aggregated evaluation metric generated in operation 816 of method 800 and determine a total score.
  • the total score may be generated using method 1000 described in relation to Fig. 10.
  • the total score may be provided as a report to the driver or a third party.
  • an annotated dataset may be generated to train the machine learning model configured to receive the at least one evaluation metric determined in operation 232 of method 200 and/or the at least one aggregated evaluation metric generated in operation 816 of method 800 and determine a total score.
  • the annotated dataset may comprise at least 2,500 instances.
  • An example of the steps to generate the annotated dataset is as follows: a) Collect data by driving a data collection vehicle through multiple road environments and situations (e.g., roads and highways), during multiple time situations (e.g., peak hours, non-peak hours, morning, and night), different traffic participant densities, and different weather conditions (e.g., spring, summer, autumn, winter, rainy, sunny, and cloudy).
  • the single machine learning model may be a trained neural network with a multilayer perceptron architecture.
  • Table 2 illustrates an example of the architecture of a trained neural network with a multilayer perceptron architecture, wherein N represents the number of inputs into the trained neural network.
  • the activation functions may be set to be the commonly used sigmoid activation function or ReLU activation function and the weights may be randomly initialized to numbers between 0.01 and 0.1, while the biases may be randomly initialized to numbers between 0.1 and 0.9.
  • the neural network may be adjusted by using a cost function, such as the commonly used Mean Error (ME), Mean Squared Error (MSE), Mean Absolute Error (MAE), or Root Mean Squared Error (RMSE).
  • the neural network may be adjusted by backpropagating the cost function to update the weights and biases using an AdamOptimizer.
  • the neural network may be trained with a learning rate of 0.001 and weight decay of 0.0005. In some embodiments, the neural network may be trained from scratch for 50 to 100 epochs, although the number of epochs may vary depending on the size of the training dataset and/or the size of the neural network.
  • a single machine learning model may be trained to determine a total score based on the at least one evaluation metric and/or at least one aggregated evaluation metric with the following details:
  • Dataset split: 60:20:20 (training : verification : testing)
  • Input: place, journey start datetime, journey end datetime, weather, other participant lane change count, potential collision event count, traffic participant count, speed of vehicle, peak or not peak hours, etc.
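  • A sketch of the multilayer perceptron and training configuration described above (Adam optimiser, learning rate 0.001, weight decay 0.0005, mean squared error cost, 50 to 100 epochs) is given below; the hidden layer sizes and toy data are assumptions, since Table 2 is not reproduced here.

```python
import torch
import torch.nn as nn

N = 10                                    # number of inputs (place, datetimes, weather, counts, speed, ...)

# Hidden layer sizes are illustrative; Table 2 of the disclosure defines the actual architecture.
model = nn.Sequential(
    nn.Linear(N, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),                     # total score output
)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)
loss_fn = nn.MSELoss()                    # one of the cost functions listed above

# Toy stand-in for the annotated dataset, split 60:20:20 into training, verification, and testing.
features, scores = torch.randn(100, N), torch.rand(100, 1) * 100.0
train_x, train_y = features[:60], scores[:60]

for epoch in range(50):                   # 50 to 100 epochs depending on dataset and network size
    optimizer.zero_grad()
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    optimizer.step()
```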
  • method 900 may comprise operation 916 wherein a control signal is generated based on the total score generated in operation 908.
  • a control signal to control the vehicle may be generated when the total score indicates an unsafe driving behaviour. For example, where the maximum score is 100, a score below 40 may indicate unsafe driving behaviour.
  • the total score may be generated and monitored over a period of time before the control signal is generated.
  • the control signal may be generated only when the total score generated over a period of time consistently indicates an unsafe driving behaviour.
  • the control signal may be configured to generate an alert to the driver, generate an alert to other traffic participants, or stop and/or adjust driving functions.
  • the control signal may be configured to adjust an autonomous driving function controlling the vehicle.
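  • A sketch of operation 916 under the example above (maximum score 100, scores below 40 indicating unsafe driving), generating a control signal only when the total score is consistently unsafe over a monitoring window, might look as follows; the window length and signal name are illustrative.

```python
from collections import deque
from typing import Optional

class ControlSignalGenerator:
    """Monitor total scores over a window and emit a control signal on consistently unsafe driving."""
    def __init__(self, threshold: float = 40.0, window: int = 3):
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def update(self, total_score: float) -> Optional[str]:
        self.recent.append(total_score)
        if len(self.recent) == self.recent.maxlen and all(s < self.threshold for s in self.recent):
            # Could instead stop/adjust driving functions or adjust an autonomous driving function.
            return "ALERT_DRIVER"
        return None

gen = ControlSignalGenerator()
signals = [gen.update(s) for s in (45.0, 38.0, 35.0, 30.0)]   # [None, None, None, 'ALERT_DRIVER']
```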
  • Fig. 10 is a schematic illustration of a method of generating a total score, in accordance with embodiments of the present disclosure.
  • Method 1000 of generating a total score may be carried out by control module 180 of system 100.
  • method 1000 of generating a total score may be carried out at operation 908 of method 900.
  • method 1000 may comprise operation 1008 wherein a rating is determined for each of the at least one evaluation metric and/or at least one aggregated evaluation metric.
  • the rating may be determined by mapping each of the at least one evaluation metric and/or at least one aggregated evaluation metric onto a rating on a predefined scale.
  • the predefined scale may be 1 to 5 and the rating may be any value between 1 and 5. Any other predefined scale values may also be employed.
  • the mapping may be based on predefined rules that were defined manually by an expert or automatically.
  • the at least one evaluation metric and/or at least one aggregated evaluation metric may be mapped onto a predefined scale automatically using one or more machine learning models configured to predict a rating based on the at least one evaluation metric and/or at least one aggregated evaluation metric.
  • machine learning models include random forest, support vector machines, and neural networks.
  • separate machine learning models may be trained for different locations, such as countries, cities, and towns.
  • an annotated dataset may be generated to train the one or more machine learning models.
  • the annotated dataset may comprise at least 2,500 instances.
  • An example of the steps to generate the annotated dataset is as follows: d) Collect data by driving a data collection vehicle through multiple road environments and situations (e.g., roads and highways), during multiple time situations (e.g., peak hours, non-peak hours, morning, and night), different traffic participant densities, and different weather conditions (e.g., spring, summer, autumn, winter, rainy, sunny, and cloudy). e) Determine the at least one evaluation metric and/or at least one aggregated evaluation metric for the data collected. f) Annotate the data with a rating on a predefined scale. The annotation may be carried out by an expert.
  • the one or more machine learning models may be a trained neural network with a multilayer perceptron architecture.
  • the architecture and training of the trained neural network with a multilayer perceptron configured to determine a total score described above in relation to Table 2 may also be employed for the one or more machine learning models configured to predict a rating based on the at least one evaluation metric and/or at least one aggregated evaluation metric.
  • separate machine learning models may be trained for each of the at least one evaluation metric, and separate machine learning models may be trained for each of the at least one aggregated evaluation metric.
  • a first machine learning model may be trained to predict a rating based at least on a number of traffic participants with the following details:
  • a second machine learning model may be trained to predict a rating based at least on a number of potential collision events with the following details:
  • a third machine learning model may be trained to predict a rating based at least on a number of lane changes by other traffic participants with the following details:
  • method 1000 may comprise operation 1016 wherein the rating for each of the at least one evaluation metric and/or at least one aggregated evaluation metric is weighted to generate a weighted score.
  • the weighting of the rating of each of the at least one evaluation metric and/or at least one aggregated evaluation metric may be based on predefined weights.
  • method 1000 may comprise operation 1024 wherein a total score is generated based on the weighted score.
  • the total score may be generated by summing up the weighted scores generated in operation 1016.
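  • A minimal sketch of operations 1008 to 1024, mapping each metric onto a predefined 1 to 5 scale, weighting the ratings, and summing the weighted scores, is given below; the rule-based mappings and weights are purely illustrative and could equally be produced by the machine learning models described above.

```python
from typing import Dict

# Illustrative rule-based mapping of a metric value onto the predefined 1 to 5 scale.
def rate(metric_name: str, value: float) -> int:
    thresholds = {                        # value boundaries for ratings 5, 4, 3, 2 (else 1)
        "potential_collisions": [0, 1, 2, 4],
        "lane_changes": [1, 3, 5, 8],
        "traffic_participants": [5, 10, 20, 40],
    }
    for rating, bound in zip((5, 4, 3, 2), thresholds[metric_name]):
        if value <= bound:
            return rating
    return 1

def total_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Operations 1016 and 1024: weight each rating with a predefined weight and sum the weighted scores."""
    return sum(weights[name] * rate(name, value) for name, value in metrics.items())

metrics = {"potential_collisions": 2, "lane_changes": 4, "traffic_participants": 15}
weights = {"potential_collisions": 10.0, "lane_changes": 4.0, "traffic_participants": 6.0}
score = total_score(metrics, weights)     # 10*3 + 4*3 + 6*3 = 60
```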
  • the total score may be generated at various time intervals during a journey. For example, the total score may be generated every 10 minutes in a journey.
  • the total score may be mapped onto a predefined rating scale.
  • the output from operations 1008, 1016, and 1024 may be provided as a report to the driver or a third party.
  • Table 3 below illustrates an example of the output from operations 1008, 1016, and 1024 of method 1000.

Landscapes

  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Traffic Control Systems (AREA)

Abstract

A method and system for evaluation of the driving of an autonomous or non-autonomous driving vehicle is disclosed. The method comprises: receiving a plurality of images captured by a plurality of sensors, wherein the plurality of sensors is mounted on a vehicle and positioned to capture a scene on a side of the vehicle; selecting a representative sensor pose for the side of the vehicle based at least on the plurality of images; and determining at least one evaluation metric based on at least one image associated with the representative sensor pose. The method may further comprise generating a control signal to control the vehicle. A vehicle, computer program, machine-readable storage medium, and a data carrier signal are also disclosed.

Description

SYSTEM AND METHOD FOR EVALUATION OF THE DRIVING OF A VEHICLE
TECHNICAL FIELD
[0001] The invention relates generally to the evaluation of the driving of a vehicle, and more specifically to a system and method for evaluation of the driving of an autonomous or non- autonomous driving vehicle based on the surrounding environment.
BACKGROUND
[0002] With an increasing number of vehicles on the road, there is an increasing need to evaluate the driving of vehicles to ensure safety on the roads, whether the driving is by human operators or an autonomous system. There are several methods to evaluate the driving of a vehicle and systems have been developed to provide feedback or evaluation of driving, as well as to provide information as to how driving can be improved. One method of evaluating driving is through analysis of a driver’s skill while turning. As turning requires a driver to perform complicated manipulations such as manipulating the brake pedal, steering wheel, and accelerator pedal at the appropriate time, intensity, and angle, turning has been used in prior art to evaluate the driver’s skills. Other methods to evaluate the driver’s skill include a comparison between a threshold and a vector calculated by synthesizing the longitudinal acceleration and the lateral acceleration.
[0003] The methods described tend to analyse and evaluate the driving of a vehicle by considering how the vehicle is manipulated. However, the driving of a vehicle cannot be considered in isolation as many factors contribute to the driving of a vehicle and associated safety level on the road, including the complexity of the environment surrounding the vehicle driven. In addition, evaluation of the driving of a vehicle cannot be considered at a single timepoint based on a specific manoeuvre of the vehicle. An accurate evaluation of the driving of a vehicle is therefore not achieved in existing methods which focus on the manipulation of a vehicle in isolation and based on a single manoeuvre of the vehicle.
SUMMARY
[0004] It is an object of the present invention to provide an improved method and system for the evaluation of the driving of a vehicle by using sensors to capture a scene surrounding an ego vehicle during a particular trip. Such sensor data is then used to evaluate the driving of the vehicle by determining, for example, the complexity of the scene surrounding the vehicle and number of potential accidents and/or collisions.
[0005] The object of the present invention is solved by the subject-matter of the independent claims, wherein further embodiments are incorporated into the dependent claims.
[0006] It shall be noted that all embodiments of the present invention concerning a method might be carried out with the order of the steps as described, nevertheless this does not have to be the only or essential order of the steps of the method. The herein presented methods can be carried out with another order of the disclosed steps without departing from the respective method embodiment, unless explicitly mentioned to the contrary hereinafter.
[0007] To solve the above technical problems, the present invention provides a computer- implemented method for evaluation of the driving of an autonomous or a non-autonomous driving vehicle, the method comprising: receiving a plurality of images captured by a plurality of sensors, wherein the plurality of sensors is mounted on a vehicle and positioned to capture a scene on a side of the vehicle; selecting a representative sensor pose for the side of the vehicle based at least on the plurality of images; and determining at least one evaluation metric based on at least one image associated with the representative sensor pose.
[0008] The computer-implemented method of the present invention is advantageous over known methods as the method provides a better evaluation of the driving of a vehicle by accounting for the surroundings of the ego vehicle, wherein such evaluation may be used to increase overall driving safety. The computer-implemented method may be used to evaluate the driving of any vehicle, whether or not the driving is by a driver or an autonomous driving system. The computer-implemented method of the present invention provides information on the surroundings of the ego vehicle by using sensors to capture information about the scene surrounding the ego vehicle, and using the information captured by the sensors to analyse the scene surrounding the ego vehicle. The computer-implemented method of the present invention is also advantageous as the selection of a representative pose for each plurality of sensors based at least on the plurality of images ensures that the images subsequently used to analyse the scene surrounding the ego vehicle are the most informative and representative of the scene surrounding the ego vehicle.
[0009] A preferred method of the present invention is a computer-implemented method as described above, wherein the plurality of sensors captures at least a 180° field of view of the side of the vehicle, and/or each sensor of the plurality of sensors has an overlapping field of view with a neighbouring sensor, and/or each sensor of the plurality of sensors has a field of view of at least 60°.
[0010] The above-described aspect of the present invention has the advantage that as much information is obtained and captured about the surroundings as possible, by having a maximum field of view covered by each plurality of sensors, while ensuring that no portion of the environment surrounding the ego vehicle is missed out and keeping the number of sensors required low.
[0011] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein selecting a representative sensor pose for the side of the vehicle is further based on at least one novel image of the scene on the side of the vehicle.
[0012] The above-described aspect of the present invention has the advantage that using novel images of the scene on the side of the ego vehicle allows a more comprehensive evaluation and analysis of the scene surrounding the vehicle without the incorporation of additional sensors. In some situations, the novel images may also allow the detection of traffic participants which may otherwise be occluded or partially occluded in the individual received plurality of images.
[0013] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the at least one novel image of the scene on the side of the vehicle is generated by: receiving depth data corresponding to the plurality of images captured by the plurality of sensors; generating a representation of the scene on the side of the vehicle based on the received plurality of images and corresponding depth data, wherein the representation accounts for space and time; and generating at least one novel image of the scene on the side of the vehicle based on the representation of the scene on the side of the vehicle.
[0014] The above-described aspect of the present invention has the advantage that generating a representation of the scene that accounts for space and time allows the generation of novel images with perspectives of unseen poses or dimensions for a more comprehensive overview of the scene at the side of the ego vehicle as compared to only relying on images captured by the at least one plurality of sensors, which may comprise one or more partially or completely occluded objects depending on the sensor pose and the time at which the plurality of images and corresponding data are recorded or captured.
[0015] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein selecting a representative sensor pose for the side of the vehicle comprises: identifying a plurality of traffic participants in the plurality of images and/or at least one novel image; and selecting, from the plurality of images and/or at least one novel image, a representative image, wherein the representative sensor pose is a sensor pose associated with the representative image, and wherein preferably the representative image is the image that comprises the largest number of traffic participants.
[0016] The above-described aspect of the present invention has the advantage that by selecting an image with the most traffic participants as the representative image, the viewpoint corresponding to such representative image would most likely be the viewpoint from which most information of the scene at the side of the ego vehicle may be obtained.
[0017] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the at least one evaluation metric is based on a complexity of the scene at the side of the vehicle and one or more potential collision events.
[0018] The above-described aspect of the present invention has the advantage that using at least one evaluation metric based on the complexity of the scene surrounding the ego vehicle and one or more potential collision events allows a better evaluation of the driving of the ego vehicle in context, i.e., evaluating the driving based on how the vehicle is manoeuvred to prevent accidents or collisions while taking into account how complex the surroundings are.
[0019] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the at least one evaluation metric comprises one or more of: a number of potential collision events, a number of traffic participants, or a number of lane changes by other traffic participants.
[0020] The above-described aspect of the present invention has the advantage that the usage of evaluation metrics such as number of potential collision events, number of traffic participants, or number of lane changes by other traffic participants provide a better evaluation of the driving of the ego vehicle by considering how the vehicle is manoeuvred in response to the actions of other traffic participants.
[0021] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the at least one evaluation metric comprises a number of potential collision events, wherein a potential collision event is determined based on a time to near-collision for each traffic participant, wherein determining a number of potential collision events preferably comprises: receiving consecutive images associated with the representative sensor pose; predicting a time to near-collision for each traffic participant, more preferably using an image-based collision prediction model; and determining a number of potential collision events.
[0022] The above-described aspect of the present invention has the advantage that the evaluation of the driving of the vehicle is more accurate as the likelihood of a collision between the ego vehicle and another traffic participant is accounted for. Such likelihood of a collision is accounted for based on a time to near-collision, wherein the time to near- collision is the predicted time at which at least one traffic participant is going to come within a predefined distance from the vehicle. In general, the shorter the time to near-collision, the more unsafe the driving of the vehicle as it indicates that the vehicle is driving at a close proximity to other traffic participants. The usage of an image-based collision prediction model is more preferred as it only requires images as input to predict time to near-collision. [0023] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the at least one evaluation metric comprises a number of lane changes by other traffic participants, wherein determining a number of lane changes by other traffic participants preferably comprises: receiving consecutive images associated with the representative sensor pose; predicting a lane change behaviour for each traffic participant, more preferably using an image-based lane change model; and determining a number of lane changes by other traffic participants.
[0024] The above-described aspect of the present invention has the advantage that the evaluation of the driving of the vehicle is more accurate as the behaviour of other traffic participants is accounted for. Such behaviour of other traffic participants is accounted for based on the number of lane change behaviour. In general, the higher the number of lane changes by other traffic participants, the more complex the scene and the more difficult it is for the vehicle to navigate. The usage of an image-based lane change model is more preferred as it only requires images as input to predict lane change behaviour.
[0025] The above-described advantageous aspects of a computer-implemented method of the invention also hold for all aspects of a below-described computer-implemented method of the invention. All below-described advantageous aspects of a computer-implemented method of the invention also hold for all aspects of an above-described computer- implemented method of the invention.
[0026] The invention also relates to a computer-implemented method for evaluation of the driving of an autonomous or a non-autonomous driving vehicle, the method comprising: carrying out the computer-implemented method of any of the preceding claims on each side of the vehicle to determine at least one evaluation metric for each side of the vehicle; and generating at least one aggregated evaluation metric.
[0027] The computer-implemented method of the present invention is advantageous as the evaluation of all sides of the ego vehicle allows information to be captured of the entire 360° view around the ego vehicle so as to obtain a complete understanding of the environment surrounding the ego vehicle. [0028] A preferred method of the present invention is a computer-implemented method as described above, further comprising: generating a control signal to control the vehicle based on the at least one evaluation metric and/or at least one aggregated evaluation metric, wherein the control signal is preferably configured to: generate an alert for a driver of the vehicle; generate an alert for other traffic participants; stop and/or adjust driving functions of the vehicle; and/or adjust an autonomous driving function controlling the vehicle.
[0029] The above-described aspect of the present invention has the advantage that the control signal may be configured to control the vehicle to address any unsafe driving behaviour indicated by the at least one evaluation metric and/or at least one aggregated evaluation metric to increase the overall driving safety on the roads.
[0030] The above-described advantageous aspects of a computer-implemented method of the invention also hold for all aspects of a below-described system of the invention. All below-described advantageous aspects of a system of the invention also hold for all aspects of an above-described computer-implemented method of the invention.
[0031] The invention also relates to system comprising at least one plurality of sensors, at least one processor and a memory that stores executable instructions for execution by the at least one processor, the executable instructions comprising instructions for performing the computer-implemented method of the invention.
[0032] The above-described advantageous aspects of a computer-implemented method or system of the invention also hold for all aspects of a below-described vehicle of the invention. All below-described advantageous aspects of a vehicle of the invention also hold for all aspects of an above-described computer-implemented method or system of the invention.
[0033] The invention also relates to a vehicle comprising the system of the invention, wherein the system comprises four plurality of sensors mounted on the vehicle and positioned to capture a scene surrounding the vehicle. [0034] The above-described advantageous aspects of a computer-implemented method, system, or vehicle of the invention also hold for all aspects of below-described computer program, a machine-readable storage medium, or a data carrier signal of the invention. All below-described advantageous aspects of a computer program, a machine-readable storage medium, or a data carrier signal of the invention also hold for all aspects of an abovedescribed computer-implemented method, system, or vehicle of the invention.
[0035] The invention also relates to a computer program, a machine-readable storage medium, or a data carrier signal that comprises instructions, that upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of a computer-implemented method according to the invention. The machine-readable medium may include any medium and/or mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device). The machine-readable medium may be any medium, such as for example, read-only memory (ROM); random access memory (RAM); a universal serial bus (USB) stick; a compact disc (CD); a digital video disc (DVD); a data storage device; a hard disk; electrical, acoustical, optical, or other forms of propagated signals (e.g., digital signals, data carrier signal, carrier waves), or any other medium on which a program element as described above can be transmitted and/or stored.
[0036] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term “sensor” includes any sensor that detects or responds to some type of input from a perceived environment or scene. Examples of sensors include cameras, video cameras, LiDAR sensors, radar sensors, depth sensors, light sensors, colour sensors, or red, green, blue, and distance (RGBD) sensors.
[0037] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term “sensor data” means the output or data of a device, also known as a sensor, that detects and responds to some type of input from the physical environment.
[0038] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term “scene” refers to a distinct physical environment that may be captured by one or more sensors. A scene may include one or more objects that may be captured by one or more sensors, whether such object is stationary, static, or mobile. [0039] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term “vehicle” refers to any mobile agent capable of movement, including cars, trucks, buses, agricultural machines, forklift, robots, whether or not such mobile agent is capable of carrying or transporting goods, animals, or humans, and whether or not such mobile agent is driven by a human or is an autonomous mobile agent.
[0040] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term “ego vehicle” refers to the vehicle which is of primary interest in the evaluation of driving, whether or not such vehicle is driven by a human or is an autonomous driving system.
BRIEF DESCRIPTION OF THE DRAWINGS
[0041] These and other features, aspects, and advantages will become better understood with regard to the following description, appended claims, and accompanying drawings where:
[0042] Fig. 1 is a schematic illustration of a system for evaluation of the driving of an autonomous or non-autonomous driving vehicle, in accordance with embodiments of the present disclosure;
[0043] Fig. 2 is a flowchart of a method for evaluation of the driving of an autonomous or non-autonomous driving vehicle, in accordance with embodiments of the present disclosure;
[0044] Fig. 3 is a schematic illustration of a method of generating at least one novel image of the scene on the side of the vehicle, in accordance with embodiments of the present disclosure;
[0045] Fig. 4 is a schematic illustration of a method of generating a representation of a scene, in accordance with embodiments of the present disclosure;
[0046] Fig. 5 is a schematic illustration of a method of selecting a representative sensor pose for the side of the vehicle, in accordance with embodiments of the present disclosure;
[0047] Fig. 6 is a schematic illustration of a method of determining the evaluation metric of number of potential collision events, in accordance with embodiments of the present disclosure; [0048] Fig. 7 is a schematic illustration of a method of determining the evaluation metric of number of lane changes by other traffic participants, in accordance with embodiments of the present disclosure;
[0049] Fig. 8 is a schematic illustration of the method for evaluation of the driving of an autonomous or non-autonomous driving vehicle, in accordance with embodiments of the present disclosure;
[0050] Fig. 9 is a schematic illustration of a method of generating a control signal, in accordance with embodiments of the present disclosure; and
[0051] Fig. 10 is a schematic illustration of a method of generating a total score, in accordance with embodiments of the present disclosure.
[0052] In the drawings, like parts are denoted by like reference numerals.
[0053] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION
[0054] In the summary above, in this description, in the claims below, and in the accompanying drawings, reference is made to particular features (including method steps) of the invention. It is to be understood that the disclosure of the invention in this specification includes all possible combinations of such particular features. For example, where a particular feature is disclosed in the context of a particular aspect or embodiment of the invention, or a particular claim, that feature can also be used, to the extent possible, in combination with and/or in the context of other particular aspects and embodiments of the invention, and in the inventions generally.
[0055] In the present document, the word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment or implementation of the present subject matter described herein as “exemplary” is not necessarily be construed as preferred or advantageous over other embodiments.
[0056] While the disclosure is susceptible to various modifications and alternative forms, specific embodiment thereof has been shown by way of example in the drawings and will be described in detail below. It should be understood, however that it is not intended to limit the disclosure to the forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternative falling within the scope of the disclosure.
[0057] The present disclosure is directed to methods, systems, vehicles, computer programs, machine-readable storage media, and data carrier signals for evaluation of the driving of a vehicle. Evaluation of the driving of the vehicle is carried out by the usage of sensors to analyse the complexity of the scene surrounding the vehicle, wherein such complexity of the scene surrounding the vehicle is accounted for when evaluating the driving of the vehicle. In some embodiments, a representation of the scene surrounding the vehicle and novel images of the scene surrounding the vehicle may be generated in order to get a more comprehensive understanding and subsequent evaluation of the scene surrounding the vehicle. In some embodiments, the evaluation results may be aggregated over a journey for an even more comprehensive evaluation of the driving of the vehicle.
[0058] The following description sets forth exemplary methods, parameters, and the like. It should be recognized, however, that such description is not intended as a limitation on the scope of the present disclosure but is instead provided as a description of exemplary embodiments.
[0059] Although the following description uses terms “first,” “second,” etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another. For example, a first plurality of sensors could be termed a second plurality of sensors, and, similarly, a third plurality of sensors could be termed a first plurality of sensors, without departing from the scope of the various described embodiments. The first plurality of sensors, the second plurality of sensors, the third plurality of sensors, and the fourth plurality of sensors are all sets of sensors, but they are not the same plurality of sensors. [0060] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that on-going technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. The terms “comprises”, “comprising”, “includes” or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a setup, device or method that includes a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup or device or method. In other words, one or more elements in a system or apparatus proceeded by “comprises... a” does not, without more constraints, preclude the existence of other elements or additional elements in the system or method. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
[0061] Fig. 1 is a schematic illustration of a system for evaluation of the driving of an autonomous or non-autonomous driving vehicle, in accordance with embodiments of the present disclosure. System 100 for evaluation of the driving of a vehicle may comprise a plurality of sensors 108 and at least one processor 116. In some embodiments, the plurality of sensors 108 may be mounted on the vehicle and positioned to capture a scene on a side of the vehicle. In some embodiments, the plurality of sensors 108 may capture at least a 180° field of view of the side of the vehicle. In some embodiments, each sensor of the plurality of sensors 108 may have an overlapping field of view with a neighbouring sensor. In some embodiments, each sensor of the plurality of sensors 108 may have a field of view of at least 60°. In some embodiments, the plurality of sensors 108 may be mounted on a front of the vehicle, a rear of the vehicle, a left of the vehicle, or a right of the vehicle. In some embodiments, there may be four plurality of sensors 108 mounted on the vehicle to capture a 360° view of the scene surrounding the vehicle: a first plurality of sensors 108 mounted on a front of the vehicle, a second plurality of sensors 108 mounted on a right of the vehicle, a third plurality of sensors 108 mounted on a left of the vehicle, and a fourth plurality of sensors 108 mounted on a rear of the vehicle.
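As a rough, non-limiting illustration of the field-of-view constraints above (at least a 180° field of view per side, sensors with a field of view of at least 60°, and overlapping neighbouring fields of view), the following sketch estimates how many sensors would be needed per side; the 10° overlap is an assumed value chosen only for illustration.

```python
import math

def sensors_needed(side_fov_deg: float = 180.0, sensor_fov_deg: float = 60.0,
                   overlap_deg: float = 10.0) -> int:
    """Minimum number of sensors whose fields of view, each overlapping its neighbour by
    overlap_deg, jointly cover side_fov_deg."""
    if sensor_fov_deg >= side_fov_deg:
        return 1
    effective = sensor_fov_deg - overlap_deg   # additional coverage contributed by each extra sensor
    return 1 + math.ceil((side_fov_deg - sensor_fov_deg) / effective)

# Example: 180 degree side coverage with 60 degree sensors overlapping by 10 degrees -> 4 sensors.
print(sensors_needed())
```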
[0062] According to some embodiments, the plurality of sensors 108 may comprise visible light sensors to generate a plurality of images 124. A visible light sensor captures information relating to the colour of objects in a scene and generates images of a scene. In some embodiments, the visible light sensor may be a camera or video camera configured to capture a plurality of images 124.
[0063] According to some embodiments, the plurality of sensors 108 may further comprise depth sensors to generate depth data 132 corresponding to the plurality of images 124. A depth sensor captures information relating to the distances of surfaces of objects and the ground in a scene. In some embodiments, the depth sensor may be an infrared or lidar camera or video camera. In some embodiments, the plurality of sensors 108 may comprise red, green, blue, depth (RGBD) cameras which capture both visible light information and corresponding depth information to generate the plurality of images 124 with corresponding depth data 132. In some embodiments, the plurality of sensors 108 may comprise red, green, blue (RGB) cameras which capture visible light information to generate the plurality of images 124, and infrared or lidar cameras which capture depth information to generate depth data 132, and each of the plurality of images 124 may be associated with a corresponding depth data 132. In some embodiments, the plurality of sensors 108 may capture the plurality of images 124 and/or corresponding depth data 132 continuously over a period of time or over a journey. In some embodiments, the plurality of sensors 108 may capture the plurality of images 124 and/or corresponding depth data 132 at fixed time intervals over a period of time or over a journey.
[0064] According to some embodiments, each image of the plurality of images 124 and corresponding depth data 132 may be associated a timestamp corresponding to the timepoint at which such image of the plurality of images 124 and/or corresponding depth data 132 was captured. In some embodiments, the timepoint indicated by the timestamp may be the time elapsed since vehicle started to move. In some embodiments, the timepoint indicated by the timestamp may be a point in time as measured in hours past midnight or noon. In some embodiments, each image of the plurality of images 124 and its corresponding depth data 132 may each be associated with a sensor pose, wherein the sensor pose corresponds to the pose (position and orientation) of the specific sensor that captured such image of the plurality of images 124 and corresponding depth data 132. In some embodiments, each image of the plurality of images 124 and its corresponding depth data 132 may each be associated with one or more sensor parameters, wherein the one or more sensor parameters correspond to the parameters of the specific sensor that captured such image of the plurality of images 124 and corresponding depth data 132.
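For illustration, the image, corresponding depth data, timestamp, sensor pose, and sensor parameters described above could be bundled into a record such as the following sketch; the field names and types are assumptions rather than a prescribed data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple
import numpy as np

@dataclass
class SensorFrame:
    """One captured image with its corresponding depth data and metadata."""
    image: np.ndarray                    # H x W x 3 visible-light image
    depth: np.ndarray                    # H x W depth map corresponding to the image
    timestamp: float                     # e.g. seconds elapsed since the vehicle started to move
    sensor_pose: Tuple[float, ...]       # position and orientation of the capturing sensor
    sensor_parameters: Dict[str, float] = field(default_factory=dict)   # e.g. focal length

frame = SensorFrame(
    image=np.zeros((480, 640, 3), dtype=np.uint8),
    depth=np.zeros((480, 640), dtype=np.float32),
    timestamp=12.4,
    sensor_pose=(1.2, 0.0, 1.5, 0.0, 0.0, 90.0),   # x, y, z, roll, pitch, yaw (illustrative)
    sensor_parameters={"focal_length_mm": 4.0},
)
```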
[0065] According to some embodiments, the plurality of images 124 and/or corresponding depth data 132, with the associated timestamp, sensor pose and/or one or more sensor parameters, may be received and processed by the at least one processor 116. In some embodiments, the at least one processor 116 may receive the plurality of images 124 and/or corresponding depth data 132 from the plurality of sensors 108. In some embodiments, the plurality of images 124 and corresponding depth data 132 from the one plurality of sensors 108 may be stored in a database 140 and may be retrieved from the database 140 to be processed by the at least one processor 116.
[0066] According to some embodiments, the at least one processor 116 may comprise several modules. In some embodiments, the at least one processor 116 may comprise an image processing module 148. In some embodiments, image processing module 148 may be configured to receive the plurality of images 124 and corresponding depth data 132 and generate at least one novel image 156 of the scene on the side of the vehicle. In some embodiments, the at least one novel image 156 may comprise one, some, or all possible viewpoints of the scene to get perspectives of unseen poses or dimensions of the scene at the side of the ego vehicle.
[0067] According to some embodiments, the at least one processor 116 may comprise a representative sensor pose selection module 164. In some embodiments, representative sensor pose selection module 164 be coupled to the image processing module 148 for example, by manner of one or both of wired coupling and wireless coupling. In some embodiments, representative sensor pose selection module 164 may receive the plurality of images 124 and/or at least one novel image 156 generated by the image processing module 148 to select a representative sensor pose for side of the vehicle based at least on the plurality of images 124 and/or at least one novel image 156. [0068] According to some embodiments, the at least one processor 116 may comprise an evaluation module 172. In some embodiments, the evaluation module 172 may be coupled to the representative sensor pose selection module 164 for example, by manner of one or both of wired coupling and wireless coupling. In some embodiments, the evaluation module 172 may determine at least one evaluation metric based on at least one image associated with the representative sensor pose selected by the representative sensor pose selection module 164. In some embodiments, the at least one image associated with the representative sensor pose may comprise the plurality of images 124 captured by the plurality of sensors 108. In some embodiments, the at least one image associated with the representative sensor pose may comprise the at least one novel image 156 generated by image processing module 148. In some embodiments, the at least one image associated with the representative sensor pose may comprise a combination of the plurality of images 124 captured by the plurality of sensors 108 and the at least one novel image 156 generated by image processing module 148. In some embodiments, the evaluation module 172 may generate at least one aggregated evaluation metric based on the at least one evaluation metric determined for all four sides of the vehicle. In some embodiments, the evaluation module 172 may monitor the at least one evaluation metric and/or at least one aggregated evaluation metric over a period of time. It is contemplated that, in accordance with an embodiment of this disclosure, the evaluation of driving of the vehicle, for example, may be communicated via an interactive dashboard (e.g., an electronic dashboard module) which can, for example, facilitate the provision of a report and/or feedback to a driver or a manufacturer of the vehicle. It is further contemplated that, in accordance with an embodiment of this disclosure, the evaluation of the driving of the vehicle can, for example, be communicated to a third-party service provider and/or other persons of interest.
[0069] According to some embodiments, the at least one processor 116 may comprise a control module 180 that is configured to generate a control signal for the vehicle based on the at least one evaluation metric and/or at least one aggregated evaluation metric determined by evaluation module 172. In some embodiments, the control module 180 may be coupled to the evaluation module 172, for example, by manner of one or both of wired coupling and wireless coupling. In some embodiments, the control signal may be adapted to control the vehicle. In some embodiments, the control signal may be generated when the at least one evaluation metric and/or at least one aggregated evaluation metric indicates an unsafe driving behaviour. In some embodiments, the control signal may be configured to generate an alert for a driver of the vehicle. The alert for the driver may take any form, including visual, auditory, tactile, or any combination thereof. Examples include an audible alarm or voice notification, a visible notification on a dashboard display, or a vibration. In some embodiments, the control signal may be configured to generate an alert for other traffic participants. The alert for the other traffic participants may take any form, including visual, auditory, or any combination thereof. Examples include the activation of the horn, or the flashing of the headlights. In some embodiments, the control signal may be configured to stop and/or adjust driving functions of the vehicle. For example, the control signal may be configured to stop driving functions of the vehicle and switch to an autonomous driving function. In another example, the control signal may be configured to limit the maximum speed at which the vehicle may travel. In some embodiments, the control signal may be configured to adjust an autonomous driving function controlling the vehicle.
[0070] For the sake of convenience, the operations of the present disclosure are described as interconnected functional blocks or distinct software modules. This is not necessary, however, and there may be cases where these functional blocks or software modules are equivalently aggregated into a single logic device, program, or operation with unclear boundaries. In any event, the functional blocks and software modules or described features can be implemented by themselves, or in combination with other operations in either hardware or software. Each of the modules described may correspond to one or both of a hardware-based module and a software-based module. The present disclosure contemplates the possibility that each of the modules described may be an integrated software-hardware based module (e.g., an electronic part which can carry a software program/algorithm in association with receiving and processing functions/an electronic module programmed to perform the functions of receiving, processing and/or transmitting). The present disclosure yet further contemplates the possibility that each of the modules described can be an integrated hardware module (e.g., a hardware-based transceiver) capable of performing the functions of receiving, processing and/or transmitting.
[0071] Fig. 2 is a flowchart of a method for evaluation of the driving of an autonomous or non-autonomous driving vehicle, in accordance with embodiments of the present disclosure. Computer-implemented method 200 for evaluation of the driving of a vehicle may be implemented by a data processing device on any architecture and/or computing system. For example, various architectures employing, for example, multiple integrated circuit (IC) chips and/or packages, and/or various computing devices and/or consumer electronic (CE) devices such as multi-function devices, tablets, smart phones, etc., may implement the techniques and/or arrangements described herein. Method 200 may be stored as executable instructions that, upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of method 200. In some embodiments, method 200 for evaluation of the driving of a vehicle may be implemented by system 100.
[0072] According to some embodiments, method 200 for evaluation of the driving of a vehicle may commence at operation 208, wherein a plurality of images 124 captured by a plurality of sensors 108 mounted on a vehicle and positioned to capture a scene on a side of the vehicle is received. In some embodiments, the plurality of images 124 may be received from the plurality of sensors 108, for example, by manner of one or both of wired coupling and wireless coupling. In some embodiments, the plurality of images 124 may be stored on and retrieved from database 140, for example, by manner of one or both of wired coupling and wireless coupling. In some embodiments, the plurality of sensors 108 may capture at least a 180° field of view of the side of the vehicle which the plurality of sensors 108 is mounted on. In some embodiments, each sensor of the plurality of sensors 108 may have an overlapping field of view with a neighbouring sensor. In some embodiments, each sensor of the plurality of sensors 108 may have a field of view of at least 60°. In some embodiments, the plurality of images 124 may be captured by the plurality of sensors 108 over a period of time. In some embodiments, the plurality of images 124 may be captured by the plurality of sensors 108 at the same timepoint.
[0073] According to some embodiments, the plurality of sensors 108 may be mounted on a front of the vehicle, a rear of the vehicle, a left of the vehicle, or a right of the vehicle. In some embodiments, there may be two or more plurality of sensors 108 mounted on the vehicle. In some embodiments, there may be four plurality of sensors 108 mounted on the vehicle: a first plurality of sensors 108 mounted on a front of the vehicle, a second plurality of sensors 108 mounted on a right of the vehicle, a third plurality of sensors 108 mounted on a left of the vehicle, and a fourth plurality of sensors 108 mounted on a rear of the vehicle. In some embodiments, method 200 may be carried out for each of the first, second, third, and fourth plurality of sensors 108. [0074] According to some embodiments, method 200 may comprise operation 216 wherein a representative sensor pose is selected for the side of the vehicle based at least on the plurality of images 124. The representative sensor pose for the side of the vehicle is the sensor pose which is likely to provide the most information about the scene at such side of the ego vehicle. In some embodiments, operation 216 may be carried out by representative sensor pose selection module 164. In some embodiments, the representative sensor pose selected may be the pose of one of the sensors of the plurality of sensors. In some embodiments, the representative sensor pose selected may be an unseen sensor pose, which is a sensor pose that is not any of the poses of the sensors of the plurality of sensors. In some embodiments, the selection of a representative sensor pose may be based on the plurality of images 124 received in operation 208. In some embodiments, the selection of a representative sensor pose may be further based on at least one novel image 156 that may be generated in an operation 224. Operation 224 for the generation of at least one novel image 156 is discussed in further detail in relation to Figs. 3 and 4. Operation 216 for the selection of the representative sensor pose is discussed in further detail in relation to Fig. 5.
[0075] According to some embodiments, method 200 may comprise operation 232 wherein at least one evaluation metric is determined based on at least one image associated with the representative sensor pose selected in operation 216. In some embodiments, operation 232 may be carried out by evaluation module 172. In some embodiments, the at least one image associated with the representative sensor pose may comprise the plurality of images 124 received in operation 208, the at least one novel image 156 generated in operation 224, or some combination thereof. The at least one evaluation metric may be any evaluation metric that may be indicative of the scene at the side of the ego vehicle, and/or any interaction of the ego vehicle with the scene at the side of the ego vehicle. In some embodiments, the at least one evaluation metric may be based on a complexity of the scene at the side of the vehicle and one or more potential collision events. In some embodiments, the at least one evaluation metric may comprise one or more of: a number of potential collision events, a number of traffic participants, or a number of lane changes by other traffic participants. The determination of the evaluation metric of number of potential collision events is discussed in further detail in relation to Fig. 6, and the determination of the evaluation metric of number of lane changes by other traffic participants is discussed in further detail in relation to Fig. 7. Table 1 below is an example of an output from operation 232, wherein the at least one evaluation metric comprises: a number of potential collision events, a number of traffic participants, and a number of lane changes by other traffic participants.
Table 1
(Table 1 is provided in the original application as an image; it lists, for a side of the vehicle, the number of potential collision events, the number of traffic participants, and the number of lane changes by other traffic participants.)
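By way of illustration only, the at least one evaluation metric determined in operation 232 for one side of the vehicle may be held in a simple data structure such as the following Python sketch; the field names and example values are purely illustrative and do not reproduce the contents of Table 1.

    from dataclasses import dataclass

    @dataclass
    class SideEvaluationMetrics:
        # Illustrative container for the per-side output of operation 232.
        side: str                        # "front", "rear", "left" or "right"
        potential_collision_events: int
        traffic_participants: int
        lane_changes_by_others: int

    # Example instance corresponding to one row of a Table-1-style output.
    front_metrics = SideEvaluationMetrics(
        side="front",
        potential_collision_events=2,
        traffic_participants=7,
        lane_changes_by_others=1,
    )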
[0076] According to some embodiments, method 200 may comprise operation 240 wherein a control signal to control the vehicle is generated based on the at least one evaluation metric determined in operation 232. In some embodiments, operation 240 may be carried out by control module 180. In some embodiments, the control signal may be generated when the at least one evaluation metric indicates unsafe driving behaviour. In some embodiments, the control signal may be configured to control the vehicle. In some embodiments, the control signal may be configured to generate an alert for a driver of the vehicle, generate an alert for other traffic participants, stop and/or adjust driving functions of the vehicle, and/or adjust an autonomous driving function controlling the vehicle.
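The following minimal sketch illustrates how operation 240 might map the at least one evaluation metric onto a control signal; the threshold, the speed limit and the structure of the returned signal are assumptions made for illustration and are not prescribed by this disclosure. It reuses the SideEvaluationMetrics container sketched above.

    def generate_control_signal(metrics: SideEvaluationMetrics,
                                collision_threshold: int = 1) -> dict:
        # Hypothetical rule: driving is flagged as unsafe when the number of
        # potential collision events exceeds an illustrative threshold.
        unsafe = metrics.potential_collision_events > collision_threshold
        return {
            "alert_driver": unsafe,        # e.g. audible alarm or dashboard notification
            "alert_traffic": unsafe,       # e.g. horn activation or headlight flash
            "limit_max_speed_kmh": 60 if unsafe else None,  # illustrative value
        }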
[0077] Fig. 3 is a schematic illustration of a method of generating at least one novel image of the scene on the side of the vehicle, in accordance with embodiments of the present disclosure. In some embodiments, method 300 of generating at least one novel image of the scene on the side of the vehicle may be carried out at operation 224 of method 200. In some embodiments, method 300 may be carried out by image processing module 148.
[0078] According to some embodiments, method 300 of generating at least one novel image of the scene on the side of the vehicle may comprise operation 308 wherein depth data 132 corresponding to the plurality of images 124 is received. In some embodiments, the corresponding depth data 132 may be received from the plurality of sensors 108, for example, by manner of one or both of wired coupling and wireless coupling. In some embodiments, the corresponding depth data 132 may be stored on and retrieved from database 140, for example, by manner of one or both of wired coupling and wireless coupling. [0079] According to some embodiments, method 300 may comprise operation 316 wherein a representation of the scene on the side of the vehicle is generated based on the received plurality of images 124 and corresponding depth data 132, wherein the representation of the scene accounts for space and time. In some embodiments, the representation of the scene may be generated based on the received plurality of images 124 captured over a period of time. The representation of the scene on the side of the vehicle may be generated using any known method used to generate scene representations that account for space and time.
[0080] An example of a representation that accounts for space and time is Neural Scene Flow Fields (NSFF), a representation that models a dynamic scene as a time-variant continuous function of appearance, geometry and three-dimensional (3D) motion. NSFF is described in detail in “Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes” by Li et. al. In general, the network disclosed by Li et. al. comprises two neural networks or multi-layer perceptrons (MLPs), corresponding to a static (time-independent) scene representation, and a dynamic (time-dependent) scene representation. A neural network generally comprises an input layer comprising one or more input nodes, one or more hidden layers each comprising one or more hidden nodes, and an output layer comprising one or more output nodes. An MLP is a type of neural network comprising a series of fully connected layers that connect every neuron in one layer to every neuron in the preceding and subsequent layer. The detailed architectures of the MLPs of the static and dynamic scene representations are illustrated in Figs. 2 and 3 respectively of the supplementary material of the paper by Li et. al. In general, the static scene representation is trained such that, for a given 3D location x = (x, y, z) and 2D viewing direction d = (θ, φ), the MLP (F_Θ) predicts an emitted colour c = (r, g, b) and a volume density σ. The static scene representation may thus be defined as (c, σ) = F_Θ(x, d). In some embodiments, the static scene representation may be a neural radiance field (NeRF) as disclosed and described in detail in “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis” by Mildenhall et. al. In general, the dynamic scene representation is trained such that, for a given 3D location x = (x, y, z), 2D viewing direction d = (θ, φ), and time i, the MLP (F_Θ^dy) predicts an emitted colour c_i = (r, g, b), a volume density σ_i, and forward and backward scene flow F_i = (f_{i→i+1}, f_{i→i-1}), which denote 3D offset vectors that point to the position of x at times i + 1 and i - 1 respectively. Motion disocclusions in 3D space are handled by predicting disocclusion weights W_i = (w_{i→i+1}, w_{i→i-1}). The dynamic scene representation may thus be defined using the equation (c_i, σ_i, F_i, W_i) = F_Θ^dy(x, d, i), wherein subscript i indicates a value at a specific time i.
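Purely as a non-limiting sketch, the two scene representations described above may be approximated by two small multilayer perceptrons along the following lines; PyTorch is used here, and the layer widths, encoding sizes and heads are simplified assumptions rather than the exact architectures of Li et. al. or Mildenhall et. al.

    import torch
    import torch.nn as nn

    class StaticSceneMLP(nn.Module):
        # Simplified stand-in for the static representation (c, σ) = F_Θ(x, d).
        def __init__(self, pos_dim=63, dir_dim=27, width=256):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(pos_dim, width), nn.ReLU(),
                nn.Linear(width, width), nn.ReLU(),
            )
            self.sigma_head = nn.Linear(width, 1)           # volume density σ
            self.colour_head = nn.Sequential(               # view-dependent colour c
                nn.Linear(width + dir_dim, 128), nn.ReLU(), nn.Linear(128, 3)
            )

        def forward(self, x_enc, d_enc):
            h = self.trunk(x_enc)
            colour = self.colour_head(torch.cat([h, d_enc], dim=-1))
            return colour, self.sigma_head(h)

    class DynamicSceneMLP(nn.Module):
        # Simplified stand-in for (c_i, σ_i, F_i, W_i) = F_Θ^dy(x, d, i).
        def __init__(self, pos_dim=63, dir_dim=27, time_dim=9, width=256):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(pos_dim + time_dim, width), nn.ReLU(),
                nn.Linear(width, width), nn.ReLU(),
            )
            self.sigma_head = nn.Linear(width, 1)
            self.flow_head = nn.Linear(width, 6)    # forward and backward 3D scene flow
            self.weight_head = nn.Linear(width, 2)  # disocclusion weights
            self.colour_head = nn.Sequential(
                nn.Linear(width + dir_dim, 128), nn.ReLU(), nn.Linear(128, 3)
            )

        def forward(self, x_enc, d_enc, t_enc):
            h = self.trunk(torch.cat([x_enc, t_enc], dim=-1))
            colour = self.colour_head(torch.cat([h, d_enc], dim=-1))
            return (colour, self.sigma_head(h),
                    self.flow_head(h), torch.sigmoid(self.weight_head(h)))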
[0081 ] Fig. 4 is a schematic illustration of a method of generating a representation of a scene, in accordance with embodiments of the present disclosure. In some embodiments, method 400 of generating a representation of the scene may be carried out at operation 316 of method 300. In some embodiments, method 400 may be carried out by image processing module 148. In this example, the plurality of sensors comprises 5 sensors, although it is emphasized that any other number of sensors may be employed. A different representation of a scene may be generated for each scene using the Adam optimizer with a learning rate of 0.0005. Each representation of the scene comprises two networks, i.e., a static scene representation and a dynamic scene representation, and both networks are optimised simultaneously. It is emphasised that method 400 is only an example of a method of generating a representation of a scene and other appropriate methods may be employed.
[0082] According to some embodiments, method 400 may be carried out with the plurality of images 124 and corresponding depth data 132 as input. In some embodiments, the received plurality of images 124 and corresponding depth data 132 may be captured sequentially by the plurality of sensors. For example, the plurality of images and corresponding data may be captured in a sequence from a left-most sensor to a right-most sensor of the plurality of sensors, or vice versa. For example, the images and corresponding data may be captured in sequence with a time difference of between 0.5 and 1 milliseconds. Each of the plurality of images 124 and corresponding depth data 132 is associated with a sensor pose (position and orientation) and intrinsic parameters of the specific sensor that captured such image and corresponding depth data. Each of the plurality of images 124 and corresponding depth data 132 may also be associated with a timestamp corresponding to the time at which such image and corresponding depth data is captured.
[0083] According to some embodiments, method 400 may comprise operation 408, wherein a static scene representation 416 is generated. According to some embodiments, the static scene representation 416 may be a 5D scene representation comprising a multilayer perceptron. According to some embodiments, the static scene representation 416 may be a 5D scene representation approximated with two multilayer perceptron networks, which may be termed as a coarse network and a fine network, wherein both multilayer perceptron networks may have the same architecture. In some embodiments, operation 408 of generating static scene representation 416 may comprise the following steps: a) Sample a batch of camera rays from the set of all pixels in the received plurality of images 124. A hierarchical sampling procedure may be used to query Nc samples from the coarse network and Nc + Nf samples from the fine network. A batch size of 4096 rays may be used, each sampled at Nc = 64 coordinates in the coarse volume and Nf = 128 additional coordinates in the fine volume. b) Pass the positional encoding of input 3D location x to 8 fully-connected layers (using ReLU activations and 256 channels per layer), to output the volume density σ and a 256-dimensional feature vector. c) Concatenate the feature vector from the preceding step with the camera ray's viewing direction and pass the concatenated vector to one additional fully-connected layer (using a ReLU activation and 128 channels) to output the view-dependent emitted colour c. d) Perform steps (b) and (c) separately with the sample inputs obtained using Nc and Nc + Nf respectively. e) Perform volumetric rendering to render the colour of each ray for both sets of samples. f) Compute the loss, which is simply the total squared error between the rendered and true pixel colours for both the coarse and fine renderings. g) Update the parameters using the Adam optimizer with a learning rate that begins at 5 × 10⁻⁴ and decays exponentially to 5 × 10⁻⁵ over the course of optimization (other Adam hyperparameters left at default values of β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁷). h) Set the optimisation step to between 100K and 300K iterations.
Additional information on the training and architecture of the static scene representation 416 may be found in the paper by Mildenhall et. al.
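A condensed sketch of steps (b), (f) and (g) above is given below; the positional encoding length, the stand-in model and the dummy ray batch are illustrative assumptions, and the actual coarse/fine networks and volume rendering of operation 408 are not reproduced here.

    import torch

    def positional_encoding(p, num_freqs=10):
        # Sinusoidal encoding applied to 3D locations (and, with fewer
        # frequencies, to viewing directions) before the MLP, as in step (b).
        out = [p]
        for k in range(num_freqs):
            out.append(torch.sin((2.0 ** k) * p))
            out.append(torch.cos((2.0 ** k) * p))
        return torch.cat(out, dim=-1)

    # Stand-in for the coarse/fine MLPs of operation 408 (not the real NeRF).
    model = torch.nn.Sequential(torch.nn.Linear(63 + 27, 256), torch.nn.ReLU(),
                                torch.nn.Linear(256, 3))

    # Adam with a learning rate decaying exponentially from 5e-4 to 5e-5, as in step (g).
    total_steps = 200_000
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                                 betas=(0.9, 0.999), eps=1e-7)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(
        optimizer, gamma=(5e-5 / 5e-4) ** (1.0 / total_steps))

    # One optimisation step on a dummy batch of 4096 rays; in practice the
    # colours would come from volume rendering the sampled ray coordinates.
    x_enc = positional_encoding(torch.rand(4096, 3))               # 63 dims
    d_enc = positional_encoding(torch.rand(4096, 3), num_freqs=4)  # 27 dims
    rendered = model(torch.cat([x_enc, d_enc], dim=-1))
    loss = ((rendered - torch.rand(4096, 3)) ** 2).sum()           # step (f)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()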
[0084] According to some embodiments, method 400 may comprise operation 424, wherein a dynamic scene representation 432 is generated. In some embodiments, the plurality of images 124 and their corresponding depth data 132 may be ordered in sequence, wherein the ordering sequence represents time i. In some embodiments, the plurality of images 124 and corresponding depth data 132 may be ordered based on the sequence in which such images and corresponding depth data were captured. The final dataset used to generate the dynamic representation of the scene thus comprises the images, the corresponding sensor poses, the corresponding depth data, intrinsic parameters, and the time. In some embodiments, operation 424 of the generation of dynamic scene representation 432 may comprise the following steps: a) Sample a batch of camera rays from the set of all pixels in the received plurality of images 124. b) Pass the positional encoding of input 3D location x to 8 fully-connected layers (using ReLU activations and 256 channels per layer), to output σ_i, F_i, W_i and a 256-dimensional feature vector. c) Concatenate the feature vector from the preceding step with the camera ray's viewing direction and pass the concatenated vector to one additional fully-connected layer (using a ReLU activation and 128 channels) to output the view-dependent emitted colour c_i. d) Perform volumetric rendering to render the scene from time i from 1) the perspective of the camera at time i and 2) with the scene warped from time j to i, so as to undo any motion that occurred from time i to j. This yields a rendered image at time j with both camera and scene motion warped to time i. e) Use neighbouring times as N(i) = {i, i ± 1, i ± 2}, and chain scene flow and disocclusion weights for the i ± 2 cases. Note that when j = i, there is no scene flow warping or disocclusion weights involved. f) Compute the temporal photometric loss (L_pho). g) Compute the cyclic consistency loss (L_cyc). h) Compute the data-driven loss (L_data = L_geo + β_z L_z, where β_z = 2). i) Update the parameters using the Adam optimizer with a learning rate that begins at 5 × 10⁻⁴ and decays exponentially to 5 × 10⁻⁵ over the course of optimization (other Adam hyperparameters left at default values of β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁷). j) Set the optimisation step to between 100K and 300K iterations.
Additional information on the training, architecture, and loss functions of the dynamic scene representation may be found in the paper by Li et. al., as well as the associated supplementary material. [0085] According to some embodiments, method 400 may comprise operation 440 wherein a combined volume rendered image 448 is generated. In some embodiments, operation 440 may comprise combining volume rendered images from the static scene representation 416 and dynamic scene representation 432. In some embodiments, operation 440 may comprise performing a linear combination of static scene components from static scene representation 416 and dynamic scene components from dynamic scene representation 432.
[0086] According to some embodiments, method 400 may comprise operation 456 wherein a total loss 464 is calculated based on the combined volume rendered image 448. In some embodiments, operation 456 may comprise combining, with weights, a combined static and dynamic loss (L_cb), a weighted temporal photometric loss (L_pho), a cycle consistency loss (L_cyc), a geometric consistency loss (L_geo), and a single-view depth loss (L_data). In some embodiments, the geometric consistency loss may be computed using pre-trained networks such as FlowNet 2.0 available at https://github.com/lmb-freiburg/flownet2 or RAFT available at https://github.com/princeton-vl/RAFT. In some embodiments, the single-view depth loss may be computed using a pre-trained single-view depth network. An example of a pre-trained single-view depth network is the MiDAS network as disclosed in “Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer” by Ranftl et. al. pre-trained on the KITTI dataset available at https://www.cvlibs.net/datasets/kitti/. In some embodiments, the MiDAS network may be trained on the KITTI dataset which comprises 93,000 images split into a training and a test dataset with the ratio 80:20. The MiDAS network uses a ResNet-based architecture with pretrained ImageNet weights, wherein an example of the ResNet may be found at least in Figure 3 of “Deep Residual Learning for Image Recognition” by He et. al. In particular, the MiDAS network uses additional scale- and shift-invariant losses when training the ResNet architecture. The details of the training of the MiDAS network may be found at least in Section 6 of the paper by Ranftl et. al. and additional detail on the scale- and shift-invariant losses may be found at least in Section 5 of the same paper. In some embodiments, the MiDAS network may be trained on the KITTI dataset with a depth cap of 80 meters for 100 epochs with a batch size of 24. It is contemplated that any other appropriate network may be employed to calculate the geometric consistency loss and the single-view depth loss. In some embodiments, operation 456 may comprise calculating the total loss 464 based on the equation L = L_cb + L_pho + β_cyc L_cyc + β_data L_data + β_reg L_reg, wherein the β coefficients weight each term. In some embodiments, the weight terms may be initialized as β_cyc = 1, β_reg = 0.1, β_data ∈ {0.2, 0.4}.
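A minimal sketch of the weighted combination of operation 456, using the weight initialisations quoted above, could read as follows; the loss terms are assumed to be already-computed scalars or tensors.

    def total_loss(l_cb, l_pho, l_cyc, l_data, l_reg,
                   beta_cyc=1.0, beta_data=0.2, beta_reg=0.1):
        # Weighted sum of the combined static/dynamic loss, temporal photometric
        # loss, cycle consistency loss, data-driven loss and regularisation loss.
        return l_cb + l_pho + beta_cyc * l_cyc + beta_data * l_data + beta_reg * l_reg

    # Example with plain numbers standing in for the individual loss values.
    print(total_loss(0.8, 0.5, 0.3, 0.2, 0.1))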
[0087] According to some embodiments, method 400 may comprise operation 472 wherein the static scene representation 416 and dynamic scene representation 432 are optimized based on the total loss 464 calculated in operation 456. In some embodiments, the static scene representation 416 and dynamic scene representation 432 may be refined and adjusted over multiple iterations, wherein each iteration may comprise a cycle comprising operations 440, 456, and 472. In some embodiments, the static scene representation 416 and dynamic scene representation 432 may be optimized over at least 1,000,000 iterations, wherein the first 1000 × N iterations are denoted as an initialization stage to warm up the optimization, wherein N is the number of training views. In some embodiments, during the initialization stage, the scene flow window size may be initialized as 3, wherein j ∈ {i, i ± 1}. In some embodiments, after the initialization stage, the scene flow window size may be switched to 5, wherein j ∈ N(i) = {i, i ± 1, i ± 2}.
[0088] Returning to Fig. 3, method 300 may comprise operation 324 wherein at least one novel image 156 is generated of the scene based on the scene representation generated in operation 316. In some embodiments, volume rendering techniques may be carried out on the static scene representation 416 and the dynamic scene representation 432, and the output of such volume rendering may be combined by performing a linear combination of static and dynamic scene components to generate at least one novel image 156. The at least one novel image 156 generated may be based on one, some, or all possible viewpoints of the scene to get perspectives of unseen poses or dimensions to increase the comprehensiveness of the evaluation of the driving of the vehicle. In some embodiments, the at least one novel image 156 may be generated based on space-time interpolation based on predefined poses. In some embodiments, space-time interpolation may be performed using a splatting-based plane-sweep volume tracing approach. In particular, to render an image at intermediate time δ_i ∈ (0, 1) at a specified novel viewpoint, every ray emitted from the novel viewpoint is swept from front to back, and at each sampled step t along the ray, point information is queried through the static scene representation 416 and dynamic scene representation 432 at both times i and i + 1, and the 3D points at time i are displaced by the scaled scene flow, x_i + δ_i f_{i→i+1}, and similarly for time i + 1. The 3D displaced points may then be splatted onto a (c, σ) accumulation buffer at the novel viewpoint, and the splats from times i and i + 1 may be blended with linear weights 1 - δ_i and δ_i, wherein the final rendered view is obtained by volume rendering the accumulation buffer. Additional information on the splatting-based plane-sweep volume tracing approach may be found at least in Section 3.4 of the paper by Li et. al., and Section 4 of the associated supplementary material.
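The core of the space-time interpolation described above, namely displacing points by scaled scene flow and blending the two time samples with linear weights, may be sketched in highly simplified form, ignoring the splatting and final volume rendering steps; all names and shapes below are illustrative.

    import numpy as np

    def blend_time_samples(x_i, flow_i_fwd, colour_i, x_ip1, flow_ip1_bwd,
                           colour_ip1, delta):
        # Displace the 3D points of times i and i+1 by their scaled scene flow,
        # then blend their colours with linear weights (1 - delta) and delta.
        x_from_i = x_i + delta * flow_i_fwd                 # scene at time i moved forward
        x_from_ip1 = x_ip1 + (1.0 - delta) * flow_ip1_bwd   # scene at time i+1 moved back
        blended_colour = (1.0 - delta) * colour_i + delta * colour_ip1
        return x_from_i, x_from_ip1, blended_colour

    # Example with single points; delta = 0.25 renders a quarter of the way
    # between times i and i+1.
    print(blend_time_samples(np.zeros(3), np.ones(3) * 0.1, np.array([1.0, 0.0, 0.0]),
                             np.ones(3), np.ones(3) * -0.1, np.array([0.0, 1.0, 0.0]),
                             0.25))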
[0089] Fig. 5 is a schematic illustration of a method of selecting a representative sensor pose for the side of the vehicle, in accordance with embodiments of the present disclosure. In some embodiments, method 500 of selecting a representative sensor pose for the side of the vehicle may be carried out at operation 216 of method 200. In some embodiments, method 500 may be carried out by representative sensor pose selection module 164.
[0090] According to some embodiments, method 500 may comprise operation 508 wherein a plurality of traffic participants are identified in the plurality of images 124 and/or at least one novel image 156. In some embodiments, the identification of traffic participants in operation 508 may be carried out by using an object detection algorithm or model on each image to identify all objects present in each image, and filtering the results obtained. An object detection algorithm or model generally performs image recognition tasks by taking an image as input and then predicting bounding boxes and class probabilities for each object in the image. Any known object detection model or algorithm may be used, including convolutional neural networks (CNN), single-shot detectors or support vector machines. Most object detection models use a CNN to extract features from an input image to predict the probability of learned classes. An example of an object detection model that may be employed is the You Only Look Once v7 (YOLOv7) algorithm as disclosed in “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors” by Wang et. al., wherein information on the architecture of YOLOv7 may be found at least in Section 3, and information on the training of YOLOv7 may be found at least in Section 5. In some embodiments, the object detection model used to detect or predict one or more objects within each image of the plurality of images 124 and/or at least one novel image 156 may be the pretrained YOLOv7 model available at https://github.com/WongKinYiu/yolov7. In some embodiments, after all the objects present in each image have been detected, the results may be filtered to only contain objects that correspond to classes of traffic participants. Examples of classes of traffic participants include car, bus, truck, pedestrian, motorcycle, ambulance, bicycle, bicycle wheels, personal mobility devices, etc. [0091] According to some embodiments, method 500 may comprise operation 516 wherein a representative image is selected, wherein the representative sensor pose is a sensor pose associated with the representative image. In some embodiments, the representative image may be selected from the plurality of images 124 and/or at least one novel image 156. In some embodiments, the representative image may be selected based on the number of traffic participants detected in each image, wherein the representative image is preferably the image with the largest number of traffic participants detected in operation 508. In some embodiments, where two or more images each comprise the largest number of traffic participants detected in operation 508, the representative image may be selected randomly from the two or more images. Once the representative image is selected, the sensor pose associated with such representative image is selected as the representative sensor pose.
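The selection logic of operations 508 and 516 may be sketched as follows; here `detect` is a hypothetical wrapper around an object detector (for example, a pretrained YOLOv7 model) that returns the class label of every object it finds in an image, and is an assumption rather than part of this disclosure.

    from typing import Callable, List, Sequence

    TRAFFIC_CLASSES = {"car", "bus", "truck", "pedestrian", "motorcycle",
                       "ambulance", "bicycle"}

    def select_representative_index(images: Sequence,
                                    detect: Callable[[object], List[str]]) -> int:
        # Count detected traffic participants per image; the representative
        # image is the one with the largest count, and its associated sensor
        # pose becomes the representative sensor pose.
        counts = []
        for image in images:
            labels = detect(image)
            counts.append(sum(1 for label in labels if label in TRAFFIC_CLASSES))
        best = max(counts)
        # Ties could be broken randomly, as described above; here the first
        # image with the maximum count is taken for simplicity.
        return counts.index(best)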
[0092] Fig. 6 is a schematic illustration of a method of determining the evaluation metric of number of potential collision events, in accordance with embodiments of the present disclosure. In some embodiments, the number of potential collision events may be determined using any known collision prediction algorithms. In some embodiments, the number of potential collision events may be determined using a collision detection algorithm configured to receive image data and sensor data as input for depth scene analysis to localise and predict the relative movements of traffic participants to the vehicle. For example, when the representative sensor pose corresponds to the pose of one of the plurality of sensors, the image and corresponding depth data captured by the specific sensor may be used as input to such collision detection algorithm. In some embodiments, the number of potential collision events may be determined by a model configured to receive image data to identify traffic participants and predict a risk estimator based on temporal cues. Such models use the power of large data to predict potential collisions, without the need to rely on explicit geometric depth estimates or velocity information. It is emphasised that method 600 illustrated in Fig. 6 is an example of a method of determining the evaluation metric of number of potential collision events, and other methods may be used.
[0093] According to some embodiments, method 600 of determining the evaluation metric of number of potential collision events may be carried out in operation 232 of method 200. In some embodiments, method 600 of determining the evaluation metric of number of potential collision events may be carried out by evaluation module 172. [0094] According to some embodiments, method 600 may comprise operation 608 wherein consecutive images associated with the representative sensor pose are received. In some embodiments, when analysing the number of potential collision events at a current timepoint, the consecutive images may be images associated with previous or preceding timepoints. In some embodiments, the consecutive images may be within 0.5 to 1 milliseconds of each other. Where the representative sensor pose corresponds to the sensor pose of one of the sensors of the plurality of sensors, the consecutive images may be obtained from such sensor. Where the representative sensor pose corresponds to an unseen pose, the consecutive images may be generated using the representation of the scene generated in operation 316 of method 300, in particular, by performing fixed-view time interpolation, wherein the fixed-view corresponds to the representative sensor pose. Additional information on performing fixed- view time interpolation may be found at least in Section 3.4 of the paper by Li et. al., and Section 4 of the associated supplementary material.
[0095] According to some embodiments, method 600 may comprise operation 616 wherein a time to near-collision is predicted for each traffic participant. The time to near-collision is the predicted time at which at least one traffic participant is going to come within a predefined distance from the vehicle. In some embodiments, the predefined distance may be 1 metre. In some embodiments, a collision prediction model may be employed in operation 616. In some embodiments, an image-based collision prediction model may be employed in operation 616.
[0096] According to some embodiments, the image-based collision prediction model may comprise a convolutional neural network (CNN). A convolutional neural network (CNN) is a multi-layered feed-forward neural network, made by stacking many hidden layers on top of each other in sequence. The sequential design may allow CNNs to learn hierarchical features. The hidden layers are typically convolutional layers followed by activation layers, some of them followed by pooling layers. The CNN may be configured to identify patterns in data. The convolutional layer may include convolutional kernels that are used to look for patterns across the input data. The convolutional kernel may return a large positive value for a portion of the input data that matches the kernel's pattern or may return a smaller value for another portion of the input data that does not match the kernel's pattern. A CNN is preferred as a CNN may be able to extract informative features from the training data without the need for manual processing of the training data. Also, a CNN is computationally efficient as it is able to assemble patterns of increasing complexity using the relatively small kernels in each hidden layer. An example of an image-based collision prediction model that may be employed is the multi-stream VGG-16 architecture disclosed in “Forecasting Time-to-Collision from Monocular Video: Feasibility, Dataset, and Challenges” by Manglik et. al., wherein each stream takes a 224 x 224 RGB frame as input to extract spatial features, which are then concatenated across all frames preserving the temporal order and then fed into a fully-connected layer to output time to collision. Additional details of the architecture may be found at least in section IV.B and Fig. 4 of the paper by Manglik et. al.
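As a rough, non-authoritative sketch of the multi-stream idea described above, the following PyTorch module passes each frame of a short history through a shared convolutional backbone and regresses a single time-to-near-collision value; the small backbone, layer sizes and frame count are assumptions and do not reproduce the VGG-16 based architecture of Manglik et. al.

    import torch
    import torch.nn as nn

    class TimeToCollisionNet(nn.Module):
        # Each of num_frames input frames goes through a shared backbone; the
        # per-frame features are concatenated in temporal order and a
        # fully-connected head regresses the time to near-collision in seconds.
        def __init__(self, num_frames: int = 6):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.head = nn.Sequential(
                nn.Linear(32 * num_frames, 128), nn.ReLU(), nn.Linear(128, 1))

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, num_frames, 3, H, W), e.g. H = W = 224
            feats = [self.backbone(frames[:, k]) for k in range(frames.shape[1])]
            return self.head(torch.cat(feats, dim=-1))

    # Example: predicted times for a batch of two 6-frame histories.
    model = TimeToCollisionNet(num_frames=6)
    ttc = model(torch.rand(2, 6, 3, 224, 224))   # shape (2, 1), in seconds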
[0097] According to some embodiments, the image-based collision prediction model may be trained on a near-collision dataset. In some embodiments, the near-collision dataset may comprise labelled sensor data recorded by sensors mounted on a data collection vehicle. In some embodiments, the sensor data of the training dataset may be generated by sensors mounted on all four sides of a data collection vehicle driven through multiple road environments and situations (e.g., roads and highways), during multiple time situations (e.g., peak hours, non-peak hours, morning, and night), different traffic participant densities, and different weather conditions (e.g., spring, summer, autumn, winter, rainy, sunny, and cloudy). In some embodiments, the sensor data may comprise image data captured by image sensor(s) and depth data captured by depth sensor(s) (e.g., LIDAR) for automatic ground truth label generation. In some embodiments, the sensors may be initially calibrated with correspondences. The calibration allows accurate projection of LIDAR points onto images to obtain estimated depth values for the image-based traffic participant detection. In some embodiments, YOLOv7 may be used to identify and generate bounding boxes around traffic participants in the images. In some embodiments, the 3D position of each detected bounding box may be obtained by computing a median distance of the pixels using the 3D point cloud. Each image may be annotated with a binary label where a positive label indicates the presence of at least one traffic participant within a predefined distance (e.g., one metre) from the sensor. In some embodiments, a time to near-collision may be calculated for each traffic participant based on a short temporal history of a few image frames. For example, when using a tuple of N consecutive image frames (I_1, I_2, ..., I_N) recorded at a frame rate of 10 fps as history, estimation of proximity over the next 6 seconds may be carried out by looking at the next 60 binary labels in the future, annotated as {label_n+1, label_n+2, ..., label_n+60}. If the index of the first positive label in the example sequence of labels is denoted as T, then the ground truth time to near-collision may be t = T/10 seconds. In some embodiments, the dataset may comprise 15,000 video sequences, split into 12,000 for the training set and 3,000 for the test set. In the test set, the count of data points for each time-to-collision interval should preferably be balanced. The time-to-collision interval may vary between 0 and T seconds. In some embodiments, T may be set to 6, which results in 0-1, 1-2, 2-3, 3-4, 4-5, 5-6 intervals in seconds. In some embodiments, the image-based collision prediction model may be trained on the near-collision dataset with the following steps: a) Initialise a VGG-16 network using weights pre-trained using the ImageNet dataset available at https://image-net.org/. b) Train the multi-stream architecture with shared weights using the above-described near-collision dataset for 100,000 epochs. c) Compute the loss, which is a mean square error between the predicted time, i.e., f(I_1, I_2, ..., I_N), and the ground truth time denoted as t_true. d) Optimise the loss using mini-batch gradient descent of batch size 24 with the learning rate of 0.001.
It is contemplated that any other appropriate image-based collision prediction model and/or training method may be employed.
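The ground-truth label generation described above, in which t = T/10 seconds at a frame rate of 10 fps, can be sketched as follows; the function name and the handling of sequences without any positive label are illustrative choices.

    def time_to_near_collision(future_labels, fps: float = 10.0):
        # Given the next binary labels (1 = at least one traffic participant
        # within the predefined distance), return the index of the first
        # positive label divided by the frame rate, or None if no positive
        # label occurs within the horizon.
        for index, label in enumerate(future_labels, start=1):
            if label:
                return index / fps
        return None

    assert time_to_near_collision([0, 0, 0, 1, 0]) == 0.4   # 4th label at 10 fps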
[0098] According to some embodiments, method 600 may comprise operation 624 wherein a number of potential collision events is determined. In some embodiments, the number of potential collision events may be determined based on a number of traffic participants within a predefined distance from the vehicle. For example, the number of potential collision events may correspond to the number of traffic participants within 1 metre from the vehicle. In some embodiments, the number of potential collision events may be based on a time to near-collision for each traffic participant, wherein the time to near-collision corresponds to a time at which at least one traffic participant is predicted to come within a predefined distance of the ego vehicle. In some embodiments, the predefined distance may be 1 metre. For example, the number of potential collision events may correspond to the number of traffic participants within 1 metre from the ego vehicle with a predicted time to near-collision of between 2.6 and 4 seconds. [0099] Fig. 7 is a schematic illustration of a method of determining the evaluation metric of number of lane changes by other traffic participants, in accordance with embodiments of the present disclosure. In some embodiments, the number of lane changes by other traffic participants may be determined using any known lane change model configured to receive as input one or more consecutive image frames and to generate as output a lane change behaviour for each traffic participant. In some embodiments, the lane change model may be an image-based lane change model configured to determine a lane change behaviour for each traffic participant based on action recognition in image sequences captured by sensors. It is emphasised that method 700 illustrated in Fig. 7 is an example of a method of determining the evaluation metric of number of lane changes by other traffic participants, and other methods may be used.
[0100] In some embodiments, method 700 of determining the evaluation metric of number of lane changes by other traffic participants may be carried out in operation 232 of method 200. In some embodiments, method 700 of determining the evaluation metric of number of lane changes by other traffic participants may be carried out by evaluation module 172. Method 700 of determining the evaluation metric of number of lane changes by other traffic participants may preferably be carried out in embodiments where the plurality of sensors 108 is mounted on the front of the vehicle.
[0101] According to some embodiments, method 700 may comprise operation 708 wherein consecutive images associated with the representative sensor pose are received. In some embodiments, when analysing the number of lane changes by other traffic participants at a current timepoint, the consecutive images may be images associated with previous or preceding timepoints. In some embodiments, the consecutive images may be within 0.5 to 1 milliseconds of each other. Where the representative sensor pose corresponds to a sensor of the plurality of sensors, the consecutive images may be obtained from such sensor. Where the representative sensor pose corresponds to an unseen pose, the consecutive images may be generated using the representation of the scene generated in operation 316 of method 300, in particular, by performing fixed-view time interpolation, wherein the fixed-view corresponds to the representative sensor pose. Additional information on performing fixed-view time interpolation may be found at least in Section 3.4 of the paper by Li et. al., and Section 4 of the associated supplementary material. [0102] According to some embodiments, method 700 may comprise operation 716 wherein a lane change behaviour is determined for each traffic participant. In some embodiments, the lane change behaviour of each traffic participant may be classified into a left lane change (LLC), right lane change (RLC), and no lane change (NLC). In some embodiments, a lane change model may be employed in operation 716 to predict a lane change behaviour for each traffic participant. In some embodiments, the lane change model may be an image-based model that determines whether each traffic participant in an image captured at a certain timepoint is changing lanes by looking at images captured at previous or preceding timepoints, also termed as an observation horizon or time window. Such an image-based lane change model may implicitly include positional, contextual, and symbolic information. An example of an image-based model is the spatio-temporal model named SlowFast network disclosed in “SlowFast Networks for Video Recognition” by Feichtenhofer et. al., wherein an example of the architecture may be found at least in Fig. 1 and section 3. In general, the SlowFast network uses a two-stream approach, wherein a first (slow) stream is designed to capture semantic information given by a few sparse images operating at low frame rates and slow refreshing speed, and a second (fast) stream is responsible for capturing rapidly changing motion by operating at high temporal resolution and fast refreshing speed. Both streams use ResNet50 as a backbone and are fused by lateral connections. The temporal stride in the first (slow) stream may be τ = 16, and the frame rate ratio between the second (fast) stream and the first (slow) stream may be α = 8. The ratio of channels of the second (fast) stream with respect to the first (slow) stream may be β = 1/8. The SlowFast network may be defined with one convolutional layer, five residual blocks, and one fully connected layer adapted to three classes, namely left lane change (LLC), right lane change (RLC), and no lane change (NLC). In some embodiments, the SlowFast network or spatio-temporal model may be trained using a dataset comprising sets of videos for lane change classification classes, such as LLC, RLC, and NLC. An example of a dataset that may be used is the PREVENTION dataset generated in “The PREVENTION dataset: a novel benchmark for PREdiction of VEhicles iNTentIONs” by Izquierdo et. al., which comprises data from 3 radars, 2 cameras and 1 LiDAR, covering a range of up to 80 meters around the ego-vehicle (up to 200 meters in the frontal area), wherein road lane markings are included, and the final position of the vehicles is provided by fusing data from the three types of sensors, and the sets of videos are sorted into three classes, namely LLC, RLC, and NLC. An example of the training of the SlowFast network for usage in operation 716 of method 700 is as follows: a) Initialise the SlowFast network with random weights. b) Set the input size for both streams at 112 x 112. c) Divide the PREVENTION dataset into a training dataset (85%) and a validation dataset (15%). d) For the temporal domain, randomly sample a clip (of αT × τ frames) from the full-length view, wherein the inputs into the slow and fast streams are T and αT frames respectively. e) Train for 15,000 iterations with linear warm-up for the first 1,000 iterations. f) Set the learning rate as step-wise and reduce the learning rate 10x when validation error saturates. g) Set the weight decay at 10⁻⁷.
It is contemplated that any other appropriate neural network architecture, dataset, and/or training method may be employed.
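A minimal counting sketch for operations 716 and 724 is given below; here `classify` stands for a hypothetical wrapper around a trained spatio-temporal model (such as the SlowFast network described above) and is an assumption, not part of this disclosure.

    from typing import Callable, Sequence

    def count_lane_changes(participant_clips: Sequence,
                           classify: Callable[[object], str]) -> int:
        # `classify` returns "LLC", "RLC" or "NLC" for the image sequence of
        # one traffic participant; the evaluation metric is the number of
        # participants classified as a left or right lane change.
        return sum(1 for clip in participant_clips
                   if classify(clip) in ("LLC", "RLC"))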
[0103] According to some embodiments, operation 716 may comprise uniformly sampling a number of clips from the consecutive images received in operation 708, stacking the clips with an image associated with the current timepoint, and cropping the stacked images with the same region of interest (ROI) as used during the training of the lane change model before inputting the images into the trained lane change model.
[0104] According to some embodiments, method 700 may comprise operation 724 wherein a number of lane changes by other traffic participants is determined. In some embodiments, the number of lane changes by other traffic participants may correspond to the number of traffic participants classified as left lane change (LLC) or right lane change (RLC) by the lane change model in operation 716.
[0105] Fig. 8 is a schematic illustration of the method for evaluation of the driving of an autonomous or non-autonomous driving vehicle, in accordance with embodiments of the present disclosure. Computer-implemented method 800 for evaluation of the driving of a vehicle may be implemented by a data processing device on any architecture and/or computing system. For example, various architectures employing, for example, multiple integrated circuit (IC) chips and/or packages, and/or various computing devices and/or consumer electronic (CE) devices such as multi-function devices, tablets, smart phones, etc., may implement the techniques and/or arrangements described herein. Method 800 may be stored as executable instructions that, upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of method 800. In some embodiments, method 800 for evaluation of the driving of a vehicle may be implemented by system 100, wherein system 100 comprises multiple plurality of sensors 108 mounted on the vehicle.
[0106] According to some embodiments, method 800 may comprise operation 808 wherein at least one evaluation metric is determined for each side of the vehicle. In some embodiments, the at least one evaluation metric may be determined by carrying out method 200 on each side of the vehicle. In some embodiments, at least one evaluation metric may be determined for a left side of the vehicle, at least one evaluation metric may be determined for a right side of the vehicle, at least one evaluation metric may be determined for a front side of the vehicle, and at least one evaluation metric may be determined for a rear side of the vehicle.
[0107] According to some embodiments, method 800 may comprise operation 816 wherein at least one aggregated evaluation metric is generated. In some embodiments, the at least one aggregated evaluation metric may be generated by summing up each of the at least one evaluation metric determined in operation 808. In some embodiments, the at least one aggregated evaluation metric may be generated by summing up each of the at least one evaluation metric determined for a left side of the vehicle, at least one evaluation metric determined for a right side of the vehicle, at least one evaluation metric determined for a front side of the vehicle, and at least one evaluation metric determined for a rear side of the vehicle.
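Operation 816 may be sketched as a per-metric sum over the four sides of the vehicle; the dictionary layout and the example counts below are illustrative only.

    def aggregate_metrics(per_side_metrics: dict) -> dict:
        # Sum each evaluation metric over the sides of the vehicle;
        # `per_side_metrics` maps a side name to a dict of metric name -> count,
        # e.g. as produced by carrying out method 200 for each side.
        aggregated: dict = {}
        for side_metrics in per_side_metrics.values():
            for name, value in side_metrics.items():
                aggregated[name] = aggregated.get(name, 0) + value
        return aggregated

    # Example with illustrative counts for the front, rear, left and right sides.
    print(aggregate_metrics({
        "front": {"potential_collisions": 1, "lane_changes": 2},
        "rear":  {"potential_collisions": 0, "lane_changes": 1},
        "left":  {"potential_collisions": 1, "lane_changes": 0},
        "right": {"potential_collisions": 0, "lane_changes": 0},
    }))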
[0108] According to some embodiments, method 800 may comprise operation 824 wherein a control signal is generated based on the at least one aggregated evaluation metric generated in operation 816. In some embodiments, the control signal may be generated when the at least one evaluation metric indicates unsafe driving behaviour. In some embodiments, the control signal may be configured to control the vehicle. In some embodiments, the control signal may be configured to generate an alert for a driver of the vehicle, stop driving functions of the vehicle, and/or adjust an autonomous driving function controlling the vehicle. [0109] Fig. 9 is a schematic illustration of a method of generating a control signal, in accordance with embodiments of the present disclosure. Method 900 of generating a control signal may be carried out by control module 180 of system 100. In some embodiments, method 900 of generating a control signal may be carried out at operation 232 of method 200 and/or at operation 824 of method 800.
[0110] According to some embodiments, method 900 may comprise operation 908 wherein a total score is generated based on the at least one evaluation metric determined in operation 232 of method 200 and/or the at least one aggregated evaluation metric generated in operation 816 of method 800. In some embodiments, the total score may be generated at various time intervals during a journey. For example, the total score may be generated every 10 minutes in a journey. In some embodiments, the total score may be generated using a single machine learning model configured to receive the at least one evaluation metric determined in operation 232 of method 200 and/or the at least one aggregated evaluation metric generated in operation 816 of method 800 and determine a total score. In some embodiments, the total score may be generated using method 1000 described in relation to Fig. 10. In some embodiments, the total score may be provided as a report to the driver or a third party.
[0111] In some embodiments, an annotated dataset may be generated to train the machine learning model configured to receive the at least one evaluation metric determined in operation 232 of method 200 and/or the at least one aggregated evaluation metric generated in operation 816 of method 800 and determine a total score. In some embodiments, the annotated dataset may comprise at least 2,500 instances. An example of the steps to generate the annotated dataset is as follows: a) Collect data by driving a data collection vehicle through multiple road environments and situations (e.g., roads and highways), during multiple time situations (e.g., peak hours, non-peak hours, morning, and night), different traffic participant densities, and different weather conditions (e.g., spring, summer, autumn, winter, rainy, sunny, and cloudy). b) Determine the at least one evaluation metric and/or at least one aggregated evaluation metric for the data collected. c) Annotate the data with a total score. The annotation may be carried out by an expert. [0112] In some embodiments, the single machine learning model may be a trained neural network with a multilayer perceptron architecture. Table 2 below illustrates an example of the architecture of a trained neural network with a multilayer perceptron architecture, wherein N represents the number of inputs into the trained neural network.
Table 2
(Table 2 is provided in the original application as an image; it sets out an example multilayer perceptron architecture, wherein N represents the number of inputs into the trained neural network.)
[0113] When training the neural network with a multilayer perceptron architecture, the activation functions may be set to be the commonly used sigmoid activation function or ReLU activation function and the weights may be randomly initialized to numbers between 0.01 and 0.1, while the biases may be randomly initialized to numbers between 0.1 and 0.9. The neural network may be adjusted by using a cost function, such as the commonly used Mean Error (ME), Mean Squared Error (MSE), Mean Absolute Error (MAE), or Root Mean Squared Error (RMSE). The neural network may be adjusted by backpropagating the cost function to update the weights and biases using an AdamOptimizer. In some embodiments, neural network may be trained with a learning rate of 0.001 and weight decay of 0.0005. In some embodiments, the neural network may be trained from scratch for 50 to 100 epochs, although the number of epochs may vary depending on the size of the training dataset and/or the size of neural network.
[0114] For example, a single machine learning model may be trained to determine a total score based on the at least one evaluation metric and/or at least one aggregated evaluation metric with the following details:
• Dataset split: 60:20:20 (training : verification : testing)
• Input: place, journey start datetime, journey end datetime, weather, other participant lane change count, potential collision event count, traffic participant count, speed of vehicle, peak or not peak hours, etc.
• Target: Total score between 1 and 100
• Pre-processing: All numerical columns including the target are normalized and one-hot encoding applied to categorical columns before passing to the network
• Output: total score (mapped from 0-1 to 1-100)
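A simplified sketch of such a scoring model, assuming PyTorch and illustrative layer sizes that do not reproduce the architecture of Table 2, is given below; the sigmoid output in (0, 1) is mapped to the 1-100 range as described above.

    import torch
    import torch.nn as nn

    class TotalScoreModel(nn.Module):
        # Multilayer perceptron over num_inputs pre-processed features
        # (normalised numerical columns plus one-hot encoded categorical columns).
        def __init__(self, num_inputs: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(num_inputs, 64), nn.ReLU(),
                nn.Linear(64, 32), nn.ReLU(),
                nn.Linear(32, 1), nn.Sigmoid(),
            )

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            return 1.0 + 99.0 * self.net(features)   # map (0, 1) onto 1-100

    model = TotalScoreModel(num_inputs=12)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)
    scores = model(torch.rand(4, 12))                # one score per sample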
[0115] According to some embodiments, method 900 may comprise operation 916 wherein a control signal is generated based on the total score generated in operation 908. In some embodiments, a control signal to control the vehicle may be generated when the total score indicates an unsafe driving behaviour. For example, where the maximum score is 100, a score below 40 may indicate unsafe driving behaviour. In some embodiments, the total score may be generated and monitored over a period of time before the control signal is generated. In some embodiments, the control signal may be generated only when the total score generated over a period of time consistently indicates an unsafe driving behaviour.
[0116] According to some embodiments, where the total score generated indicates unsafe driving behaviour, the control signal may be configured to generate an alert to the driver, generate an alert to other traffic participants, or stop and/or adjust driving functions. In some embodiments, where the total score generated indicates unsafe driving behaviour, the control signal may be configured to adjust an autonomous driving function controlling the vehicle.
[0117] Fig. 10 is a schematic illustration of a method of generating a total score, in accordance with embodiments of the present disclosure. Method 1000 of generating a control signal may be carried out by control module 180 of system 100. In some embodiments, method 1000 of generating a total score may be carried out at operation 908 of method 900.
[0118] According to some embodiments, method 1000 may comprise operation 1008 wherein a rating is determined for each of the at least one evaluation metric and/or at least one aggregated evaluation metric. In some embodiments, the rating may be determined by mapping each of the at least one evaluation metric and/or at least one aggregated evaluation metric onto a rating on a predefined scale. For example, the predefined scale may be 1 to 5 and the rating may be any value between 1 and 5. Any other predefined scale values may also be employed. In some embodiments, the mapping may be based on predefined rules that were defined manually by an expert or automatically.
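By way of example only, a manually predefined mapping rule of the kind mentioned above could be sketched as follows; the threshold values are invented for illustration and are not taken from this disclosure.

    def rate_metric(value: float, thresholds=(1, 2, 4, 6)) -> int:
        # Map a metric value (e.g. a potential collision event count) onto a
        # 1-5 rating; larger values cross more thresholds and give lower ratings.
        rating = 5
        for threshold in thresholds:
            if value > threshold:
                rating -= 1
        return rating

    print(rate_metric(0))   # 5 (no threshold exceeded)
    print(rate_metric(3))   # 3
    print(rate_metric(10))  # 1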
[0119] According to some embodiments, the at least one evaluation metric and/or at least one aggregated evaluation metric may be mapped onto a predefined scale automatically using one or more machine learning models configured to predict a rating based on the at least one evaluation metric and/or at least one aggregated evaluation metric. Examples of machine learning models include random forest, support vector machines, and neural networks. In some embodiments, separate machine learning models may be trained for different locations, such as countries, cities, and towns.
[0120] In some embodiments, an annotated dataset may be generated to train the one or more machine learning models. In some embodiments, the annotated dataset may comprise at least 2,500 instances. An example of the steps to generate the annotated dataset is as follows:
d) Collect data by driving a data collection vehicle through multiple road environments and situations (e.g., roads and highways), during multiple time situations (e.g., peak hours, non-peak hours, morning, and night), at different traffic participant densities, and under different weather conditions (e.g., spring, summer, autumn, winter, rainy, sunny, and cloudy).
e) Determine the at least one evaluation metric and/or at least one aggregated evaluation metric for the data collected.
f) Annotate the data with a rating on a predefined scale. The annotation may be carried out by an expert.
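Purely as an illustration, one annotated instance of such a dataset might be represented as follows; the field names and types are assumptions rather than a format defined by the disclosure:

```python
from dataclasses import dataclass

@dataclass
class AnnotatedInstance:
    """One record of the annotated dataset (at least 2,500 such instances)."""
    place: str                                  # e.g. "highway", "urban road"
    time_situation: str                         # e.g. "peak", "non-peak", "morning", "night"
    weather: str                                # e.g. "rainy", "sunny", "cloudy"
    traffic_participant_count: int
    potential_collision_event_count: int
    other_participant_lane_change_count: int
    expert_rating: int                          # annotation on the predefined scale, e.g. 1-5
```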
[0121] In some embodiments, the one or more machine learning models may be a trained neural network with a multilayer perceptron architecture. The architecture and training of the trained neural network with a multilayer perceptron configured to determine a total score described above in relation to Table 2 may also be employed for the one or more machine learning models configured to predict a rating based on the at least one evaluation metric and/or at least one aggregated evaluation metric.
[0122] In some embodiments, separate machine learning models may be trained for each of the at least one evaluation metric, and separate machine learning models may be trained for each of the at least one aggregated evaluation metric. For example, a first machine learning model may be trained to predict a rating based at least on a number of traffic participants with the following details (a combined training sketch covering the three example models appears after the third list below):
• Dataset split: 60:20:20 (training : verification : testing)
• Input: place, journey start datetime, journey end datetime, weather, traffic participant count, speed of vehicle, peak or non-peak hours, etc.
• Target: Rating between 1 and 5
• Pre-processing: All numerical columns, including the target, are normalized and one-hot encoding is applied to categorical columns before passing the data to the network
• Output: rating (mapped from 0-1 to 1-5)
[0123] For example, a second machine learning model may be trained to predict a rating based at least on a number of potential collision events with the following details:
• Dataset split: 60:20:20 (training : verification : testing)
• Input: place, journey start datetime, journey end datetime, weather, potential collision event count, speed of vehicle, peak or non-peak hours, etc.
• Target: Rating between 1 and 5
• Pre-processing: All numerical columns, including the target, are normalized and one-hot encoding is applied to categorical columns before passing the data to the network
• Output: rating (mapped from 0-1 to 1-5)
[0124] For example, a third machine learning model may be trained to predict a rating based at least on a number of lane changes by other traffic participants with the following details:
• Dataset split: 60:20:20 (training : verification : testing)
• Input: place, journey start datetime, journey end datetime, weather, other participant lane change count, speed of vehicle, peak or non-peak hours, etc.
• Target: Rating between 1 and 5
• Pre-processing: All numerical columns, including the target, are normalized and one-hot encoding is applied to categorical columns before passing the data to the network
• Output: rating (mapped from 0-1 to 1-5)
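A compact sketch of how the three example per-metric models above could share one training pipeline; the feature subsets, helper functions, and estimator interface are illustrative assumptions:

```python
# Per-metric feature subsets for the three example models above (illustrative names).
METRIC_FEATURES = {
    "traffic_participant_count": ["place", "journey_start", "journey_end", "weather",
                                  "traffic_participant_count", "vehicle_speed", "peak_hours"],
    "potential_collision_event_count": ["place", "journey_start", "journey_end", "weather",
                                        "potential_collision_event_count", "vehicle_speed", "peak_hours"],
    "other_participant_lane_change_count": ["place", "journey_start", "journey_end", "weather",
                                            "other_participant_lane_change_count", "vehicle_speed", "peak_hours"],
}

def train_per_metric_models(df, build_model, prepare):
    """Train one rating model per evaluation metric.

    `build_model` returns a fresh, untrained scikit-learn-style estimator;
    `prepare` is assumed to apply the normalisation/one-hot encoding and the
    60:20:20 split, returning (X_train, y_train, X_verif, y_verif, X_test, y_test).
    """
    models = {}
    for metric, feature_cols in METRIC_FEATURES.items():
        X_train, y_train, *_ = prepare(df, feature_cols, target="rating")
        models[metric] = build_model().fit(X_train, y_train)
    return models

def map_output_to_rating(network_output: float) -> float:
    """Map a model's 0-1 output onto the 1-5 rating scale."""
    return 1.0 + network_output * 4.0
```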
[0125] It is emphasised that the architecture, training method, and parameters of the machine learning model described above are only examples, and any other appropriate architecture, training method, and parameters may be employed.
[0126] According to some embodiments, method 1000 may comprise operation 1016 wherein the rating for each of the at least one evaluation metric and/or at least one aggregated evaluation metric is weighted to generate a weighted score. In some embodiments, the weighting of the rating of each of the at least one evaluation metric and/or at least one aggregated evaluation metric may be based on predefined weights.
[0127] According to some embodiments, method 1000 may comprise operation 1024 wherein a total score is generated based on the weighted score. In some embodiments, the total score may be generated by summing up the weighted scores generated in operation 1016. In some embodiments, the total score may be generated at various time intervals during a journey. For example, the total score may be generated every 10 minutes in a journey. In some embodiments, the total score may be mapped onto a predefined rating scale. In some embodiments, the output from operations 1008, 1016, and 1024 may be provided as a report to the driver or a third party.
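A minimal sketch of operations 1016 and 1024; the metric names and weight values are placeholders assumed for illustration, not weights from the disclosure:

```python
# Illustrative predefined weights for the rated metrics (placeholders only).
PREDEFINED_WEIGHTS = {
    "traffic_participant_count": 0.3,
    "potential_collision_event_count": 0.5,
    "other_participant_lane_change_count": 0.2,
}

def total_score_from_ratings(ratings: dict) -> float:
    """Weight each metric's rating (operation 1016) and sum the weighted
    scores into a total score (operation 1024)."""
    weighted_scores = {m: ratings[m] * w for m, w in PREDEFINED_WEIGHTS.items()}
    return sum(weighted_scores.values())

# Example call, e.g. repeated every 10 minutes during a journey.
ratings = {"traffic_participant_count": 4,
           "potential_collision_event_count": 5,
           "other_participant_lane_change_count": 3}
print(total_score_from_ratings(ratings))  # 4*0.3 + 5*0.5 + 3*0.2 = 4.3
```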
[0128] Table 3 below illustrates an example of the output from operations 908, 916, and 924 of method 900.
Table 3
[0129] Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the embodiments of the present invention are intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.

Claims

1. A computer-implemented method for evaluation of the driving of an autonomous or a non-autonomous driving vehicle, the method comprising: receiving a plurality of images captured by a plurality of sensors, wherein the plurality of sensors is mounted on a vehicle and positioned to capture a scene on a side of the vehicle; selecting a representative sensor pose for the side of the vehicle based at least on the plurality of images; and determining at least one evaluation metric based on at least one image associated with the representative sensor pose.
2. The computer-implemented method of claim 1, wherein the plurality of sensors captures at least a 180° field of view of the side of the vehicle, and/or each sensor of the plurality of sensors has an overlapping field of view with a neighbouring sensor, and/or each sensor of the plurality of sensors has a field of view of at least 60°.
3. The computer-implemented method of any of the preceding claims, wherein selecting a representative sensor pose for the side of the vehicle is further based on at least one novel image of the scene on the side of the vehicle.
4. The computer-implemented method of claim 3, wherein the at least one novel image of the scene on the side of the vehicle is generated by: receiving depth data corresponding to the plurality of images captured by the plurality of sensors; generating a representation of the scene on the side of the vehicle based on the received plurality of images and corresponding depth data, wherein the representation accounts for space and time; and generating at least one novel image of the scene on the side of the vehicle based on the representation of the scene on the side of the vehicle.
5. The computer-implemented method of any of the preceding claims, wherein selecting a representative sensor pose for the side of the vehicle comprises: identifying a plurality of traffic participants in the plurality of images and/or at least one novel image; and selecting, from the plurality of images and/or at least one novel image, a representative image, wherein the representative sensor pose is a sensor pose associated with the representative image, and wherein preferably the representative image is the image that comprises the largest number of traffic participants.
6. The computer-implemented method of any of the preceding claims, wherein the at least one evaluation metric is based on a complexity of the scene at the side of the vehicle and one or more potential collision events.
7. The computer-implemented method of any of the preceding claims, wherein the at least one evaluation metric comprises one or more of: a number of potential collision events, a number of traffic participants, or a number of lane changes by other traffic participants.
8. The computer-implemented method of any of the preceding claims, wherein the at least one evaluation metric comprises a number of potential collision events, wherein a potential collision event is determined based on a time to near-collision for each traffic participant, wherein determining a number of potential collision events preferably comprises: receiving consecutive images associated with the representative sensor pose; predicting a time to near-collision for each traffic participant, more preferably using an image-based collision prediction model; and determining a number of potential collision events.
9. The computer-implemented method of any of the preceding claims, wherein the at least one evaluation metric comprises a number of lane changes by other traffic participants, wherein determining a number of lane changes by other traffic participants preferably comprises: receiving consecutive images associated with the representative sensor pose; predicting a lane change behaviour for each traffic participant, more preferably using a lane change model; and determining a number of lane changes by other traffic participants.
10. A computer-implemented method for evaluation of the driving of an autonomous or a non-autonomous driving vehicle, the method comprising: carrying out the computer-implemented method of any of the preceding claims on each side of the vehicle to determine at least one evaluation metric for each side of the vehicle; and generating at least one aggregated evaluation metric.
11. The computer-implemented method of any of the preceding claims, further comprising: generating a control signal to control the vehicle based on the at least one evaluation metric and/or at least one aggregated evaluation metric, wherein the control signal is preferably configured to: generate an alert for a driver of the vehicle; generate an alert for other traffic participants; stop and/or adjust driving functions of the vehicle; and/or adjust an autonomous driving function controlling the vehicle.
12. A system comprising at least one plurality of sensors, at least one processor and a memory that stores executable instructions for execution by the at least one processor, the executable instructions comprising instructions for performing a computer-implemented method according to any of the preceding claims.
13. A vehicle comprising the system of claim 12, wherein the system comprises four pluralities of sensors mounted on the vehicle and positioned to capture a scene surrounding the vehicle.
14. A computer program, a machine-readable storage medium, or a data carrier signal that comprises instructions, that upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of a computer-implemented method according to any one of claims 1 to 11.
PCT/EP2023/073106 2022-10-05 2023-08-23 System and method for evaluation of the driving of a vehicle WO2024074246A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2214623.7 2022-10-05
GB2214623.7A GB2623296A (en) 2022-10-05 2022-10-05 System and method for evaluation of the driving of a vehicle

Publications (1)

Publication Number Publication Date
WO2024074246A1 true WO2024074246A1 (en) 2024-04-11

Family

ID=84000237

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2023/073106 WO2024074246A1 (en) 2022-10-05 2023-08-23 System and method for evaluation of the driving of a vehicle

Country Status (2)

Country Link
GB (1) GB2623296A (en)
WO (1) WO2024074246A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021144689A (en) * 2020-03-12 2021-09-24 株式会社豊田中央研究所 On-vehicle sensing device and sensor parameter optimization device
US20220111794A1 (en) * 2019-01-31 2022-04-14 Mitsubishi Electric Corporation Driving support device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10685239B2 (en) * 2018-03-18 2020-06-16 Tusimple, Inc. System and method for lateral vehicle detection
JP7052632B2 (en) * 2018-08-10 2022-04-12 トヨタ自動車株式会社 Peripheral display device for vehicles
US11532168B2 (en) * 2019-11-15 2022-12-20 Nvidia Corporation Multi-view deep neural network for LiDAR perception
US11762094B2 (en) * 2020-03-05 2023-09-19 Uatc, Llc Systems and methods for object detection and motion prediction by fusing multiple sensor sweeps into a range view representation

Also Published As

Publication number Publication date
GB2623296A (en) 2024-04-17
GB202214623D0 (en) 2022-11-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23761114

Country of ref document: EP

Kind code of ref document: A1