GB2621601A - System and method for evaluation of the driving of a driver operating a vehicle - Google Patents

System and method for evaluation of the driving of a driver operating a vehicle

Info

Publication number
GB2621601A
GB2621601A GB2211974.7A GB202211974A GB2621601A GB 2621601 A GB2621601 A GB 2621601A GB 202211974 A GB202211974 A GB 202211974A GB 2621601 A GB2621601 A GB 2621601A
Authority
GB
United Kingdom
Prior art keywords
scene
image
vehicle
computer
textual description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2211974.7A
Other versions
GB202211974D0 (en)
Inventor
Rajendran Vinod
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Continental Automotive Technologies GmbH
Original Assignee
Continental Automotive Technologies GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Continental Automotive Technologies GmbH filed Critical Continental Automotive Technologies GmbH
Priority to GB2211974.7A priority Critical patent/GB2621601A/en
Publication of GB202211974D0 publication Critical patent/GB202211974D0/en
Priority to PCT/EP2023/068485 priority patent/WO2024037776A1/en
Publication of GB2621601A publication Critical patent/GB2621601A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/59Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of traffic signs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/584Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
    • GPHYSICS
    • G07CHECKING-DEVICES
    • G07CTIME OR ATTENDANCE REGISTERS; REGISTERING OR INDICATING THE WORKING OF MACHINES; GENERATING RANDOM NUMBERS; VOTING OR LOTTERY APPARATUS; ARRANGEMENTS, SYSTEMS OR APPARATUS FOR CHECKING NOT PROVIDED FOR ELSEWHERE
    • G07C5/00Registering or indicating the working of vehicles
    • G07C5/08Registering or indicating performance data other than driving, working, idle, or waiting time, with or without registering driving, working, idle or waiting time
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle
    • G06T2207/30256Lane; Road marking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle
    • G06T2207/30261Obstacle

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Traffic Control Systems (AREA)

Abstract

A computer-implemented method for evaluating the driving of a driver operating a vehicle comprises receiving at least one image 116 captured by at least one camera 104 mounted on the vehicle and positioned to capture at least one image of a scene surrounding the vehicle; generating a scene textual description 138 for each image; and evaluating at least one evaluation metric by comparing the generated scene textual description against sensor data 120 captured by at least one sensor 108. The sensor 108 may be a further camera, radar, lidar, GPS systems, proximity sensors, turn indicator sensors or a speedometer all mounted on or part of the vehicle. Otherwise, the sensor may be external to the vehicle, such as mounted on another vehicle or infrastructure. Scene textual description 138 may comprise sentences describing the scene, including descriptions of objects and/or traffic participants (such as vehicles, traffic signs, pedestrians, etc.). The evaluation metric may relate to traffic light compliance, traffic sign compliance, or lane compliance. The invention provides an evaluation of a driver’s skill considering the driver’s interaction with other vehicle drivers and compliance with traffic rules and regulations, which are all linked to the safety of a driver on the roads.

Description

SYSTEM AND METHOD FOR EVALUATION OF THE DRIVING OF A DRIVER OPERATING A VEHICLE
TECHNICAL FIELD
[0001] The invention relates generally to driver evaluation, and more specifically to a system and method for evaluation of the driving of a driver operating a vehicle for a particular trip
based on scene description.
BACKGROUND
[0002] With an increasing number of vehicles on the road, there is an increasing need to evaluate the skills of each driver to ensure safety. There are several methods to evaluate the skills of a driver and systems have been developed to provide drivers with a feedback or evaluation of their driving skills, as well as to provide information as to how their driving can be improved. One method of evaluating driving skills is through analysis of a driver's skill while turning. As turning requires a driver to perform complicated manipulations such as manipulating the brake pedal, steering wheel, and accelerator pedal at the appropriate time, intensity, and angle, turning has been used in prior art to evaluate the driver's skills.
Other methods to evaluate the driver's skill include a comparison between a threshold and a vector calculated by synthesizing the longitudinal acceleration and the lateral acceleration.
[0003] The methods described tend to analyse and evaluate a driver's skill by considering how the driver manipulates the vehicle. However, a driver's skill cannot be considered in isolation as many factors contribute to a driver's skill and safety level on the road, including their compliance with traffic signs, lights, rules, and interactions with other road users. In addition, a driver's skill cannot be considered at a single timepoint based on a specific manoeuvre by the driver. An accurate evaluation of a driver's skill is therefore not achieved in existing methods which focus on the driver's skill in isolation and based on a single manoeuvre by the driver.
SUMMARY
[0004] It is an object of the present invention to provide an improved method and system for the evaluation of the driving of a driver operating a vehicle by evaluating the driver's skill in a holistic way that accounts for the driver's overall compliance with traffic rules and/or interaction with other road users or traffic participants in relation to a particular trip by the driver.
[0005] The object of the present invention is solved by the subject-matter of the independent claims, wherein further embodiments are incorporated into the dependent claims.
[0006] It shall be noted that all embodiments of the present invention concerning a method might be carried out with the order of the steps as described; nevertheless, this does not have to be the only and essential order of the steps of the method. The herein presented methods can be carried out with another order of the disclosed steps without departing from the respective method embodiment, unless explicitly mentioned to the contrary hereinafter.
[0007] To solve the above technical problems, the present invention provides a computer-implemented method for evaluation of the driving of a driver operating a vehicle, the method comprising: receiving at least one image captured by at least one camera mounted on the vehicle and positioned to capture at least one image of a scene surrounding the vehicle; generating a scene textual description for each of the at least one images; and evaluating at least one evaluation metric by comparing the generated scene textual description against at least one sensor data captured by at least one sensor.
[0008] The computer-implemented method of the present invention is advantageous over known methods as the method provides a better evaluation of a driver's skill by considering the driver's interaction with other vehicle drivers, along with compliance with traffic rules and regulations through the use of scene textual descriptions and quantitative sensor data.
[0009] A preferred method of the present invention is a computer-implemented method as described above, wherein the at least one sensor is different from the at least one camera.
[0010] The above-described aspect of the present invention has the advantage that using at least one sensor separate from the at least one camera increases the accuracy of the method as the driver's skill is evaluated and verified based on input from two or more separate sensors.
[0011] A preferred method of the present invention is a computer-implemented method as described above, wherein evaluating at least one evaluation metric comprises comparing each generated scene textual description against at least one sensor data captured within a predefined time margin of or at the same timepoint as the received at least one image used to generate said scene textual description.
[0012] The above-described aspect of the present invention has the advantage that comparing the generated scene textual description against sensor data captured within a predefined time margin of the received at least one image used to generate said scene textual description allows for an offset in the temporal alignment between the camera and the at least one sensor, while comparing the generated scene textual description against sensor data captured at the same timepoint as the received at least one image used to generate said scene textual description allows an accurate assessment at the point in time where the at least one image was captured by the camera.
[0013] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein after receiving the at least one image and before generating the scene textual description, the method further comprises: generating a plurality of novel images of the scene based on the received at least one image, wherein preferably the plurality of novel images comprises all possible viewpoints of the scene; and generating the scene textual description is also carried out on the plurality of novel images.
[0014] The above-described aspect of the present invention has the advantage that the method is able to generate images with perspectives of unseen poses or dimensions for a more comprehensive evaluation of the driver's skill as compared to only relying on images captured by the at least one camera.
[0015] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein generating a plurality of novel images of the scene comprises using pixel Neural Radiance Field (pixelNeRF) or Single
View Neural Radiance Field (SinNeRF).
[0016] The above-described aspect of the present invention has the advantage that pixelNeRF and SinNeRF are able to generate novel images from a plurality of viewpoints based on a minimal number of images as pixelNeRF and SinNeRF utilise pretrained neural networks or models. Usage of pixelNeRF and SinNeRF also allows the utilisation of fewer cameras as compared to a traditional Neural Radiance Field (NeRF) to capture images while still retaining the ability to generate novel images.
[0017] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the pixelNeRF or SinNeRF is trained on a training dataset generated by a first set of cameras mounted on a left of a data collection vehicle, a second set of cameras mounted on a right of the data collection vehicle, a third set of cameras mounted on a front of the data collection vehicle, and/or a fourth set of cameras mounted on a rear of the data collection vehicle.
[0018] The above-described aspect of the present invention has the advantage that the training dataset to train the pixelNeRF or SinNeRF may be collected using a single data collection vehicle and would contain sufficient information to condition the pixelNeRF or SinNeRF prior to implementation, such that the pixelNeRF or SinNeRF conditioned on the training dataset is robust and is able to generate novel images from a plurality of viewpoints based on a single image or a small number of images.
[0019] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein each set of cameras comprises at least five cameras mounted equidistant from each other, wherein preferably each set of cameras has a centre camera with a horizontal field of view greater than 170 degrees and a vertical field of view greater than 130 degrees and the other cameras have fields of view that are within the horizontal field of view and vertical field of view of the centre camera.
[0020] The above-described aspect of the present invention has the advantage that the incorporation of at least five cameras mounted equidistant from each other would generate an appropriate dataset for conditioning the pixelNeRF or SinNeRF prior to implementation. Usage of a centre camera with a horizontal field of view greater than 170 degrees and a vertical field of view greater than 130 degrees is advantageous to capture more information for the conditioning of the pixelNeRF or SinNeRF. Ensuring that the other cameras have fields of view that are within the horizontal field of view and vertical field of view of the centre camera is advantageous as irrelevant information outside of the field of view of the centre camera is excluded.
[0021] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein generating a scene textual description for each of the at least one image comprises: passing each of the at least one image through an encoder neural network to generate an intermediate representation of each of the at least one image, wherein the encoder neural network is preferably a convolutional neural network (CNN) and the intermediate representation is preferably at least one feature vector, and wherein more preferably the encoder neural network is a convolutional neural network having a Visual Geometry Group (VGG) architecture; and passing the intermediate representation through a decoder neural network, preferably a decoder neural network using at least a part of its previous output as at least a part of its next input, more preferably a recurrent neural network (RNN), and even more preferably a recurrent neural network with long short-term memory architecture (LSTM), to generate a sequence of caption words for each of the at least one image, wherein the sequence of caption words make up the scene textual description for the at least one image.
[0022] The above-described aspect of the present invention has the advantage that the generated scene textual description may be used to provide an accurate representation of a scene and what is happening around the driver. A convolutional neural network is preferred as the encoder neural network as a CNN may be able to extract informative features from the training data without the need for manual processing of the training data. The CNN may produce accurate results where large unstructured data is involved, such as image classification, speech recognition and natural language processing. Also, a CNN is computationally efficient as a CNN is able to assemble patterns of increasing complexity using the relatively small kernels in each hidden layer. VGG architecture is even more preferred as it can produce accurate and fast results once it has been pretrained. The small-size convolutional filters of the VGG allow the VGG to have a large number of weight layers which leads to improved performance. A decoder neural network using at least a part of its previous output as at least a part of its next input is preferred to generate a sequence of caption words as words in a sentence are dependent on the words that come before them. A recurrent neural network (RNN) is more preferred for the decoder neural network as an RNN can model time sequences or sequential information so that each pattern may be assumed to be dependent on previous records. Long short-term memory (LSTM) is even more preferred as LSTM solves the problem of vanishing or exploding gradients in regular RNNs, and the LSTM-RNN is capable of learning lengthy-time-period dependencies due to its ability to maintain information in memory for long periods of time.
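By way of a purely illustrative, non-limiting sketch of the encoder-decoder captioning pipeline outlined in paragraphs [0021] and [0022], the following assumes PyTorch and torchvision; the VGG-16 backbone, the embedding and hidden sizes, and the greedy decoding loop are assumptions for illustration rather than a reproduction of the claimed system.
    import torch
    import torch.nn as nn
    import torchvision

    class SceneCaptioner(nn.Module):
        """Encoder-decoder image captioning: VGG-16 encoder, LSTM decoder (illustrative)."""

        def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
            super().__init__()
            vgg = torchvision.models.vgg16(weights=None)           # pretrained weights would be loaded in practice
            self.encoder = nn.Sequential(vgg.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.project = nn.Linear(512, embed_dim)               # intermediate representation -> embedding space
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.vocab_head = nn.Linear(hidden_dim, vocab_size)

        def generate(self, image, start_token, end_token, max_len=20):
            # Greedy decoding: part of the previous output (the predicted word) is the next input.
            feat = self.project(self.encoder(image)).unsqueeze(1)  # (B, 1, embed_dim) feature vector
            _, state = self.decoder(feat)                          # condition the LSTM on the image
            word = torch.full((image.shape[0], 1), start_token, dtype=torch.long, device=image.device)
            words = []
            for _ in range(max_len):
                out, state = self.decoder(self.embed(word), state)
                word = self.vocab_head(out).argmax(dim=-1)         # most probable next caption word
                words.append(word)
                if (word == end_token).all():
                    break
            return torch.cat(words, dim=1)                         # sequence of caption word indices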
[0023] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the at least one evaluation metric comprises one or more of: traffic light compliance, traffic sign compliance, or lane compliance.
[0024] The above-described aspect of the present invention has the advantage that the usage of evaluation metrics such as traffic light compliance, traffic sign compliance, or lane compliance provides a better evaluation of a driver's skill by considering the driver's interaction with other vehicle drivers, along with compliance with traffic rules and regulations, all of which are linked to the safety of a driver on the roads.
[0025] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein evaluating at least one evaluation metric comprises: performing entity recognition on the scene textual description to identify one or more entities mentioned in the scene textual description; and assigning a value to each of the one or more identified entities based on the at least one sensor data, wherein the at least one sensor data is predefined for each of the one or more identified entities.
[0026] The above-described aspect of the present invention has the advantage that one or more entities are able to be analysed and evaluated simultaneously based on a single scene textual description generated for an image.
[0027] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein assigning a value to each of the one or more identified entities comprises: identifying keywords in the scene textual description associated with each of the one or more identified entities based on a set of keywords predefined for each of the one or more identified entities; comparing the at least one sensor data predefined for each of the one or more identified entities against a threshold predefined for each keyword; and assigning a value to each of the one or more identified entities based on the comparison of the at least one predefined sensor data against the predefined threshold.
[0028] The above-described aspect of the present invention has the advantage that the analysis may be more accurate due to the usage of keywords and thresholds which have been predefined by an expert, as compared to methods in systems wherein the keywords and thresholds are learned by the system itself.
[0029] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the evaluation metrics are determined over a journey, preferably at fixed time intervals, and the evaluation metrics are aggregated to determine a journey score for each entity.
[0030] The above-described aspect of the present invention has the advantage that the evaluation of the driver's skill is more comprehensive as the driver's skill is analysed over a journey, and the score would not be largely affected by one or two incidents which may be aberrations.
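The following is a minimal, purely illustrative sketch of the keyword-and-threshold evaluation of paragraphs [0025] to [0027], assuming Python. The entity names, keyword sets, mapped sensor channels and thresholds are hypothetical examples of the kind an expert might predefine; substring matching stands in for entity recognition, and a single threshold per entity (rather than per keyword) is a simplification.
    # Hypothetical rules: entity -> keywords, the sensor data predefined for it, and a threshold.
    ENTITY_RULES = {
        "traffic light": {"keywords": {"red", "stop"}, "sensor": "speed_kmh", "threshold": 5.0},
        "speed limit sign": {"keywords": {"speed limit"}, "sensor": "speed_kmh", "threshold": 50.0},
    }

    def evaluate_scene(scene_text, sensor_data):
        """Assign 1 (compliant) or 0 (non-compliant) to each entity found in the scene textual description."""
        text = scene_text.lower()
        values = {}
        for entity, rule in ENTITY_RULES.items():
            if entity in text and any(keyword in text for keyword in rule["keywords"]):
                values[entity] = 1 if sensor_data[rule["sensor"]] <= rule["threshold"] else 0
        return values

    # Example: the description mentions a red traffic light and the vehicle is almost stationary.
    print(evaluate_scene("The vehicle is stopping at a red traffic light.", {"speed_kmh": 3.0}))
    # -> {'traffic light': 1}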
[0031] The above-described advantageous aspects of a computer-implemented method of the invention also hold for all aspects of a below-described system of the invention. All below-described advantageous aspects of a system of the invention also hold for all aspects of an above-described computer-implemented method of the invention.
[0032] The invention also relates to a system comprising at least one camera, at least one sensor, at least one processor and a memory that stores executable instructions for execution by the at least one processor, the executable instructions comprising instructions for performing the computer-implemented method of the invention.
[0033] The above-described advantageous aspects of a computer-implemented method or system of the invention also hold for all aspects of a below-described vehicle of the invention. All below-described advantageous aspects of a vehicle of the invention also hold for all aspects of an above-described computer-implemented method or system of the invention.
[0034] The invention also relates to a vehicle comprising the system of the invention, wherein the at least one camera is mounted on the vehicle and positioned to capture at least one image of a scene surrounding the vehicle.
[0035] The above-described advantageous aspects of a computer-implemented method, system, or vehicle of the invention also hold for all aspects of below-described computer program, a machine-readable storage medium, or a data carrier signal of the invention. All below-described advantageous aspects of a computer program, a machine-readable storage medium, or a data carrier signal of the invention also hold for all aspects of an above-described computer-implemented method, system, or vehicle of the invention.
[0036] The invention also relates to a computer program, a machine-readable storage medium, or a data carrier signal that comprises instructions, that upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of a computer-implemented method according to the invention. The machine-readable medium may include any medium and/or mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device). The machine-readable medium may be any medium, such as for example, read-only memory (ROM); random access memory (RAM); a universal serial bus (USB) stick; a compact disc (CD); a digital video disc (DVD); a data storage device; a hard disk; electrical, acoustical, optical, or other forms of propagated signals (e.g., digital signals, data carrier signal, carrier waves), or any other medium on which a program element as described above can be transmitted and/or stored.
[0037] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term "camera" means any device that captures images, including video cameras, the camera comprising one or more visible light sensors that convert optical signals into electrical signals to capture an image when exposed.
[0038] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term "sensor" includes any sensor that detects or responds to some type of input from a perceived environment or scene. Examples of sensors include cameras, video cameras, LiDAR sensors, radar sensors, depth sensors, light sensors, colour sensors, or red, green, blue, and distance (RGBD) sensors.
[0039] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term "sensor data" means the output or data of a device, also known as a sensor, that detects and responds to some type of input from the physical environment.
[0040] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term "scene" refers to a distinct physical environment that may be captured by one or more sensors. A scene may include one or more objects that may be captured by one or more sensors, whether such object is stationary, static, or mobile.
BRIEF DESCRIPTION OF THE DRAWINGS
[0041] These and other features, aspects, and advantages will become better understood with regard to the following description, appended claims, and accompanying drawings where:
[0042] Fig. 1 is a schematic illustration of a system for evaluation of the driving of a driver operating a vehicle, in accordance with embodiments of the present disclosure,
[0043] Fig. 2 is a flowchart of a method for evaluation of the driving of a driver operating a vehicle, in accordance with embodiments of the present disclosure,
[0044] Fig. 3 is a schematic illustration of a set of cameras mounted on a data collection vehicle, in accordance with embodiments of the present disclosure,
[0045] Fig. 4 is a pictorial representation of an image captioning model that may be used to generate a scene textual description based on an input image, in accordance with embodiments of the present disclosure,
[0046] Fig. 5 is a flowchart of a method for evaluation of at least one evaluation metric, in accordance with embodiments of the present disclosure,
[0047] Fig. 6 is a flowchart of a method for assigning a value to each of the one or more identified entities, in accordance with embodiments of the present disclosure,
[0048] Figs. 7A to 7D are examples of images captured by a camera, in accordance with embodiments of the present disclosure, and
[0049] Fig. 8 is a schematic illustration of a computer system within which a set of instructions, when executed, may cause one or more processors of the computer system to perform one or more of the methods described herein, in accordance with embodiments of the present disclosure.
[0050] In the drawings, like parts are denoted by like reference numerals.
[0051] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION
[0052] In the summary above, in this description, in the claims below, and in the accompanying drawings, reference is made to particular features (including method steps) of the invention. It is to be understood that the disclosure of the invention in this specification includes all possible combinations of such particular features. For example, where a particular feature is disclosed in the context of a particular aspect or embodiment of the invention, or a particular claim, that feature can also be used, to the extent possible, in combination with and/or in the context of other particular aspects and embodiments of the invention, and in the inventions generally.
[0053] In the present document, the word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment or implementation of the present subject matter described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
[0054] While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described in detail below. It should be understood, however, that it is not intended to limit the disclosure to the forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure.
[0055] The present disclosure is directed to methods, systems, vehicles, computer programs, machine-readable storage media, and data carrier signals for evaluation of the driving of a driver operating a vehicle. Evaluation of the driving of a driver operating a vehicle is carried out by generating scene textual description for at least one image captured by at least one camera mounted on the vehicle and evaluating at least one evaluation metric by comparing the generated scene textual description against at least one sensor data captured by at least one sensor. In some embodiments, the generated scene textual description may be compared against at least one sensor data captured within a predefined time margin of the received at least one image used to generate the scene textual description. In some embodiments, the generated scene textual description may be compared against at least one sensor data captured at the same timepoint as the received at least one image used to generate the scene textual description. In some embodiments, novel images from a plurality of viewpoints may be generated for a more comprehensive evaluation of the driver's skill. In some embodiments, the evaluation results are aggregated over a journey for an even more comprehensive evaluation of the driver's skill.
[0056] The following description sets forth exemplary methods, parameters, and the like. It should be recognized, however, that such description is not intended as a limitation on the scope of the present disclosure but is instead provided as a description of exemplary embodiments.
[0057] Although the following description uses the terms "first", "second", etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another. For example, a first set of cameras could be termed a second set of cameras, and, similarly, a third set of cameras could be termed a first set of cameras, without departing from the scope of the various described embodiments. The first set of cameras, the second set of cameras, the third set of cameras, and the fourth set of cameras are all sets of cameras, but they are not the same set of cameras.
[0058] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that on-going technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. The terms "comprises", "comprising", "includes" or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a setup, device or method that includes a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup or device or method. In other words, one or more elements in a system or apparatus preceded by "comprises... a" does not, without more constraints, preclude the existence of other elements or additional elements in the system or method. It must also be noted that as used herein and in the appended claims, the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise.
[0059] Fig. 1 is a schematic illustration of a system for evaluation of the driving of a driver operating a vehicle, in accordance with embodiments of the present disclosure. System 100 for evaluation of the driving of a driver operating a vehicle may comprise at least one camera 104, at least one sensor 108, and at least one processor 112. In some embodiments, the at least one camera 104 is mounted on the vehicle and positioned to capture at least one image 116 of a scene surrounding the vehicle. In some embodiments, the at least one camera 104 may be mounted on a front of the vehicle, a rear of the vehicle, a left of the vehicle, a right of the vehicle, or any combination thereof. In some embodiments, the at least one camera 104 may capture the at least one image 116 continuously over a journey. In some embodiments, the at least one camera 104 may capture the at least one image 116 at fixed time intervals over a journey. In some embodiments, the at least one sensor 108 may be any sensor, including cameras, radar, lidar, vehicle sensors such as proximity sensors, distance sensors, turn indicator activation sensors, GPS systems, and speedometers, as well as road sensors. In some embodiments, the at least one sensor 108 may generate at least one sensor data 120. In some embodiments, the at least one sensor 108 may comprise the at least one camera 104. In other embodiments, the at least one sensor 108 may be different from the at least one camera 104. In some embodiments, the at least one sensor 108 may be mounted on the vehicle. In some embodiments, the at least one sensor 108 may be mounted on other vehicles and/or infrastructure such that the at least one sensor data 120 is received by the vehicle through vehicle-to-vehicle (V2V) or vehicle-to-everything (V2X) communication.
[0060] According to some embodiments, the at least one image 116 captured by the at least one camera 104 and the at least one sensor data 120 captured by the at least one sensor 108 may be received and processed by the at least one processor 112. In some embodiments, the at least one processor 112 may receive the at least one image 116 from the at least one camera 104 and/or the at least one sensor data 120 from the at least one sensor 108. In some embodiments, the at least one image 116 captured by the at least one camera 104 and/or the at least one sensor data 120 captured by the at least one sensor 108 may be stored on a database 124 with a timestamp associated with each of the at least one image 116 and/or at least one sensor data 120 indicating the time at which the at least one image 116 and/or at least one sensor data 120 was captured. In some embodiments, the time indicated may be the time elapsed since the driver started driving the vehicle. In some embodiments, the time indicated may be a point in time as measured in hours past midnight or noon.
[0061] According to some embodiments, the at least one processor 112 may comprise several modules, such as an image processing module 128 and an evaluation module 132. In some embodiments, image processing module 128 may receive the at least one image 116 as input. In some embodiments, the image processing module 128 may be coupled to the evaluation module 132, for example, by manner of one or both of wired coupling and wireless coupling. In some embodiments, the evaluation module 132 may receive the at least one sensor data 120 and the output of the image processing module 128 as input.
[0062] According to some embodiments, the image processing module 128 may comprise a scene description generator module 136 which receives at least the at least one image 116 as input and generates as output a scene textual description 138 for each of the at least one image 116 based at least on the at least one image 116. Scene textual description 138 may comprise one or more sentences describing the scene, including but not restricted to descriptions of any objects and/or traffic participants within the scene, such as other vehicles, traffic signs, traffic lights, lane markings, pedestrians, and motorcyclists, as well as the relationship between the objects, traffic participants and/or the vehicle on which system 100 is mounted on.
[0063] According to some embodiments, scene textual description 138 generated by the scene description generator module 136 may be evaluated by evaluation module 132 by comparing the generated scene textual description 138 against the at least one sensor data 120 captured by at least one sensor 108. This may provide an accurate evaluation of a driver's skill as it considers the driver's interaction with other vehicle drivers or traffic participants, along with compliance with traffic rules and regulations through the use of scene textual descriptions and quantitative sensor data. In some embodiments, each generated scene textual description 138 may be compared against at least one sensor data 120 captured within a predefined time margin of the received at least one image 116 used to generate said scene textual description. An example of a predefined time margin is ± 5 seconds, wherein the generated scene textual description 138 may be compared against at least one sensor data captured within ± 5 seconds from the timepoint at which the at least one image 116 used to generate said scene textual description 138 was captured. In some embodiments, each generated scene textual description 138 may be compared against at least one sensor data 120 captured at the same timepoint as the received at least one image 116 used to generate said scene textual description 138. It is contemplated that, in accordance with an embodiment of this disclosure, the evaluation of the driver's skill can, for example, be communicated via an interactive dashboard (e.g., an electronic dashboard module) which can, for example, facilitate the provision of a report and/or feedback to the driver. It is further contemplated that, in accordance with an embodiment of this disclosure, the evaluation of the driver's skill can, for example, be communicated to a third-party service provider and/or other persons of interest.
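A minimal sketch of the time-margin matching described in the preceding paragraph, assuming Python, is given below; the timestamps (seconds since the start of the trip), the structure of the sensor log and the ± 5 second margin are illustrative assumptions.
    def sensor_data_within_margin(image_timestamp, sensor_records, margin_s=5.0):
        """Return the sensor readings captured within the predefined time margin of the image."""
        return [r for r in sensor_records if abs(r["timestamp"] - image_timestamp) <= margin_s]

    sensor_log = [
        {"timestamp": 118.2, "speed_kmh": 42.0},
        {"timestamp": 121.7, "speed_kmh": 12.0},
        {"timestamp": 131.0, "speed_kmh": 0.0},
    ]
    # Image captured at t = 120 s: only the first two readings are compared against its description.
    print(sensor_data_within_margin(120.0, sensor_log))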
[0064] According to some embodiments, the image processing module 128 may further comprise a novel view synthesis module 140. The novel view synthesis module 140 may be coupled to the scene description generator module 136, for example, by manner of one or both of wired coupling and wireless coupling. The novel view synthesis module 140 may receive the at least one image 116 as input and generate as output a plurality of novel images 144 of the scene based on each of the at least one image 116. In some embodiments, the plurality of novel images 144 may comprise one, some, or all possible viewpoints of the scene to get perspectives of unseen poses or dimensions to increase the comprehensiveness in evaluation of the driver's skill. In some embodiments, the plurality of novel images 144 generated by the novel view synthesis module 140 may be passed onto the scene description generator module 136 which may generate a scene textual description 138 for each of the plurality of novel images 144. In some embodiments, the generated scene textual description 138 may be evaluated by the evaluation module 132 by comparing the generated scene textual description against the at least one sensor data 120 captured by at least one sensor 108. In some embodiments, each generated textual description may be compared against at least one sensor data 120 captured within a predefined time margin of the received at least one image 116 used to generate the plurality of novel images 144 and scene textual description 138. In some embodiments, each generated textual description may be compared against at least one sensor data 120 captured at the same timepoint as the received at least one image 116 used to generate the plurality of novel images 144 and scene textual
description 138.
[0065] For the sake of convenience, the operations of the present disclosure are described as interconnected functional blocks or distinct software modules. This is not necessary, however, and there may be cases where these functional blocks or software modules are equivalently aggregated into a single logic device, program, or operation with unclear boundaries. In any event, the functional blocks and software modules or described features can be implemented by themselves, or in combination with other operations in either hardware or software. Each of the modules described may correspond to one or both of a hardware-based module and a software-based module. The present disclosure contemplates the possibility that each of the modules described may be an integrated software-hardware based module (e.g., an electronic part which can carry a software program/algorithm in association with receiving and processing functions/an electronic module programmed to perform the functions of receiving, processing and/or transmitting). The present disclosure yet further contemplates the possibility that each of the modules described can be an integrated hardware module (e.g., a hardware-based transceiver) capable of performing the functions of receiving, processing and/or transmitting.
[0066] Fig. 2 is a flowchart of a method for evaluation of the driving of a driver operating a vehicle, in accordance with embodiments of the present disclosure. Computer-implemented method 200 for evaluation of the driving of a driver operating a vehicle may be implemented by a data processing device on any architecture and/or computing system. For example, various architectures employing, for example, multiple integrated circuit (IC) chips and/or packages, and/or various computing devices and/or consumer electronic (CE) devices such as multi-function devices, tablets, smart phones, etc., may implement the techniques and/or arrangements described herein. Method 200 may be stored as executable instructions that, upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of method 200.
[0067] According to some embodiments, method 200 for evaluation of the driving of a driver operating a vehicle may commence at operation 204, wherein at least one image 116 captured by at least one camera 104 mounted on the vehicle and positioned to capture at least one image of a scene surrounding the vehicle is received. In some embodiments, the at least one image 116 may be received from the at least one camera 104, for example, by manner of one or both of wired coupling and wireless coupling. In some embodiments, the at least one image 116 may be stored on database 124 and may be received from database 124, for example, by manner of one or both of wired coupling and wireless coupling.
[0068] According to some embodiments, method 200 may comprise operation 208 wherein a scene textual description 138 is generated for each of the at least one image 116. In some embodiments, the scene textual description 138 may be generated using any known image captioning model which generates a scene textual description of an input image. In some embodiments, operation 208 may be carried out by scene description generator module 136.
In some embodiments, the image captioning model may use computer vision and natural language processing to generate the scene textual description. Examples of image captioning generation methods include attention-based methods, attention-based methods that consider spatial and semantic relations between image elements, graph-based methods for spatial and semantic relations between image elements, a combination of attention-based and graph-based methods, and convolution-based methods.
[0069] According to some embodiments, method 200 may comprise operation 212 wherein at least one evaluation metric is evaluated by comparing scene textual description 138 generated in operation 208 against at least one sensor data 120 captured by at least one sensor 108. In some embodiments, each generated scene textual description 138 may be compared against at least one sensor data 120 captured within a predefined time margin of the received at least one image 116 used to generate said scene textual description. An example of a predefined time margin is ± 5 seconds, wherein the generated scene textual description 138 may be compared against at least one sensor data captured within ± 5 seconds from the timepoint at which the at least one image 116 used to generate said scene textual description 138 was captured. In some embodiments, each generated scene textual description 138 may be compared against at least one sensor data 120 captured at the same timepoint as the received at least one image 116 used to generate said scene textual description 138. In some embodiments, operation 212 may be carried out by evaluation module 132. The at least one evaluation metric may be any evaluation metric that may be indicative of a driving behaviour of the driver, the interaction of the driver with other traffic participants, as well as compliance with traffic safety rules and regulations. In some embodiments, the at least one evaluation metric may comprise one or more of traffic light compliance, traffic sign compliance, or lane compliance.
[0070] According to some embodiments, method 200 may comprise operation 216 wherein the evaluation metrics are determined over a journey, and the evaluation metrics are aggregated to determine a journey score for each entity. In some embodiments, operation 216 may be carried out by evaluation module 132. This may facilitate a more comprehensive evaluation of a driver's skill as compared to a single evaluation at a specific time point as the driver's skill is analysed over a journey, and the score would not be largely affected by one or two incidents which may be aberrations. In some embodiments, the evaluation metric may be continuously evaluated over a journey. In some embodiments, the evaluation metric may be evaluated at fixed time intervals, which may achieve accurate results without the high computational load of a continuous evaluation. It is contemplated that the duration of the fixed time intervals may be adjusted based on the accuracy required and/or the computational power available.
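A minimal sketch of aggregating per-interval evaluation values into a journey score for each entity, as in operation 216, assuming Python; the 1/0 compliance values and the mean-based aggregation on a 0-100 scale are illustrative assumptions rather than the claimed scoring scheme.
    from collections import defaultdict

    def journey_scores(interval_values):
        """interval_values: one {entity: value} dict per fixed evaluation interval over the journey."""
        totals, counts = defaultdict(float), defaultdict(int)
        for values in interval_values:
            for entity, value in values.items():
                totals[entity] += value
                counts[entity] += 1
        # Journey score per entity, here the fraction of compliant intervals on a 0-100 scale.
        return {entity: 100.0 * totals[entity] / counts[entity] for entity in totals}

    trip = [{"traffic light": 1, "lane": 1}, {"traffic light": 0, "lane": 1}, {"lane": 1}]
    print(journey_scores(trip))   # {'traffic light': 50.0, 'lane': 100.0}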
[0071] According to some embodiments, method 200 may further comprise operation 220 after operation 204 and before operation 208, wherein in operation 220, a plurality of novel images 144 of the scene is generated based on the received at least one image 116. In some embodiments, the plurality of novel images 144 may comprise one, some, or all possible viewpoints of the scene. Preferably, the plurality of novel images 144 comprises all possible viewpoints of the scene. In embodiments of method 200 further comprising operation 220 after operation 204 and before operation 208, generation of scene textual description 138 in operation 208 is also carried out on the plurality of novel images 144 generated in operation 220. In some embodiments, operation 220 may be carried out by novel view synthesis module 140. It is contemplated that the generation of novel images 144 and scene textual description based on novel images 144 may increase the comprehensiveness of the evaluation of the driver's skill by including perspectives of unseen poses or dimensions found in the plurality of novel images.
[0072] According to some embodiments, operation 220 of generating a plurality of novel images 144 of the scene may comprise using a neural radiance field (NeRF). NeRFs are fully connected neural networks that can generate novel views of complex three-dimensional (3D) scenes based on a partial set of input two-dimensional (2D) images of a scene captured from various viewpoints. A neural network generally comprises an input layer comprising one or more input nodes, one or more hidden layers each comprising one or more hidden nodes, and an output layer comprising one or more output nodes. A fully connected neural network, also known as a multilayer perceptron (MLP), is a type of neural network comprising a series of fully connected layers that connect every neuron in one layer to every neuron in the preceding and subsequent layer. In general, NeRF is capable of representing a scene via values of parameters of a fully connected neural network. In some embodiments, NeRFs are trained to use a rendering loss to reproduce input views of a scene and work by taking multiple input images representing a scene and interpolating between the multiple input images to render the complete scene. A NeRF network may be trained to map directly from viewing direction and location (5D input) to density and colour (4D output), using volume rendering to render new views. In particular, a continuous scene is represented as a 5D vector-valued function whose input is a 3D location x = (x, y, z) and 2D viewing direction (θ, φ), and whose output is an emitted colour c = (r, g, b) and volume density σ. In some embodiments, a NeRF may comprise eight fully connected ReLU layers, each with 256 channels, followed by a sigmoid activation into an additional layer that outputs the volume density σ and a 256-dimensional feature vector which is concatenated with the positional encoding of the input viewing direction to be processed by an additional fully connected ReLU layer with 128 channels and passed onto a final layer with a sigmoid activation which outputs the emitted RGB radiance at position x as viewed by a ray with direction d. In some embodiments, the positional encoding of the input location (γ(x)) may be input into the first layer of the NeRF, as well as through a skip connection that concatenates the input location into the fifth layer's activation. Rendering an image from a possible viewpoint of the scene comprises computing a plurality of rays that go from a virtual camera at the possible viewpoint through the scene; sampling at least one point along each of the plurality of rays to generate a sampled set of locations; using the sampled set of locations and their corresponding viewing direction as input to the NeRF network to produce an output set of colours and densities; and accumulating the output set of colours and densities using classical volume rendering techniques to generate a novel image of the scene at the possible viewpoint. In some embodiments, the NeRF can be generally considered to be based on a hierarchical structure. Specifically, the general overall NeRF network architecture can, for example, be based on a coarse network and a fine network. 
In this regard, a first scene function (e.g., a "coarse" scene function) can, for example, be evaluated at various points (e.g., a first set of points) along the rays corresponding to each image pixel and based on the density values at these coarse points (e.g., evaluation at various points along the rays) another set of points (e.g., a second set of points) can be re-sampled along the same rays. A second scene function (e.g., a "fine" scene function) can be evaluated at the re-sampled points (e.g., the second set of points) to facilitate in obtaining resulting (fine) densities and colours usable in, for example, NeRF rendering (e.g., volume rendering mechanism associated with NeRF). In a general example, to enable gradient updates of the "coarse" scene function, NeRF can be configured to reconstruct pixel colours based on outputs associated with both the "coarse" scene function and the "fine" scene function, and minimize a sum of the coarse and fine pixel errors. More information on NeRFs may be found in "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis" by Mildenhall et al. In the above, although a fully connected neural network has been generally mentioned, it is appreciable that other parametric functions designed for 3D rendering can possibly be applicable. An example can be PlenOctrees (i.e., an octree-based 3D representation which supports view-dependent effects). In this regard, it is appreciable that fully connected neural networks and/or other parametric functions designed for 3D rendering like PlenOctrees can possibly be applicable.
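A minimal sketch of a NeRF network of the kind outlined in paragraph [0072], assuming PyTorch. The layer count, channel widths, skip connection at the fifth layer and the 128-channel view-dependent branch follow the paragraph above; the positional-encoding frequencies (10 for position, 4 for direction) and the ReLU activation on the density output follow the cited Mildenhall et al. paper and are assumptions here.
    import torch
    import torch.nn as nn

    def positional_encoding(x, num_freqs):
        """gamma(x): append sin/cos of the input at exponentially increasing frequencies."""
        out = [x]
        for k in range(num_freqs):
            out.append(torch.sin((2.0 ** k) * x))
            out.append(torch.cos((2.0 ** k) * x))
        return torch.cat(out, dim=-1)

    class NeRFMLP(nn.Module):
        def __init__(self, pos_freqs=10, dir_freqs=4, width=256):
            super().__init__()
            self.pos_freqs, self.dir_freqs = pos_freqs, dir_freqs
            pos_dim = 3 * (1 + 2 * pos_freqs)
            dir_dim = 3 * (1 + 2 * dir_freqs)
            layers, in_dim = [], pos_dim
            for i in range(8):                        # eight fully connected 256-channel layers
                if i == 4:                            # skip connection: gamma(x) re-enters at the fifth layer
                    in_dim += pos_dim
                layers.append(nn.Linear(in_dim, width))
                in_dim = width
            self.trunk = nn.ModuleList(layers)
            self.sigma_head = nn.Linear(width, 1)     # volume density sigma
            self.feature = nn.Linear(width, width)    # 256-dimensional feature vector
            self.dir_layer = nn.Linear(width + dir_dim, 128)
            self.rgb_head = nn.Linear(128, 3)         # emitted RGB radiance

        def forward(self, x, d):
            gx = positional_encoding(x, self.pos_freqs)
            gd = positional_encoding(d, self.dir_freqs)
            h = gx
            for i, layer in enumerate(self.trunk):
                if i == 4:
                    h = torch.cat([h, gx], dim=-1)
                h = torch.relu(layer(h))
            sigma = torch.relu(self.sigma_head(h))    # density, independent of viewing direction
            h = torch.relu(self.dir_layer(torch.cat([self.feature(h), gd], dim=-1)))
            rgb = torch.sigmoid(self.rgb_head(h))     # colour for the viewing direction d
            return rgb, sigma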
[0073] According to some embodiments, operation 220 of generating a plurality of novel images 144 of the scene may comprise using pixel Neural Radiance Field (pixelNeRF) or Single View Neural Radiance Field (SinNeRF). Both pixelNeRF and SinNeRF generate or synthesise novel images from a plurality of viewpoints based on a minimal number of images by utilising pretrained neural networks or models. Usage of pixelNeRF and SinNeRF also allows the utilisation of fewer cameras to capture images while still retaining the ability to generate novel images.
[0074] PixelNeRF is a learning framework for NeRF that predicts a continuous neural scene representation conditioned on one or few input images in a feed-forward manner. As traditional NeRFs are optimised for each scene independently, many calibrated input images of the scene from various viewpoints and significant computing time are required.
PixelNeRF is able to generate novel images using one or few input images by introducing an architecture that conditions a NeRF on image inputs in a fully convolutional manner. Unlike the traditional NeRF which does not use image features, pixelNeRF takes spatial image features aligned to each pixel of an input image(s) as an input. In particular, pixelNeRF comprises two components: a fully convolutional image encoder E which encodes an input image into a pixel-aligned feature grid, and a NeRF network f which outputs colour and density, given a spatial location and its corresponding encoded feature. In some embodiments, the spatial query may be modelled in the input view's camera space, in particular, the corresponding camera that captured the input image. In general, pixelNeRF involves image conditioning, whereby a fully convolutional image feature grid is computed from an input image by the fully convolutional image encoder E, followed by sampling the corresponding image feature via projection and bilinear interpolation for each query spatial point x and viewing direction d of interest in the view coordinate frame, and sending the query specification along with the image features to the NeRF network f that outputs density and colour (4D output), where the spatial image features are fed to each layer as a residual. This image conditioning allows the pixelNeRF framework to be trained on a set of multi-view images to learn scene priors and then subsequently deployed for novel view synthesis from one or few input images. In embodiments where there is more than one input image, the input images may first be encoded into a latent representation in each corresponding camera's coordinate frame, which may then be pooled into an intermediate layer prior to predicting the colour and density. In some embodiments, the pixelNeRF may be supervised with a reconstruction loss between a ground truth image (e.g., the original input image) and a view or image rendered based on the pixelNeRF using conventional volume rendering techniques. In some embodiments, the fully convolutional image encoder E may have a ResNet34 backbone, wherein a feature pyramid is extracted by taking the feature maps prior to the first pooling operation and after the first 3 ResNet layers. An image with resolution H x W may be upsampled bi-linearly to H/2 x W/2 and concatenated into a volume of size 512 x H/2 x W/2. Where an image has a resolution of 64 x 64, the first pooling layer may be skipped so that the image resolutions are at 1/2, 1/2, 1/4, 1/8 of the input rather than 1/2, 1/4, 1/8, 1/16. In some embodiments, NeRF network f may comprise a fully connected ResNet architecture with 5 ResNet blocks and width 512 with an average-pooling operation after block 3. In some embodiments, positional encoding γ from NeRF network f may be used for spatial coordinates, with exponentially increasing frequencies. Additional information on the implementation and training of pixelNeRF may be found at least in Section 5 and Appendix Section B of "pixelNeRF: Neural Radiance Fields from One or Few Images" by Yu et al.
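As an illustrative, non-limiting sketch of the image conditioning described in this paragraph, the snippet below projects a query point into the input view, bilinearly samples the pixel-aligned feature grid produced by the encoder E, and passes the sampled feature together with the positionally encoded query to a stand-in for the NeRF network f. The feature grid, the camera intrinsics and the MLP are random or toy placeholders rather than trained pixelNeRF components.

```python
import numpy as np

H, W, C = 64, 64, 512
feature_grid = np.random.default_rng(1).standard_normal((H // 2, W // 2, C))  # stand-in for E(I)

K = np.array([[60.0, 0.0, W / 2],   # assumed pinhole intrinsics of the input view
              [0.0, 60.0, H / 2],
              [0.0,  0.0,  1.0]])

def positional_encoding(p, n_freqs=6):
    """gamma(p): sin/cos terms at exponentially increasing frequencies, as in NeRF."""
    out = [p]
    for k in range(n_freqs):
        out += [np.sin((2.0 ** k) * np.pi * p), np.cos((2.0 ** k) * np.pi * p)]
    return np.concatenate(out)

def bilinear_sample(grid, u, v):
    """Sample the feature grid at continuous pixel coordinates (u, v)."""
    gh, gw, _ = grid.shape
    u, v = np.clip(u, 0, gw - 1.001), np.clip(v, 0, gh - 1.001)
    x0, y0 = int(u), int(v)
    du, dv = u - x0, v - y0
    return ((1 - du) * (1 - dv) * grid[y0, x0] + du * (1 - dv) * grid[y0, x0 + 1]
            + (1 - du) * dv * grid[y0 + 1, x0] + du * dv * grid[y0 + 1, x0 + 1])

def nerf_mlp(feature, encoded_query):
    """Toy stand-in for f: a real pixelNeRF feeds the feature into each ResNet block as a residual."""
    h = np.tanh(feature[:4].sum() + encoded_query.sum())
    return np.array([0.5 + 0.5 * h] * 3 + [np.exp(h)])      # (r, g, b, sigma)

# Query a 3D point x with viewing direction d, both expressed in the input view's camera frame.
x = np.array([0.2, -0.1, 3.0])
d = np.array([0.0, 0.0, 1.0])
u, v, _ = K @ (x / x[2])                                     # project onto the input image plane
feat = bilinear_sample(feature_grid, u / 2, v / 2)           # feature grid is at half resolution
rgb_sigma = nerf_mlp(feat, np.concatenate([positional_encoding(x), d]))
```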
[0075] Single View Neural Radiance Field (SinNeRF) is a type of Neural Radiance Field (NeRF) that generates novel images from other possible viewpoints of a scene (or unseen view) based on a single input image (also termed reference view) of the scene. In particular, SinNeRF may synthesise patches (also termed textural patterns) from the input reference view and unseen views using a semi-supervised learning process where geometry pseudo labels which focus on the geometry of the radiance field and semantic pseudo labels which focus on the semantic fidelity of the unseen views are introduced and propagated to guide the progressive training process. In some embodiments, geometry pseudo labels may be generated based on: (i) a depth map obtained through transformation of depth information between an input image and other possible viewpoints using image warping to ensure multi-view geometry consistency of the neural field; (ii) regularization of uncertain regions in results through self-supervised inverse depth smoothness loss; and (iii) reprojection of unseen views back to the reference view to enforce geometry consistency. In some embodiments, semantic pseudo labels may be generated based on adversarial training and pre-trained vision transformer (ViT) to regularise the learned appearance representation.
Adversarial training introduces a local texture guidance loss via a generative adversarial network (GAN) framework (or patch discriminator), in which the outputs from the neural radiance field are considered as fake samples and patches randomly cropped from the reference view are regarded as real samples. In some embodiments, differentiable augmentation may be employed to improve the data efficiency of the discriminator. In some embodiments, the GAN framework may be trained using Hinge loss. In some embodiments, a global structure prior may be supported by the pre-trained ViT (such as DINO-ViT) to enforce semantic consistency between unseen views and the reference view. In some embodiments, SinNeRF may have a multi-layer perceptron (MLP) architecture similar to the traditional NeRF. During training iterations, two patches of rays are randomly sampled from both the reference view and a randomly sampled unseen view. In some embodiments, the size of patches on the dataset may be set as 64 x 64. In some embodiments, the rendered patches may be fed into the discriminator and the ViT network. In some embodiments, the rendered patches may be resized to 224 x 224 resolution before feeding into the ViT network. In some embodiments, in the latter stages of the training of the SinNeRF, the stride value may decrease so that the framework may focus on more local regions. In some embodiments, the discriminator may be randomly initialized after reducing the stride size. In each iteration, signed angles between the axis in the reference view's camera coordinates and the axis in the unseen view's camera coordinates may be sampled based on a Gaussian distribution. In some embodiments, loss may be computed by annealing the loss weight during training. In some embodiments, the weight (λ3) of the global structure prior may be larger (e.g., 0.1) than the weight (λ2) of the local texture guidance (e.g., 0) in the early stage of training, and the weight (λ3) of the global structure prior may be reduced (to e.g., 0) while the weight (λ2) of the local texture guidance may be increased (to e.g., 0.1) with a linear function as training proceeds. More information on the generation of geometry pseudo labels and semantic pseudo labels may be found at least in Sections 3.3 and 3.4 of "SinNeRF: Training Neural Radiance Fields on Complex Scenes from a Single Image" by Xu et al., more information on the progressive training strategy may be found at least in Section 3.5 of the same paper, and more information on the implementation of the SinNeRF may be found at least in Section 4.1 of the same paper.
[0076] Fig. 3 is a schematic illustration of a set of cameras mounted on a data collection vehicle, in accordance with embodiments of the present disclosure. In some embodiments, the training dataset for the pixelNeRF or the SinNeRF may be generated by one or more sets of cameras mounted on a data collection vehicle 304. In some embodiments, the training dataset may be generated by a first set of cameras mounted on a left of data collection vehicle 304, a second set of cameras mounted on a right of data collection vehicle 304, a third set of cameras mounted on a front of data collection vehicle 304, and/or a fourth set of cameras mounted on a rear of data collection vehicle 304. In some embodiments, the training dataset may comprise images captured during different environmental conditions, such as during different seasons and different times during a day (e.g., morning, afternoon, evening, and night). In some embodiments, the images in the training dataset may be unlabelled. In some embodiments, the training dataset may comprise a minimum of 15,000 images captured by each camera mounted on the data collection vehicle 304. In some embodiments, each camera may comprise an image sensor, camera and/or video camera equipped with a fisheye or wide-angle lens having a field of view of not less than 60 degrees. In some embodiments, some of the cameras may comprise an image sensor, camera and/or video camera equipped with a rectilinear lens. In some embodiments, each camera may comprise an image sensor and a depth sensor (e.g., RGB-D camera) such that the camera captures both colour and depth information. In some embodiments, each set of cameras may have the same number of cameras. In some embodiments, each set of cameras may have a different number of cameras. In some embodiments, each set of cameras may comprise at least 5 cameras. In some embodiments, a first set of cameras mounted on a left of data collection vehicle 304 and a second set of cameras mounted on a right of data collection vehicle 304 may each comprise a minimum of 10 cameras. In some embodiments, a third set of cameras mounted on a front of data collection vehicle 304, and a fourth set of cameras mounted on a rear of data collection vehicle 304 may each comprise 5 cameras. In some embodiments, the cameras for each set of cameras may be mounted equidistant from each other. In some embodiments, the cameras for each set of cameras may be mounted with differing distances between each other. In some embodiments, each set of cameras may comprise at least 5 cameras mounted equidistant from each other.
[0077] According to some embodiments, each set of cameras may comprise a centre camera 308, and one or more side cameras 312 flanking the sides of centre camera 308. In some embodiments, each set of cameras may comprise a centre camera 308 and four side cameras 312: a first side camera 312a and a second side camera 312b mounted on a first side of centre camera 308, and a third side camera 312c and a fourth side camera 312d mounted on a second side of centre camera 308. In some embodiments, the centre camera 308 may have a horizontal field of view 316 greater than 170 degrees and a vertical field of view (not shown) greater than 130 degrees. In some embodiments, each of the one or more side cameras 312a-312d may have horizontal fields of view 320a-320d that are within the horizontal field of view 316 of the centre camera 308 and each of the one or more side cameras 312a-312d may have vertical fields of view (not shown) that are within the vertical field of view (not shown) of the centre camera 308. In some embodiments, centre camera 308 and the one or more side cameras 312a-312d may each comprise an image sensor. In some embodiments, centre camera 308 may comprise an image sensor and a depth sensor (e.g., RGB-D camera), and the one or more side cameras 312a-312d may comprise an image sensor.
[0078] According to some embodiments, pixelNeRF may be trained on a training dataset generated using four sets of cameras mounted on a data collection vehicle: a first set of cameras mounted on a left of the data collection vehicle, a second set of cameras mounted on a right of the data collection vehicle, a third set of cameras mounted on a front of the data collection vehicle, and a fourth set of cameras mounted on a rear of the data collection vehicle. In some embodiments, each set of cameras may comprise five cameras. In some embodiments, the pixelNeRF may be trained on the training dataset with random weight initialisations for the NeRF network f for a minimum of 500,000 iterations with 128 rays with a learning rate of 10^-4. In some embodiments, during the training of the pixelNeRF, only the weights of the NeRF network f may be updated with backpropagation while the weights of the fully convolutional image encoder E remain unchanged. In some embodiments, the training images for each iteration may cycle among the different sets of cameras (e.g., images from a first set of cameras mounted on a left of a data collection vehicle for the first iteration, images from a second set of cameras mounted on a right of the data collection vehicle for the second iteration, images from a third set of cameras mounted on a front of the data collection vehicle for the third iteration, images from a fourth set of cameras mounted on a rear of the data collection vehicle for the fourth iteration, before cycling back to images from the first set of cameras mounted on a left of the data collection vehicle for the fifth iteration, and so on). In some embodiments, the batch size may decrease as training proceeds to encourage the pixelNeRF to work with a single image or view. For example, where each set of cameras comprises 5 cameras, the first 100,000 iterations may be trained with a batch size of 5, wherein each batch comprises the 5 images captured by a set of cameras at the same time point, the next 200,000 iterations may be trained with a batch size of 2, wherein each batch comprises a first image captured by the centre camera and a second image captured by any one of the side cameras of the set of cameras at the same timepoint as the first image, and the final 200,000 iterations may be trained with a batch size of 1, wherein each batch comprises a single image captured by the centre camera with the widest horizontal and vertical field of view. In some embodiments, peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), or learned perceptual image patch similarity (LPIPS) metrics may be used to evaluate the pixelNeRF on test scenes quantitatively.
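A short sketch of the iteration schedule described in this paragraph is given below; it only shows the cycling of camera sets and the staged batch sizes, with the actual pixelNeRF optimisation step omitted.

```python
# Sketch of the training schedule described above (camera-set cycling + shrinking batch size).
CAMERA_SETS = ["left", "right", "front", "rear"]

def batch_size_for(iteration):
    # first 100k iterations: all 5 views; next 200k: centre + one side view; final 200k: centre only
    if iteration < 100_000:
        return 5
    if iteration < 300_000:
        return 2
    return 1

def training_schedule(total_iterations=500_000):
    for it in range(total_iterations):
        camera_set = CAMERA_SETS[it % len(CAMERA_SETS)]     # cycle left -> right -> front -> rear
        yield it, camera_set, batch_size_for(it)

# Example: inspect the first few entries of the schedule.
for it, cam_set, bs in training_schedule(8):
    print(it, cam_set, bs)
```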
[0079] According to some embodiments, SinNeRF may be trained on a training dataset generated using four sets of cameras mounted on a data collection vehicle: a first set of cameras mounted on a left of the data collection vehicle, a second set of cameras mounted on a right of the data collection vehicle, a third set of cameras mounted on a front of the data collection vehicle, and a fourth set of cameras mounted on a rear of the data collection vehicle. In some embodiments, the centre camera of each set of cameras generating the training dataset for SinNeRF may be an RGB-D camera. In some embodiments, the SinNeRF may be trained on the training dataset with random weight initializations for the MLP, pretrained weights for the ViT pre-trained using ImageNet, and random weight initialization for the GAN framework. In some embodiments, the SinNeRF may be trained using the RAdam optimizer with an initial learning rate of 1e-4. In some embodiments, the learning rate may be decayed by half after every 10,000 iterations, with the learning rate of the GAN discriminator maintained at 1/5 of that of the MLP. In some embodiments, training of the SinNeRF may comprise initialisation of the stride size of ray generation with progressive reduction of the stride size during training to cover a much larger region of the image with limited rays. For example, the stride for sampling of the patches may initialise at 6 and gradually reduce by 2 after every 10,000 iterations.
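The learning-rate and stride schedules described above can be expressed compactly as follows; the 1e-4 base learning rate is the value assumed above, and the floor of 2 on the stride is an assumption so that the stride stays positive.

```python
def learning_rate(iteration, base_lr=1e-4):
    # Halve the MLP learning rate after every 10,000 iterations (base_lr is an assumed value).
    return base_lr * (0.5 ** (iteration // 10_000))

def patch_stride(iteration, start=6, step=2, floor=2):
    # Start sampling patches with stride 6 and reduce by 2 every 10,000 iterations
    # (the floor of 2 is an assumption so the stride stays positive).
    return max(floor, start - step * (iteration // 10_000))

for it in (0, 10_000, 25_000, 60_000):
    print(it, learning_rate(it), patch_stride(it))
```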
[0080] According to some embodiments, when training the SinNeRF, the centre camera of each set of cameras may be designated as the reference camera, with the colour information captured by the centre camera designated as the reference view while the depth information captured by the centre camera designated as reference depth.
[0081] According to some embodiments, one iteration may comprise four batches of input images wherein each batch may comprise images from a set of cameras captured at the same time point. In some embodiments, training of the SinNeRF may comprise initialisation of parameters λ1, λ2, and λ3 to 0.8, 0.0, and 0.1 respectively, wherein λ1, λ2, and λ3 are weighting factors for the calculation of an overall loss function. In some embodiments, the SinNeRF may be trained for a maximum of 100,000 iterations with each iteration being trained on a batch of images comprising images captured by a set of cameras at a specific time point. In some embodiments, the SinNeRF may be trained in batches, wherein the input images for each batch cycle among the different sets of cameras (e.g., images from a third set of cameras mounted on a front of the data collection vehicle for the first batch, images from a first set of cameras mounted on a left of a data collection vehicle for the second batch, images from a fourth set of cameras mounted on a rear of the data collection vehicle for the third batch, and images from a second set of cameras mounted on a right of the data collection vehicle for the fourth batch).
[0082] According to some embodiments, for each batch of input images, all the images may be passed as input into the multi-layer perceptron (MLP) of the SinNeRF and a pixel loss may be calculated between a colour rendered by the MLP and the ground truth colour. Unseen views may then be generated from the MLP with a predefined set of camera intrinsic and extrinsic matrices with respect to the designated reference view. The unseen views are not restricted to the views from the side cameras of the set of cameras. A first patch from the reference view and a second patch from a random unseen view may be randomly sampled. Image warping may then be applied to the unseen view to generate a depth map and the geometric loss between the depth map of the unseen view and the reference depth may be computed. Local texture guidance may then be applied using a patch discriminator, wherein the output from the MLP (i.e., the selected sample from an unseen view) is designated as a fake sample and the patch from the reference view is designated as a real sample. Augmentation may then be applied on both the fake sample and the real sample and a GAN network may be trained with the augmented fake and real samples to compute an adversarial loss. Global structure guidance may also be applied using a ViT pre-trained on ImageNet, wherein the reference view patch and unseen view patch are passed as input into the pre-trained ViT which outputs features. In some embodiments, a classification (cls) token may be added in ViT to serve as a representation of an entire image, and the ViT model may output features from the cls token for the reference view patch and unseen view patch. In some embodiments, a classification loss, or loss (cls), between the features extracted from the reference view and features extracted from the unseen view may be computed. In some embodiments, the loss (cls) may be computed as an L2 loss. An overall loss function may be computed according to the following equation: Ltotal = Lpixel + λ1 Lgeo + λ2 Ladv + λ3 Lcls, where Ltotal represents the total loss, Lpixel represents the pixel loss, Lgeo represents the geometry loss, Ladv represents the adversarial loss, Lcls represents the cls loss, and λ1, λ2, and λ3 are weighting factors. The process is then repeated for each batch of input images.
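A minimal sketch of the overall loss combination is shown below; the four loss terms are simple placeholders (mean squared pixel error, L1 depth error, a hinge-style generator term, and an L2 distance between cls features) rather than the exact losses of SinNeRF.

```python
import numpy as np

def total_loss(rendered, reference, depth_unseen, depth_ref,
               disc_fake_score, cls_ref, cls_unseen, lambdas=(0.8, 0.0, 0.1)):
    """L_total = L_pixel + lambda1 * L_geo + lambda2 * L_adv + lambda3 * L_cls (placeholder terms)."""
    l1, l2, l3 = lambdas
    l_pixel = np.mean((rendered - reference) ** 2)       # colour error against the ground truth
    l_geo = np.mean(np.abs(depth_unseen - depth_ref))    # geometry pseudo-label error
    l_adv = -np.mean(disc_fake_score)                    # generator side of the adversarial loss
    l_cls = np.mean((cls_ref - cls_unseen) ** 2)         # L2 loss between ViT cls features
    return l_pixel + l1 * l_geo + l2 * l_adv + l3 * l_cls

rng = np.random.default_rng(0)
loss = total_loss(rng.random((64, 64, 3)), rng.random((64, 64, 3)),
                  rng.random((64, 64)), rng.random((64, 64)),
                  rng.random(8), rng.random(384), rng.random(384))
print(loss)
```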
[0083] According to some embodiments, as the number of training iterations increases, the stride size may be decreased, the weighting factors λ2 and λ3 may be changed, and the viewing angle between the reference view and unseen view may be enlarged. In some embodiments, after each iteration, λ3 may be decreased and λ2 may be increased until λ3 is 0.0 and λ2 is 0.1, after which the values of λ2 and λ3 will not be changed.
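A possible implementation of this weight schedule is sketched below; the number of iterations over which λ2 and λ3 are linearly annealed is an assumed value, as it is not fixed above.

```python
def annealed_weights(iteration, anneal_iters=50_000):
    # Linearly move lambda3 from 0.1 to 0.0 and lambda2 from 0.0 to 0.1 over the first
    # `anneal_iters` iterations (the annealing horizon is an assumed value); lambda1 stays at 0.8.
    frac = min(1.0, iteration / anneal_iters)
    return 0.8, 0.1 * frac, 0.1 * (1.0 - frac)   # (lambda1, lambda2, lambda3)

for it in (0, 25_000, 50_000, 90_000):
    print(it, annealed_weights(it))
```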
[0084] Fig. 4 is a pictorial representation of an image captioning model that may be used to generate a scene textual description based on an input image, in accordance with embodiments of the present disclosure. It is emphasized that the image captioning model described herein is only an example of an image captioning model that may be employed to generate a scene textual description based on the at least one image and/or plurality of novel images. Scene description generation, or image captioning, is mostly presented as a sequence-to-sequence problem in computer vision, wherein the goal is to convert a specific sequence to the appropriate corresponding sequence.
[0085] According to some embodiments, image captioning model 400 may be employed by scene description generator module 136 to generate scene textual description 138. In some embodiments, image captioning model 400 may generate a scene textual description 138 based on an input image 408. In some embodiments, the input image 408 may be an image 116 captured by camera 104. In some embodiments, the input image 408 may be a novel image 144 generated by the novel view synthesis module 140. In some embodiments, image captioning model 400 generating a scene textual description 138 based on input image 408 may comprise an encoder neural network 412 and a decoder neural network 416. In some embodiments, the encoder neural network 412 may receive an input image 408 and may generate at least one intermediate representation 420 of the contents of the input image 408 as output. The at least one intermediate representation 420 may then be input into decoder neural network 416 wherein the at least one intermediate representation 420 is converted into a sequence of caption words, wherein the sequence of caption words makes up the scene textual description 138 of the input image 408.

[0086] According to some embodiments, the encoder neural network 412 may be a convolutional neural network, or CNN, and the intermediate representation 420 may be at least one feature vector. A convolutional neural network (CNN) is a multi-layered feed-forward neural network, made by stacking many hidden layers on top of each other in sequence. The sequential design may allow CNNs to learn hierarchical features. The hidden layers are typically convolutional layers followed by activation layers, some of them followed by pooling layers. The CNN may be configured to identify patterns in data. The convolutional layer may include convolutional kernels that are used to look for patterns across the input data. The convolutional kernel may return a large positive value for a portion of the input data that matches the kernel's pattern or may return a smaller value for another portion of the input data that does not match the kernel's pattern. A CNN is preferred as a CNN may be able to extract informative features from the training data without the need for manual processing of the training data. Also, a CNN is computationally efficient as a CNN is able to assemble patterns of increasing complexity using the relatively small kernels in each hidden layer. The CNN outputs feature maps which may be linearised into a feature vector to be input into decoder neural network 416.
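For illustration only, the encoder-decoder data flow of image captioning model 400 can be sketched with a toy, untrained PyTorch module: a small CNN stands in for encoder neural network 412 and produces a feature vector, and an LSTM cell stands in for decoder neural network 416 and emits caption word indices until an assumed end token or a maximum length is reached. The class name, vocabulary size and token ids are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Toy encoder-decoder: a small CNN produces a feature vector, an LSTM cell emits caption tokens."""
    def __init__(self, vocab_size=1000, feat_dim=128, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(                     # stand-in for the VGG-style encoder E
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim))
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.lstm = nn.LSTMCell(feat_dim, hidden)
        self.to_vocab = nn.Linear(hidden, vocab_size)

    def forward(self, image, max_len=12, start_token=1, end_token=2):
        feat = self.encoder(image)                        # intermediate representation (feature vector)
        h = torch.zeros(image.shape[0], self.lstm.hidden_size)
        c = torch.zeros_like(h)
        h, c = self.lstm(feat, (h, c))                    # prime the LSTM with the image feature
        token = torch.full((image.shape[0],), start_token, dtype=torch.long)
        caption = []
        for _ in range(max_len):
            h, c = self.lstm(self.embed(token), (h, c))
            token = self.to_vocab(h).argmax(dim=-1)       # greedy choice of the next caption word
            caption.append(token)
            if (token == end_token).all():                # stop at the assumed "end" token
                break
        return torch.stack(caption, dim=1)

caption_ids = TinyCaptioner()(torch.randn(1, 3, 224, 224))   # untrained, so the ids are arbitrary
```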
[0087] According to some embodiments, the encoder neural network 412 may be a CNN having a Visual Geometry Group (VGG) architecture. There are multiple known configurations of CNN having a VGG architecture based on the depth of the network, including but not limited to VGG-16 and VGG-19. CNNs having a VGG architecture typically receive a 224 x 224 RGB image as input and use a very small 3x3 filter throughout the entire neural network with a stride of 1 pixel in combination with non-linear activation layers, which results in more discriminative decision functions and a reduction in the number of weight parameters. In general, CNNs with VGG architecture have a basic building block of a stack of multiple convolutional layers (usually ranging from 1 to 3) with 3x3 filter size, stride of 1, and padding of 1, and each building block (or stack of multiple convolutional layers) is followed by a max pooling layer before being fed into the next building block. The basic building block is repeated with each basic building block comprising differing configurations of the convolutional layers to achieve differing depths. The basic building blocks are then followed by three fully connected layers: a first fully connected layer with 4096 neurons, a second fully connected layer with 4096 neurons, and a third fully connected layer with 1000 neurons. The third fully connected layer is then activated with a softmax activation function for categorical classification.
[0088] An example of a CNN having a VGG architecture is VGG-16. VGG-16 is a CNN that comprises 16 weight layers and 138 million parameters. VGG-16 further comprises 5 max-pooling layers. The input to a VGG-16 is a 224x224 RGB image comprising 3 channels (R, G, and B representing red, green, and blue channels), and the output from a VGG-16 is feature maps with size 14x14 and dimension of 512. The RGB value for each pixel is normalised by subtracting the mean value from each pixel. The input image is passed through a first building block of 2 convolutional layers, wherein each convolutional layer comprises 64 filters with 3x3 filter size, followed by ReLU activation to generate output feature maps. The output feature maps are passed through max pooling over a 2x2 pixel window with a stride of 2 pixels to halve the size of the feature maps. The feature maps are then passed through multiple similar building blocks to generate feature maps: a second building block with 2 convolutional layers (each comprising 128 filters) and a max pooling layer, a third building block with 3 convolutional layers (each comprising 256 filters) and a max pooling layer, a fourth building block with 3 convolutional layers (each comprising 512 filters) and a max pooling layer, and a fifth building block with 3 convolutional layers (each comprising 512 filters) and a max pooling layer. In general, all convolutions in the CNN with VGG-16 architecture are carried out with a convolution stride of 1 pixel and padding of 1 pixel to preserve the spatial resolution and generate an output with the same dimensions as the input. The building blocks are followed by three fully connected layers interspersed with a flattening layer: a first fully connected layer with 4096 neurons, a first flattening layer, a second fully connected layer with 4096 neurons, a second flattening layer, and a third fully connected layer with 1000 neurons. The output of the third fully connected layer is followed by a softmax activation layer for categorical classification. The method of training a VGG-16 may be found at least in Section 3.1 of a paper titled "VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION" by Karen Simonyan & Andrew Zisserman.
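The VGG-16 feature extractor is available off the shelf; assuming a recent torchvision (0.13 or later), the convolutional building blocks described above can be reused as an encoder, with the final max pooling layer dropped so that the 14x14x512 feature maps mentioned in this paragraph are exposed. Loading ImageNet weights through the weights argument is optional and shown only as an assumption about deployment.

```python
import torch
from torchvision.models import vgg16

# Build the 16-layer VGG; pass weights="IMAGENET1K_V1" instead of None to load ImageNet weights.
model = vgg16(weights=None)
encoder = model.features[:-1]          # drop the final max pool so the output keeps the 14x14x512 shape

image = torch.randn(1, 3, 224, 224)    # a normalised 224x224 RGB input (batch of 1)
with torch.no_grad():
    feature_maps = encoder(image)      # -> (1, 512, 14, 14)
feature_vector = torch.flatten(feature_maps, start_dim=1)   # linearised for the decoder RNN
print(feature_maps.shape, feature_vector.shape)
```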
[0089] According to some embodiments, the decoder neural network 416 may be a neural network that uses at least a part of its previous output as at least a part of its next input. An example of a neural network that uses at least a part of its previous output as at least a part of its next input is a recurrent neural network (RNN). RNNs are a class of neural networks with at least one feedback connection that allows previous outputs to be used as inputs while having hidden states (or "memory"). The feedback connection is from the hidden layer to itself to access information at a previous timestep, or with a delay of one time unit. RNNs perform the same task for each element of a sequence, with the input for each computation being dependent on the output of the previous computation. Unlike general neural networks which use different parameters at each layer, an RNN converts independent activations into dependent ones. Examples of RNNs include an RNN with long short-term memory architecture (LSTM), and a Multi-Layer Perceptron (MLP) with added loops. An RNN is preferred as an RNN can model time sequences or sequential information so that each pattern may be assumed to be dependent on previous records. An RNN may process input of any length with model size not increasing with the size of the input. The computation of an RNN also takes into account historical information or sequence, and the weights are shared across time.
[0090] According to some embodiments, the decoder neural network 416 may be an RNN-LSTM, or a recurrent neural network with long short-term memory architecture, wherein a structure called a memory cell is included. The inclusion of a memory cell solves the existing problems of vanishing or exploding gradients in regular RNNs and the LSTM-RNN is capable of learning lengthy-time-period dependencies due to its ability to maintain information in memory for long periods of time. Each output of an LSTM depends on the input at the current time step and the previous hidden state determined based on a previous timestep. When an LSTM-RNN is used as the decoder neural network 416 in image captioning model 400, the LSTM-RNN incorporates external visual information provided by the encoder neural network 412 and intermediate representation 420 to influence linguistic choices or output at different stages by using regional features of the image derived from the encoder neural network 412 in addition to the most recently output word. The LSTM-RNN is then trained to vectorise the image-caption mixture such that the resulting vector can be used to predict the next output word.
[0091] The memory cell of the LSTM-RNN has linear dependence of its current activity on its past activities and comprises an input gate, a neuron with a self-recurrent connection to itself, a forget gate, and an output gate. The self-recurrent connection has a weight of 1, which ensures that the state of a memory cell can remain constant from one time step to another in the absence of outside interference. The forget gate, input gate, and output gate all modulate the interactions between the memory cell itself and its environment.
[0092] The forget gate modulates the information flow between a time step and a previous time step, modulating the memory cell's self-recurrent connection and allowing the memory cell to remember or forget its previous state as required. The forget gate may be expressed by the equation ft = σ(Wf · [ht-1, xt] + bf), wherein t is the time step, ft is the forget gate at t, xt is the input data, ht-1 is the previous hidden state, Wf is the weight matrix between the forget gate and input gate, and bf is the connection bias at t. The previous hidden state ht-1 and the input data xt are passed through the sigmoid function, which generates values between 0 and 1 that are then multiplied by the cell state of the previous timestep. If the value generated is 0, the memory cell will forget everything, and if the value generated is 1, the memory cell will forget nothing.
[0093] The input gate modulates the input into the memory cell by determining what information is relevant to update the memory cell state, given the previous hidden state and new input data. The input gate performs two operations to update the cell status. The first operation may be expressed by the equation it = σ(Wi · [ht-1, xt] + bi) and the second operation may be expressed by the equation C't = tanh(WC · [ht-1, xt] + bC), wherein t is the time step, it is the input gate at t, Wi is the weight matrix of the sigmoid operator between the input gate and output gate, bi is the bias vector at t, C't is the value generated by tanh, WC is the weight matrix of the tanh operator between cell state information and network output, and bC is the bias vector at t with regard to WC. In the first operation, the previous hidden state ht-1 and the input data xt are passed through a sigmoid function which generates values between 0 and 1. In the second operation, the previous hidden state ht-1 and the input data xt are passed through a tanh function which regulates the network by creating a vector C't with all the possible values between -1 and 1.
[0094] The forget gate and input gate are used to determine the new cell state. The new cell state may be determined by the equation Ct = ft × Ct-1 + it × C't, wherein t is the timestep, Ct is the cell state information of the new cell state, ft is the forget gate at t, it is the input gate at t, Ct-1 is the cell state at the previous timestep, and C't is the value generated by tanh. The previous cell state Ct-1 is pointwise multiplied with forget gate ft, and the value generated by tanh, C't, is pointwise multiplied with input gate it. The results of both pointwise multiplications are then added to generate the current cell state Ct and result in the long-term memory of the network being updated.
[0095] The output gate modulates the output from the memory cell and determines the new hidden state that will be passed into the next memory cell. The output gate involves two operations. The first operation may be expressed by the equation ot = σ(Wo · [ht-1, xt] + bo) and the second operation may be expressed by the equation ht = ot × tanh(Ct), wherein t is the time step, ot is the output gate at t, Wo is the weight matrix of the output gate, bo is the bias vector with regard to Wo, and ht is the new hidden state to be passed onto the next memory cell. In the first operation, the previous hidden state ht-1 and the input data xt are passed through a sigmoid function. In the second operation, the current cell state is passed through a tanh function and undergoes pointwise multiplication with the output gate ot from the first operation. If the output of the current timestep is required, a softmax activation is applied on the new hidden state ht. The output at each timestep is then used to generate scene textual description 138.
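The gate equations of paragraphs [0092] to [0095] translate directly into code; the following is a minimal NumPy version of a single memory-cell update with randomly initialised stand-in weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One memory-cell update following the forget/input/output gate equations above."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)            # forget gate
    i_t = sigmoid(W_i @ z + b_i)            # input gate
    c_tilde = np.tanh(W_c @ z + b_c)        # candidate cell state C'_t
    c_t = f_t * c_prev + i_t * c_tilde      # new cell state Ct
    o_t = sigmoid(W_o @ z + b_o)            # output gate
    h_t = o_t * np.tanh(c_t)                # new hidden state ht
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 16
params = []
for _ in range(4):                          # independent stand-in weights for the f, i, C', o gates
    params += [rng.standard_normal((n_hidden, n_hidden + n_in)) * 0.1, np.zeros(n_hidden)]
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_step(rng.standard_normal(n_in), h, c, *params)
```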
[0096] When the image captioning model 400 is deployed with an LSTM-RNN as the decoder neural network 416, the LSTM-RNN does not know the real caption of an input image, and also does not have a source word and destination word. The start of the caption or scene textual description 138 is generated by a first memory cell of the LSTM-RNN based on the intermediate representation 420 or feature vector extracted by the encoder neural network 412. The first word is generated based on the start of the caption and the second word is generated based on the first word. When the destination word is "end" or the length of the generated caption exceeds a predefined threshold, the scene textual description 138 is generated.
[0097] According to some embodiments, the image captioning model 400 may be trained on an ImageNet dataset available at https://image-net.org/ with stochastic gradient descent using adaptive learning rate algorithms with a learning rate of 0.005. All weights may be randomly initialised, such as the word embeddings. In some embodiments, the weights for the encoder neural network 412 may not be randomly initialised if the encoder neural network 412 was pretrained. In some embodiments, the encoder neural network 412 may be fine-tuned based on the dataset used. In some embodiments, 512 dimensions may be used for the embeddings arid the size of the LSTM memory. In some embodiments, the mini-batch size may be set as 10. In some embodiments, the image captioning model 400 may be evaluated using bilingual evaluation understudy (BLEU) evaluation metrics.
[0098] Examples of scene textual descriptions 138 generated are as follows: - There are traffic signs, pedestrians, red traffic lights and cars on the road.
- Car is turning left, moving away from the lane.
- There is a sign board with symbol 50.

[0099] Fig. 5 is a flowchart of a method for evaluation of at least one evaluation metric, in accordance with embodiments of the present disclosure. In some embodiments, method 500 for evaluation of at least one evaluation metric may be performed by evaluation module 132. In some embodiments, method 500 for evaluation of at least one evaluation metric may be carried out in operation 212 wherein at least one evaluation metric is evaluated by comparing the scene textual description generated in operation 208 against at least one sensor data 120.
In general, method 500 for evaluation of at least one evaluation metric may comprise using entity recognition and keyword-based matching techniques.
[0100] According to some embodiments, method 500 may comprise operation 504, wherein entity recognition is carried out on the scene textual description generated in operation 208 to identify one or more entities mentioned in the scene textual description. In some embodiments, the scene textual description may be scene textual description 138 generated by scene description generator module 136. In some embodiments, the scene textual description may be the scene textual description 138 generated by image captioning model 400. Entity recognition involves the locating and classification of named entities mentioned in unstructured text. Examples of entities include but are not limited to traffic light, traffic sign, sign board, lane deviation, and vehicle closeness. In some embodiments, entity recognition may be carried out using rule-based or learning-based techniques. Rule-based entity recognition involves using predefined (either manually or automatically) rules and patterns of named entities for entity recognition while learning-based entity recognition involves using machine learning models trained on large sets of annotated training data for entity recognition. Examples of machine learning models commonly employed include hidden Markov models (HMMs), conditional random fields (CRFs), support vector machines (SVMs), and maximum entropy models, or some combination thereof.

[0101] According to some embodiments, method 500 may comprise operation 508 wherein a value is assigned to each of the one or more entities identified in operation 504 based on at least one sensor data 120, wherein the at least one sensor data 120 is predefined for each of the one or more identified entities. In some embodiments, the assigned value may be a textual (e.g., true, false, safe, unsafe) or numerical (e.g., 0, 1) value, or any other appropriate value to indicate the result of the evaluation. In some embodiments, the assigned value may be a binary value. In a first example, where the identified entity is traffic sign, traffic light, or sign board, the predefined at least one sensor data 120 may be vehicle speed data from a speed sensor installed on the vehicle. In another example, where the identified entity is lane deviation, the predefined at least one sensor data 120 may be lateral offset ratio (LOR) from one or more lane centering or lane deviation sensors such as cameras, image sensors, laser, radar, or infrared sensors in a lane deviation warning system, or may be turn indicator activation sensor. In yet another example, where the identified entity is vehicle closeness, the predefined at least one sensor data 120 may be proximity data from proximity sensors installed on the vehicle.
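As a purely illustrative example of the rule-based variant of operation 504, a small keyword-pattern recogniser is sketched below; the entity vocabulary and phrasings are assumptions chosen to match the entities mentioned above.

```python
# Minimal rule-based entity recognition over a generated scene textual description.
ENTITY_PATTERNS = {                      # illustrative patterns; a deployed system could be learning-based
    "traffic light": ["traffic light", "traffic lights"],
    "traffic sign": ["traffic sign", "traffic signs"],
    "sign board": ["sign board", "signboard"],
    "lane deviation": ["turning left", "turning right", "crossing the lane"],
    "vehicle closeness": ["close to the vehicle", "tailgating"],
}

def recognise_entities(description):
    text = description.lower()
    return [entity for entity, patterns in ENTITY_PATTERNS.items()
            if any(p in text for p in patterns)]

print(recognise_entities("Car is turning left, crossing the lane."))
# -> ['lane deviation']
```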
[0102] Fig. 6 is a flowchart of a method for assigning a value to each of the one or more identified entities, in accordance with embodiments of the present disclosure. In some embodiments, method 600 for assigning a value to each of the one or more identified entities may be performed by evaluation module 132. In some embodiments, method 600 for assigning a value to each of the one or more identified entities may be carried out in operation 508 wherein a value is assigned to each of the one or more entities identified in operation 504 based on at least one sensor data 120, wherein the at least one sensor data 120 is predefined for each of the one or more identified entities.
[0103] According to some embodiments, method 600 may comprise operation 604 wherein keywords in the scene textual description associated with each of the one or more identified entities are identified based on a set of keywords predefined for each of the one or more identified entities. The set of keywords may comprise words or numerical values. In some embodiments, each keyword may comprise a single word or numerical value. In some embodiments, each keyword may comprise a plurality of words or numerical values. The set of keywords may be predefined manually or automatically. The words in the scene textual description will be grouped with an identified entity if they are found in the set of keywords predefined for that identified entity. In a first example, where the identified entity is "traffic light", the set of predefined keywords may include "green", "amber", "red", or any other appropriate keyword(s). In another example, where the identified entity is "traffic sign" or "sign board", the set of predefined keywords may include "stop", "50", "60", or any other appropriate keyword(s).
In yet another example, where the identified entity is "lane deviation", the set of predefined keywords may include "turn", "left", "right", or any other appropriate keyword(s).
[0104] According to some embodiments, method 600 may comprise operation 608 wherein the at least one sensor data predefined for each of the one or more identified entities is compared against a threshold predefined for each keyword. The threshold may be predefined manually or automatically. In one example, where the identified entity is "traffic light" and the keyword is "red", the predefined sensor data may be speed data and the predefined threshold may be "0". In another example, where the identified entity is "traffic sign" and the keyword is "50", the predefined sensor data may be speed data and the predefined threshold may be "50". In yet another example, where the identified entity is "sign board" and the keyword is "stop", the predefined sensor data may be speed data and the predefined threshold may be "0". In yet another example, where the identified entity is "lane deviation" and the keyword is "left", the predefined sensor data may be lateral offset ratio (LOR) data and/or turn indicator activation sensor data. LOR may range from -1 to 0.25. The range of values of LOR for a lane departure zone is -1 < LOR < 0. The range of values of LOR for the no lane departure zone is 0 < LOR < 0.25, with the maximum value of LOR indicating that the vehicle is located at the centre of the lane. The predefined threshold for LOR may therefore be -1 < LOR < 0, and the predefined threshold of the turn indicator activation sensor data may be "on".
[0105] According to some embodiments, method 600 may comprise operation 612 wherein a value is assigned to each of the one or more identified entities based on the comparison of the at least one predefined sensor data against the predefined threshold. In some embodiments, the assigned value may be "true" or "false" based on whether the predefined sensor data is higher or lower than the predefined threshold. In some embodiments, the assigned value may be "safe" or "unsafe" based on whether the predefined sensor data is higher or lower than the predefined threshold. In some embodiments, the assigned value may be any binary value.
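Operations 604 to 612 can be illustrated with the small sketch below, which follows the examples given above for speed-limit boards, stop/red conditions and lane deviation; the keyword sets, the interpretation of the thresholds and the sensor field names are illustrative assumptions rather than fixed by the present disclosure.

```python
KEYWORDS = {                                  # predefined keyword sets (illustrative subset)
    "traffic light": ["red", "amber", "green"],
    "sign board": ["stop", "50", "60"],
    "lane deviation": ["turn", "left", "right"],
}

def identify_keywords(entity, description):
    words = description.lower().replace(",", " ").replace(".", " ").split()
    return [w for w in words if w in KEYWORDS.get(entity, [])]

def assign_value(entity, keyword, sensors):
    """Compare the sensor data predefined for the entity against the keyword's threshold."""
    if entity in ("sign board", "traffic sign") and keyword.isdigit():
        return sensors["speed"] <= float(keyword)              # speed-limit board: stay at or below the limit
    if (entity, keyword) in (("sign board", "stop"), ("traffic light", "red")):
        return sensors["speed"] <= 0.0                         # threshold 0 read as "vehicle is stationary"
    if entity == "lane deviation" and keyword in ("left", "right"):
        return (-1.0 < sensors["lor"] < 0.0                    # lane departure zone
                and sensors[f"{keyword}_indicator"] == "on")   # with the matching indicator active
    return None                                                # no rule predefined for this pair

sensors = {"speed": 42.0, "lor": -0.3, "left_indicator": "on", "right_indicator": "off"}
description = "Car is turning left, crossing the lane."
for kw in identify_keywords("lane deviation", description):
    print(kw, assign_value("lane deviation", kw, sensors))     # e.g. left -> True
```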
[0106] Figs. 7A to 7D are examples of images 116 captured by a camera 104, in accordance with embodiments of the present disclosure. Table 1 below provides examples of scene textual descriptions generated for Figs. 7A to 7D and the evaluation of the generated scene textual description against at least one evaluation metric.
Table 1
Input Image: Fig. 7A
Scene textual description: "There are sidewalks, sign board with symbol 60 ahead and some buses and cars by the road."
Entity: Sign board; Identified keyword: 60
Sensor data: Speed sensor data; Threshold: 60
Value: "true" if speed sensor data ≤ 60; "false" if speed sensor data > 60

Input Image: Fig. 7B
Scene textual description: "There are sidewalks and sign board stop on the left side of the road."
Entity: Sign board; Identified keyword: Stop
Sensor data: Speed sensor data; Threshold: 0
Value: "true" if speed sensor data ≤ 0; "false" if speed sensor data > 0

Input Image: Fig. 7C
Scene textual description: "There are traffic signs and some cars in the distance, the traffic light is red."
Entity: Traffic light; Identified keyword: Red
Sensor data: Speed sensor data; Threshold: 0
Value: "true" if speed sensor data ≤ 0; "false" if speed sensor data > 0

Input Image: Fig. 7D
Scene textual description: "Car is turning left, crossing the lane."
Entity: Lane deviation; Identified keyword: Left
Sensor data: Lateral offset ratio (LOR) data; Left turn indicator activation sensor data; Threshold: -1 < LOR < 0; On
Value: "true" if -1 < LOR < 0 and the left turn indicator activation sensor is on; otherwise, "false"

[0107] Fig. 8 is a schematic illustration of a computer system 800 within which a set of instructions, when executed, may cause one or more processors 808 of the computer system to perform one or more of the methods described herein, in accordance with embodiments of the present disclosure. It is noted that computer system 800 described herein is only an example of a computer system that may be employed and computer systems with other hardware or software configurations may be employed. In some embodiments, the one or more processors 808 may be the same as the at least one processor 112. In some embodiments, computer system 800 may be connected to one or more data storage devices 816, such connection to the one or more data storage devices 816 may be wired or wireless.
The data storage device 816 may include a plurality of data storage devices. The storage device 816 may include, for example, long term storage (e.g., a hard drive, a tape storage device, flash memory), short-term storage (e.g., a random-access memory, a graphics memory), and/or any other type of computer readable storage. The modules and devices described herein can, for example, utilize the one or more processors to execute computer-executable instructions and/or include a processor to execute computer executable instructions (e.g., an encryption processing unit, a field programmable gate array processing unit).
[0108] In some embodiments, computer system 800 may comprise a server computer, a laptop, a personal computer, a desktop computer, or any machine capable of executing a set of instructions that specify actions to be taken by the computer system. Computer system 800 may comprise one or more processors 808 and one or more memory 828 which communicate with each other via a bus 836. Computer system 800 may further comprise a network interface device 844 which allows computer system 800 to communicate with a network 852. In some embodiments, computer system 800 may further comprise a disk drive unit 860 which may include a machine-readable storage medium 868 on which is stored one or more sets of instructions 876 embodying one or more of the methods described herein. The one or more sets of instructions 876 may also reside in the one or more processors 808 or the one or more memory 828. In some embodiments, the one or more sets of instructions 876 may be received as a data carrier signal received by computer system 800. In some embodiments, computer system 800 may comprise an I/O interface 884 for communication with another information processing system, for receiving information through an input device 892, or for outputting information to an output device 898. In some embodiments, the input device 892 may be any input device that may be controlled by a human, such as a mouse, a keyboard or a touchscreen. In some embodiments, the output device 898 may include a display.
[0109] Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based here on. Accordingly, the embodiments of the present invention are intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.

Claims (15)

1. A computer-implemented method for evaluation of the driving of a driver operating a vehicle, the method comprising: receiving at least one image (116) captured by at least one camera (104) mounted on the vehicle and positioned to capture at least one image of a scene surrounding the vehicle; generating a scene textual description (136) for each of the at least one images; and evaluating at least one evaluation metric by comparing the generated scene textual description against at least one sensor data (120) captured by at least one sensor (108).
2. The computer-implemented method of claim 1, wherein the at least one sensor (108) is different from the at least one camera (104).
3. The computer-implemented method of any of the preceding claims, wherein evaluating at least one evaluation metric comprises comparing each generated scene textual description against at least one sensor data captured within a predefined time margin of or at the same timepoint as the received at least one image (116) used to generate said scene textual description.
4. The computer-implemented method of any of the preceding claims, wherein after receiving the at least one image and before generating the scene textual description, the method further comprises: generating a plurality of novel images (144) of the scene based on the received at least one image, wherein preferably the plurality of novel images comprises all possible viewpoints of the scene; and generating the scene textual description is also carried out on the plurality of novel images.
5. The computer-implemented method of any of the preceding claims, wherein generating a plurality of novel images of the scene comprises using pixel Neural Radiance Field (pixelNeRF) or Single View Neural Radiance Field (SinNeRF).
6. The computer-implemented method of claim 5, wherein the pixelNeRF or SinNeRF is trained on a training dataset generated by a first set of cameras mounted on a left of a data collection vehicle, a second set of cameras mounted on a right of the data collection vehicle, a third set of cameras mounted on a front of the data collection vehicle, and/or a fourth set of cameras mounted on a rear of the data collection vehicle.
7. The computer-implemented method of claim 6, wherein each set of cameras comprises at least five cameras mounted equidistant from each other, wherein preferably each set of cameras has a centre camera with a horizontal field of view greater than 170 degrees and a vertical field of view greater than 130 degrees and the other cameras have fields of view that are within the horizontal field of view and vertical field of view of the centre camera.
8. The computer-implemented method of any of the preceding claims, wherein generating a scene textual description for each of the at least one image comprises: passing each of the at least one image through an encoder neural network (412) to generate an intermediate representation (420) of each of the at least one image, wherein the encoder neural network is preferably a convolutional neural network (CNN) and the intermediate representation is preferably at least one feature vector, and wherein more preferably the encoder neural network is a convolutional neural network having a Visual Geometry Group (VGG) architecture; and passing the intermediate representation through a decoder neural network (416), preferably a decoder neural network using at least a part of its previous output as at least a part of its next input, more preferably a recurrent neural network (RNN), and even more preferably a recurrent neural network with long short-term memory architecture (LSTM), to generate a sequence of caption words for each of the at least one image, wherein the sequence of caption words make up the scene textual description for the at least one image.
9. The computer-implemented method of any of the preceding claims, wherein the at least one evaluation metric comprises one or more of traffic light compliance, traffic sign compliance, or lane compliance.
10. The computer-implemented method of any of the preceding claims, wherein evaluating at least one evaluation metric comprises: performing entity recognition on the scene textual description to identify one or more entities mentioned in the scene textual description; and assigning a value to each of the one or more identified entities based on the at least one sensor data, wherein the at least one sensor data is predefined for each of the one or more identified entities.
11. The computer-implemented method of claim 10, wherein assigning a value to each of the one or more identified entities comprises: identifying keywords in the scene textual description associated with each of the one or more identified entities based on a set of keywords predefined for each of the one or more identified entities; comparing the at least one sensor data predefined for each of the one or more identified entities against a threshold predefined for each keyword; and assigning a value to each of the one or more identified entities based on the comparison of the at least one predefined sensor data against the predefined threshold.
12. The computer-implemented method of any of the preceding claims, wherein the evaluation metrics are determined over a journey, preferably at fixed time intervals, and the evaluation metrics are aggregated to determine a journey score for each entity.
13. A system comprising at least one camera, at least one sensor, at least one processor and a memory that stores executable instructions for execution by the at least one processor, the executable instructions comprising instructions for performing a computer-implemented method according to any of the preceding claims.
14. A vehicle comprising the system of claim 13, wherein the at least one camera is mounted on the vehicle and positioned to capture at least one image of a scene surrounding the vehicle.
15. A computer program, a machine-readable storage medium, or a data carrier signal that comprises instructions, that upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of a computer-implemented method according to any one of claims 1 to 12.
GB2211974.7A 2022-08-17 2022-08-17 System and method for evaluation of the driving of a driver operating a vehicle Pending GB2621601A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB2211974.7A GB2621601A (en) 2022-08-17 2022-08-17 System and method for evaluation of the driving of a driver operating a vehicle
PCT/EP2023/068485 WO2024037776A1 (en) 2022-08-17 2023-07-05 System and method for evaluation of the driving of a driver operating a vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2211974.7A GB2621601A (en) 2022-08-17 2022-08-17 System and method for evaluation of the driving of a driver operating a vehicle

Publications (2)

Publication Number Publication Date
GB202211974D0 GB202211974D0 (en) 2022-09-28
GB2621601A true GB2621601A (en) 2024-02-21

Family

ID=84546372

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2211974.7A Pending GB2621601A (en) 2022-08-17 2022-08-17 System and method for evaluation of the driving of a driver operating a vehicle

Country Status (2)

Country Link
GB (1) GB2621601A (en)
WO (1) WO2024037776A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030125854A1 (en) * 2001-12-28 2003-07-03 Yoshiteru Kawasaki Vehicle information recording system
KR20130102263A (en) * 2012-03-07 2013-09-17 한국과학기술원 System for assigning automobile level and method thereof
US20170291611A1 (en) * 2016-04-06 2017-10-12 At&T Intellectual Property I, L.P. Methods and apparatus for vehicle operation analysis
US20180197025A1 (en) * 2015-12-29 2018-07-12 Thunder Power New Energy Vehicle Development Company Limited Platform for acquiring driver behavior data
KR20180110564A (en) * 2017-03-27 2018-10-10 고려대학교 산학협력단 Device for detecting offensive diriving

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IT1398073B1 (en) * 2010-02-19 2013-02-07 Teleparking S R L SYSTEM AND ESTIMATE METHOD OF THE DRIVING STYLE OF A MOTOR VEHICLE
US10460600B2 (en) * 2016-01-11 2019-10-29 NetraDyne, Inc. Driver behavior monitoring
JP2022074635A (en) * 2020-11-05 2022-05-18 パイオニア株式会社 Driving evaluation device, driving evaluation method, driving evaluation program, and recording medium

Also Published As

Publication number Publication date
GB202211974D0 (en) 2022-09-28
WO2024037776A1 (en) 2024-02-22

Similar Documents

Publication Publication Date Title
JP7281015B2 (en) Parametric top view representation of complex road scenes
Kondapally et al. Towards a Transitional Weather Scene Recognition Approach for Autonomous Vehicles
US11654934B2 (en) Methods and systems for diversity-aware vehicle motion prediction via latent semantic sampling
CN113283458A (en) Training a generator neural network using discriminators with locally differentiated information
CN112200129A (en) Three-dimensional target detection method and device based on deep learning and terminal equipment
He et al. Urban rail transit obstacle detection based on Improved R-CNN
Kolekar et al. Behavior prediction of traffic actors for intelligent vehicle using artificial intelligence techniques: A review
Liu et al. Deep transfer learning for intelligent vehicle perception: A survey
US20220230418A1 (en) Computer-implemented method for training a computer vision model
Li et al. Data generation for connected and automated vehicle tests using deep learning models
Shatnawi et al. An enhanced model for detecting and classifying emergency vehicles using a generative adversarial network (GAN)
Lee et al. EMOS: Enhanced moving object detection and classification via sensor fusion and noise filtering
Manasa et al. Differential evolution evolved RBFNN based automated recognition of traffic sign images
GB2621601A (en) System and method for evaluation of the driving of a driver operating a vehicle
Schennings Deep convolutional neural networks for real-time single frame monocular depth estimation
KR102236792B1 (en) Method and system for learning lane changeable time based on camera and method and system for predicting time lane changeable time
Vishwakarma et al. Design and Augmentation of a Deep Learning Based Vehicle Detection Model for Low Light Intensity Conditions
Le et al. DifFUSER: Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation
GB2623296A (en) System and method for evaluation of the driving of a vehicle
Hadi et al. Semantic instance segmentation in a 3D traffic scene reconstruction task
Zhang et al. Research on real-time detection algorithm for pedestrian and vehicle in foggy weather based on lightweight XM-YOLOViT
Nobis Autonomous driving: Radar sensor noise filtering and multimodal sensor fusion for object detection with artificial neural net-works
US20240135721A1 (en) Adversarial object-aware neural scene rendering for 3d object detection
Wang et al. Fast vehicle detection based on colored point cloud with bird’s eye view representation
Pahadia SKOPE3D: A Synthetic Keypoint Perception 3D Dataset for Vehicle Pose Estimation