EP3977364A1

EP3977364A1 - Method and processing device for training a neural network

Info

Publication number: EP3977364A1
Application number: EP20813093.0A
Authority: EP
Inventors: Maryam ZIAEEFARD; Simon CORBEIL-LETOURNEAU; David Beach; Freddy Lecue; Florian MARTET
Original assignee: Thales Canada Inc
Current assignee: Thales Canada Inc
Priority date: 2019-05-31
Filing date: 2020-05-28
Publication date: 2022-04-06
Also published as: WO2020240477A1; EP3977364A4; CA3137030A1

Abstract

A method and a processing device are disclosed for training a neural network comprising obtaining a neural network to train, generating a training dataset, the generating comprising obtaining a segmented dataset comprising a plurality of multimodal data, providing an uncertainty map for each segmented multimodal data of the segmented dataset, the uncertainty map providing an indication of a performance of a corresponding segmentation, and combining each segmented multimodal data of the segmented dataset with a corresponding uncertainty map so as to provide the training dataset, the training dataset comprising a plurality of multimodal data wherein each multimodal data is segmented using the uncertainty map, training the neural network using the training dataset and providing the trained neural network.

Description

METHOD AND PROCESSING DEVICE FOR TRAINING A NEURAL NETWORK

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority to U.S. Provisional Patent Application No. 62/855,340 filed on May 31, 2019, the specification of which is hereby incorporated by reference.

FIELD

The present technology pertains to artificial intelligence and machine learning algorithms in general. More precisely, the present technology pertains to a method and system for training a neural network by estimating and using uncertainty.

BACKGROUND

Artificial Neural Networks (ANN) have demonstrated disruptive results in numerous industries, including predictive healthcare [3, 21]. However, one of the most fundamental problems of applying ANN in critical systems, such as self-driving vehicles, is to find a rigorous methodology giving guarantees of the system performance under (un)known limits and circumstances.

The guarantees must be built on a solid understanding of the properties of the system and a clear idea of requirements. In this context, the fundamental question arises as to how stable the ANN model is when facing new examples. Inspired by the rich literature on nonlinear dynamical systems and their stability under perturbations [26, 4, 18], a complex ANN system can be studied and characterized by a list of properties, e.g., stability and robustness.

The robustness of a system is a measure of the confidence that the claimed properties of the system will stay as expected when facing examples or perturbations never seen but coming from a distribution already explored during the training process. In a perception task, such as semantic segmentation, noise and stochasticity are unavoidable and a certain rate of failure is expected. Typically, a semantic segmentation model is forced to make a decision among a limited number of classes. In some cases, the decision should not be taken because the system is forced to choose among almost equally probable options. In such circumstances, the answer of the system should be "I don't know". The algorithm's self-awareness of its limits is important in making decisions. In other words, knowing when a decision must be avoided is preferable to making a random choice. One technique to quantify the ambiguity in decisions is through assessing the level of uncertainty of the system.

Some of the recent works on semantic segmentation exploiting uncertainty estimation as well as those assessing the robustness for vision tasks are disclosed below.

Kendall and Gal [15] present a Bayesian deep learning framework combining input-dependent aleatoric uncertainty together with epistemic uncertainty. They study models under the framework with per-pixel semantic segmentation and depth regression tasks. DeVries and Taylor [6] use uncertainty estimation as a representation for generating segmentation quality predictions in the clinical setting. They evaluate four uncertainty estimation methods, i.e., the maximum softmax probability, MC dropout, aleatoric uncertainty, and Learned Confidence Estimates [5]. Mukhoti and Gal [20] propose metrics to compare uncertainty obtained by different methods. They exploit two uncertainty methods, the predictive entropy and the mutual information between the predictive distribution and the posterior over network weights. They apply their metrics on semantic segmentation datasets for autonomous driving. Huang et al. [11] introduce a region-based temporal aggregation method which leverages the temporal information in videos to simulate the MC dropout sampling procedure. Kendall et al. [14] present Bayesian SegNet to predict pixel-wise class labels with a measure of model uncertainty by Monte Carlo sampling. Gast and Roth [8] propagate the uncertainty through the network. It will be appreciated that they rely on probabilistic output layers, replacing predictions from deterministic networks with distributions over the output and intermediate activations.

One technique to evaluate the robustness of semantic segmentation algorithms in the literature is to evaluate the performance of algorithms on a variety of datasets with different characteristics. Kreso et al. [17] propose a DenseNet-based ladder-style architecture for semantic segmentation trained on three different self-driving car datasets. Bulo et al. [22] present an in-place activated batch normalization method to reduce the training memory of deep neural networks and apply their method on multiple object detection and self-driving car datasets. The authors of [16] introduce pixel-wise attentional gating, which learns to selectively process a subset of spatial locations at each layer of a deep convolutional network, and assess their method on indoor and outdoor scene datasets. Meletis and Dubbelman [19] train a hierarchical model on multiple datasets with different classes and annotation types for per-pixel semantic segmentation. In another track of research on the robustness of semantic segmentation algorithms, researchers evaluate the robustness of models to adversarial attacks [27, 23, 9]. For example, Arnab et al. [1] evaluate the effects of different attacks on the network with multiscale processing and input transformations.

There is a need for at least one of a method and a system that will overcome at least one drawback of the prior art.

BRIEF SUMMARY

In accordance with a broad aspect of the present technology, there is provided a method for training a neural network to obtain a robust neural network, the method being executed by a processor, the method comprising: obtaining a neural network, generating a training dataset, the generating comprising: obtaining a segmented dataset comprising a plurality of multimodal data, obtaining an uncertainty map for each segmented multimodal data of the segmented dataset, the uncertainty map being indicative of a performance of a corresponding segmentation of the neural network, obtaining a segmentation error map for each segmented multimodal data of the segmented dataset, and combining each segmented multimodal data of the segmented dataset with a corresponding uncertainty map and a corresponding segmentation error map so as to provide the training dataset, the training dataset comprising a plurality of multimodal data wherein each multimodal data is segmented using the corresponding uncertainty map and the corresponding segmentation error map, training the neural network using the training dataset to thereby obtain a trained neural network, and providing the trained neural network.

In one or more embodiments of the method, the training of the neural network using the training dataset to thereby obtain the trained neural network is a second training, and the method further comprises prior to said obtaining of the segmented dataset: obtaining the plurality of multimodal data, each multimodal data comprising a respective segmentation target, and first training the neural network to perform segmentation on the plurality of multimodal data using the respective segmentation targets to obtain the segmented dataset comprising the plurality of segmented multimodal data.

In one or more embodiments of the method, the method further comprises: generating the uncertainty map using the respective segmentation targets.

In one or more embodiments of the method, said first training comprises obtaining a first set of weights of the neural network, and said second training comprises obtaining a second set of weights of the neural network to obtain the trained neural network.

In one or more embodiments of the method, the method further comprises generating the segmentation error map for each segmented multimodal data using the performed segmentation with the first set of weights and the respective segmentation target.
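As a minimal illustration of such a segmentation error map, the sketch below marks every pixel whose predicted class differs from the respective segmentation target. The function name `segmentation_error_map` and the binary (0/1) encoding are assumptions made for illustration only; the present description does not prescribe a particular encoding.

```python
import numpy as np

def segmentation_error_map(prediction, target):
    """Return a binary map: 1 where the predicted class differs from the target.

    prediction, target: integer class-label arrays of identical shape (H, W).
    The 0/1 encoding is an illustrative assumption.
    """
    prediction = np.asarray(prediction)
    target = np.asarray(target)
    if prediction.shape != target.shape:
        raise ValueError("prediction and target must have the same shape")
    return (prediction != target).astype(np.uint8)

# Toy 2x2 example: only the bottom-left pixel is misclassified.
pred = np.array([[0, 1], [2, 2]])
tgt = np.array([[0, 1], [1, 2]])
error_map = segmentation_error_map(pred, tgt)
```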

In one or more embodiments of the method, said second training of the neural network using the training dataset to thereby obtain the trained neural network comprises: using an attentional kernel on each multimodal data, the attentional kernel being indicative of a mismatch between the corresponding uncertainty map and the corresponding segmentation error map.

In one or more embodiments of the method, said using the attentional kernel comprises using a mismatch region between the corresponding uncertainty map and the corresponding segmentation error map of the multimodal data as a weight.

In one or more embodiments of the method, said using of the mismatch region is in response to determining the mismatch region between the corresponding uncertainty map and the corresponding segmentation error map based on a threshold.
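One possible reading of the threshold-based mismatch region can be sketched as follows. In this sketch a pixel is treated as a mismatch when it is either confident but wrong or uncertain but right, and the mismatch region is used to scale a per-pixel cross-entropy loss. The exclusive-or definition of the mismatch, the default `threshold` value, the `boost` factor and both function names are assumptions introduced for illustration; they are not taken from the present description.

```python
import numpy as np

def mismatch_weights(uncertainty, error_map, threshold=0.5, boost=1.0):
    """Per-pixel weights emphasising uncertainty/error mismatches.

    Assumption: a pixel belongs to the mismatch region when exactly one of
    "uncertainty above threshold" and "segmentation error present" holds.
    """
    uncertain = np.asarray(uncertainty) > threshold
    wrong = np.asarray(error_map) > 0
    mismatch = np.logical_xor(uncertain, wrong)
    return 1.0 + boost * mismatch.astype(np.float64)

def weighted_cross_entropy(probs, target, weights, eps=1e-12):
    """Mean per-pixel cross-entropy scaled by the given weights.

    probs: (H, W, C) class probabilities; target: (H, W) integer labels.
    """
    rows, cols = np.indices(target.shape)
    picked = probs[rows, cols, target]  # probability assigned to the true class
    return float(np.mean(weights * -np.log(picked + eps)))

# Toy example: top-right pixel is confident but wrong,
# bottom-left pixel is uncertain but right.
u = np.array([[0.9, 0.1], [0.9, 0.1]])
e = np.array([[1, 1], [0, 0]])
w = mismatch_weights(u, e)
```

Weighting the loss in this way makes the second training concentrate on pixels where the uncertainty estimate and the actual error disagree, which is one way to interpret using the mismatch region "as a weight".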

In one or more embodiments of the method, said generating of the uncertainty map comprises using Bayesian predictive entropy.

In one or more embodiments of the method, said uncertainty map is a calibrated uncertainty map.

In one or more embodiments of the method, the multimodal data comprises image data, and the neural network comprises a semantic segmentation network.

In one or more embodiments of the method, the method further comprises: obtaining further multimodal data, the neural network not having been trained on the further multimodal data, and providing, using the trained neural network, a segmentation prediction for the further multimodal data.

In accordance with a broad aspect of the present technology, there is provided a method for estimating a robustness of a neural network in generating segmentation predictions, the method being executed by a processor, the method comprising: obtaining the neural network, obtaining multimodal data associated with a target label, generating, by the neural network, a segmentation prediction for the multimodal data, determining, using the target label and the segmentation prediction, a prediction error, obtaining an uncertainty associated with the neural network, obtaining an attentive network, determining, by the attentive network, using the prediction error and the uncertainty parameter, an attentive uncertainty indicative of a robustness of the neural network in generating segmentation predictions.

In one or more embodiments of the method, the obtaining of the uncertainty comprises obtaining an uncertainty map, and the obtaining of the prediction error comprises obtaining a prediction error map.

In one or more embodiments of the method, said obtaining of the uncertainty map comprises generating the uncertainty map using Bayesian predictive entropy.

In one or more embodiments of the method, said obtaining of the uncertainty map comprises generating the uncertainty map using an entropy of a softmax distribution.
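The two entropy-based uncertainty maps mentioned above can be sketched as follows. The sketch assumes that the Bayesian predictive entropy is approximated by averaging the softmax outputs of several stochastic forward passes (for example Monte Carlo dropout, as in the works cited in the background) and then taking the entropy of the averaged distribution; the entropy of a softmax distribution is the single-pass special case. The function names and the `(T, H, W, C)` array layout are assumptions for illustration.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the class axis."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def entropy(probs, axis=-1, eps=1e-12):
    """Shannon entropy -sum_c p_c log p_c along the class axis."""
    return -np.sum(probs * np.log(probs + eps), axis=axis)

def softmax_entropy_map(logits):
    """Uncertainty map from a single forward pass; logits: (H, W, C)."""
    return entropy(softmax(logits))

def predictive_entropy_map(mc_logits):
    """Uncertainty map from T stochastic passes; mc_logits: (T, H, W, C).

    Assumption: the predictive distribution is the mean of the T softmax outputs.
    """
    mean_probs = softmax(mc_logits).mean(axis=0)
    return entropy(mean_probs)

# Uniform logits over 4 classes give the maximum entropy, log(4), at every pixel.
uniform = np.zeros((3, 2, 2, 4))
u_map = predictive_entropy_map(uniform)
```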

In one or more embodiments of the method, the attentive uncertainty is simultaneously indicative of an uncertainty of the neural network in generating predictions and an error of the neural network in generating predictions.

In one or more embodiments of the method, the method further comprises, prior to said determining of the attentive uncertainty: training the attentive network to determine attentive uncertainties using a cross-entropy loss function.

In one or more embodiments of the method, the attentive network comprises an attention layer, and at least one fully-connected layer for processing the uncertainty map and the prediction error map to obtain the attentive uncertainty map.

In one or more embodiments of the method, the determining of the attentive uncertainty map comprises determining, by the attention layer, an attention map by masking pixels having a prediction error above zero.

In one or more embodiments of the method, the determining of the attentive uncertainty comprises using a sigmoid activation function on the attention map to obtain the attentive uncertainty.
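The attention and sigmoid steps described above can be sketched as follows, under stated assumptions. Here "masking" is interpreted as keeping the uncertainty values only at pixels whose prediction error is above zero, and the at least one fully-connected layer is reduced to a hypothetical scalar `weight` and `bias` that would normally be learned. Both the interpretation and the parameter names are assumptions; the present description does not fix them.

```python
import numpy as np

def sigmoid(x):
    """Logistic function mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def attentive_uncertainty(uncertainty, error_map, weight=1.0, bias=0.0):
    """Combine an uncertainty map and a prediction error map.

    Assumption: the attention map retains uncertainty only where the
    prediction error is above zero; `weight` and `bias` are hypothetical
    stand-ins for learned fully-connected parameters.
    """
    uncertainty = np.asarray(uncertainty, dtype=np.float64)
    mask = (np.asarray(error_map) > 0).astype(np.float64)
    attention = uncertainty * mask
    return sigmoid(weight * attention + bias)

# Toy example: one correct pixel (error 0) and one erroneous pixel (error 1).
att = attentive_uncertainty([[0.2, 0.8]], [[0, 1]])
```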

In accordance with a broad aspect of the present technology, there is provided a processing device comprising: a processor, and a non-transitory storage medium connected to the processor, the non-transitory storage medium comprising computer-executable instructions, the processor, upon executing the computer-executable instructions, is configured for: obtaining a neural network, generating a training dataset, the generating comprising: obtaining a segmented dataset comprising a plurality of multimodal data, obtaining an uncertainty map for each segmented multimodal data of the segmented dataset, the uncertainty map being indicative of a performance of a corresponding segmentation of the neural network, obtaining a segmentation error map for each segmented multimodal data of the segmented dataset, and combining each segmented multimodal data of the segmented dataset with a corresponding uncertainty map and a corresponding segmentation error map so as to provide the training dataset, the training dataset comprising a plurality of multimodal data wherein each multimodal data is segmented using the corresponding uncertainty map and the corresponding segmentation error map, training the neural network using the training dataset to thereby obtain a trained neural network, and providing the trained neural network.

In one or more embodiments of the processing device, said training of the neural network using the training dataset to thereby obtain the trained neural network is a second training, and the processing device is further configured for, prior to said obtaining of the segmented dataset: obtaining the plurality of multimodal data, each multimodal data comprising a respective segmentation target, and first training the neural network to perform segmentation on the plurality of multimodal data using the respective segmentation targets to obtain the segmented dataset comprising the plurality of segmented multimodal data.

In one or more embodiments of the processing device, the processor is further configured for: generating the uncertainty map using the respective segmentation targets.

In one or more embodiments of the processing device, said first training comprises obtaining a first set of weights of the neural network, and said second training comprises obtaining a second set of weights of the neural network to obtain the trained neural network.

In one or more embodiments of the processing device, the processor is further configured for generating the segmentation error map for each segmented multimodal data using the performed segmentation with the first set of weights and the respective segmentation target.

In one or more embodiments of the processing device, said second training of the neural network using the training dataset to thereby obtain the trained neural network comprises: using an attentional kernel on each multimodal data, the attentional kernel being indicative of a mismatch between the corresponding uncertainty map and the corresponding segmentation error map.

In one or more embodiments of the processing device, said using the attentional kernel comprises using a mismatch region between the corresponding uncertainty map and the corresponding segmentation error map of the multimodal data as a weight.

In one or more embodiments of the processing device, said using of the mismatch region is in response to determining the mismatch region between the corresponding uncertainty map and the corresponding segmentation error map based on a threshold.

In one or more embodiments of the processing device, said generating of the uncertainty map comprises using Bayesian predictive entropy.

In one or more embodiments of the processing device, said uncertainty map is a calibrated uncertainty map.

In one or more embodiments of the processing device, the multimodal data comprises image data, and the neural network comprises a semantic segmentation network.

In one or more embodiments of the processing device, the processor is further configured for: obtaining further multimodal data, the neural network not having been trained on the further multimodal data, and providing, using the trained neural network, a segmentation prediction for the further multimodal data.

In accordance with a broad aspect of the present technology, there is provided a processing device comprising: a processor, and a non-transitory storage medium connected to the processor, the non-transitory storage medium comprising computer-executable instructions, the processor, upon executing the computer-executable instructions, is configured for: obtaining the neural network, obtaining multimodal data associated with a target label, generating, by the neural network, a segmentation prediction for the multimodal data, determining, using the target label and the segmentation prediction, a prediction error, obtaining an uncertainty associated with the neural network, obtaining an attentive network, determining, by the attentive network, using the prediction error and the uncertainty parameter, an attentive uncertainty indicative of a robustness of the neural network in generating segmentation predictions.

In one or more embodiments of the processing device, the obtaining of the uncertainty comprises obtaining an uncertainty map, and the obtaining of the prediction error comprises obtaining a prediction error map.

In one or more embodiments of the processing device, said obtaining of the uncertainty map comprises generating the uncertainty map using Bayesian predictive entropy.

In one or more embodiments of the processing device, said obtaining of the uncertainty map comprises generating the uncertainty map using an entropy of a softmax distribution.

In one or more embodiments of the processing device, the attentive uncertainty is simultaneously indicative of an uncertainty of the neural network in generating predictions and an error of the neural network in generating predictions.

In one or more embodiments of the processing device, the processor is further configured for, prior to said determining of the attentive uncertainty: training the attentive network to determine attentive uncertainties using a cross-entropy loss function.

In one or more embodiments of the processing device, the attentive network comprises an attention layer, and at least one fully-connected layer for processing the uncertainty map and the prediction error map to obtain the attentive uncertainty map.

In one or more embodiments of the processing device, the determining of the attentive uncertainty map comprises determining, by the attention layer, an attention map by masking pixels having a prediction error above zero.

In one or more embodiments of the processing device, the determining of the attentive uncertainty comprises using a sigmoid activation function on the attention map to obtain the attentive uncertainty.

In accordance with a broad aspect of the present technology, there is provided a non-transitory computer readable storage medium for storing computer-executable instructions which, when executed, cause a computer to perform: obtaining a neural network, generating a training dataset, the generating comprising: obtaining a segmented dataset comprising a plurality of multimodal data, obtaining an uncertainty map for each segmented multimodal data of the segmented dataset, the uncertainty map being indicative of a performance of a corresponding segmentation of the neural network, obtaining a segmentation error map for each segmented multimodal data of the segmented dataset, and combining each segmented multimodal data of the segmented dataset with a corresponding uncertainty map and a corresponding segmentation error map so as to provide the training dataset, the training dataset comprising a plurality of multimodal data wherein each multimodal data is segmented using the corresponding uncertainty map and the corresponding segmentation error map, training the neural network using the training dataset to thereby obtain a trained neural network, and providing the trained neural network.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a flowchart which shows an embodiment of a method for building a robust machine learning system. The method comprises, inter alia, generating a training dataset.

Figure 2 is a flowchart which shows an embodiment for generating a training dataset.

Figure 3 is a diagram which shows an embodiment of a processing device which may be used for building a robust machine learning system.

Figure 4 is a schematic diagram of a communication system which may be used for building a robust machine learning system.

Figure 5 is a diagram which shows qualitative results on CamVid dataset for an embodiment of a method.

DETAILED DESCRIPTION

In the following description of the embodiments, references to the accompanying drawings are by way of illustration of an example by which the present technology may be practiced.

Terms

The terms "an aspect," "an embodiment," "embodiment," "embodiments," "the embodiment," "the embodiments," "one or more embodiments," "some embodiments," "certain embodiments," "one embodiment," "another embodiment" and the like mean "one or more (but not all) non-limiting embodiments of the present technology" unless expressly specified otherwise.

A reference to "another embodiment" or “another aspect” in describing an embodiment does not imply that the referenced embodiment is mutually exclusive with another embodiment (e.g., an embodiment described before the referenced embodiment), unless expressly specified otherwise.

The terms "including,” "comprising" and variations thereof mean "including but not limited to,” unless expressly specified otherwise.

The terms "a,” "an" and "the" mean "one or more,” unless expressly specified otherwise.

The term "plurality" means "two or more,” unless expressly specified otherwise.

The term "herein" means "in the present application, including anything which may be incorporated by reference,” unless expressly specified otherwise.

The term "whereby" is used herein only to precede a clause or other set of words that express only the intended result, objective or consequence of something that is previously and explicitly recited. Thus, when the term "whereby" is used in a claim, the clause or other words that the term "whereby" modifies do not establish specific further limitations of the claim or otherwise restrict the meaning or scope of the claim.

The term "e.g." and like terms mean "for example,” and thus do not limit the terms or phrases they explain. For example, in a sentence "the computer sends data (e.g., instructions, a data structure) over the Internet,” the term "e.g." explains that "instructions" are an example of "data" that the computer may send over the Internet, and also explains that "a data structure" is an example of "data" that the computer may send over the Internet. However, both "instructions" and "a data structure" are merely examples of "data” and other things besides "instructions" and "a data structure" can be "data.”

The term "i.e." and like terms mean "that is,” and thus limit the terms or phrases they explain.

Neither the Title nor the Abstract is to be taken as limiting in any way the scope of the present technology. The title of the present application and headings of sections provided in the present application are for convenience only, and are not to be taken as limiting the disclosure in any way.

Numerous embodiments are described in the present application, and are presented for illustrative purposes only. The described embodiments are not, and are not intended to be, limiting in any sense. The present technology is widely applicable to numerous embodiments, as is readily apparent from the disclosure. One of ordinary skill in the art will recognize that the disclosed technology or technologies may be practiced with various modifications and alterations, such as structural and logical modifications. Although particular features of the disclosed technology or technologies may be described with reference to one or more particular embodiments and/or drawings, it should be understood that such features are not limited to usage in the one or more particular embodiments or drawings with reference to which they are described, unless expressly specified otherwise.

With all this in mind, the present technology is directed to a method and a system for building a robust machine learning system.

It will be appreciated that the method disclosed herein may be implemented according to various embodiments.

Processing Device

More precisely and now referring to Fig. 3, there is shown a processing device 300 which may be used to implement one or more embodiments of the present technology. In fact, it will be appreciated that the processing device 300 may also be referred to as a computer.

In one or more embodiments, the processing device 300 is selected from a group consisting of desktop computers, laptop computers, tablet PCs, servers, smartphones, etc. In the embodiment shown in Fig. 3, the processing device 300 comprises a central processing unit (CPU) 302, also referred to as a processor, a graphics processing unit (GPU) 316, input/output devices 304, a display device 306, communication ports 308, a data bus 310 and a memory unit 312.

The central processing unit 302 is used for processing computer instructions. The skilled addressee will appreciate that various embodiments of the central processing unit 302 may be provided. As a non-limiting example, the central processing unit 302 may be implemented as one or more of various processing means such as a microprocessor, a controller, a digital signal processor (DSP), a processing device with or without an accompanying DSP, or various other processing devices including integrated circuits such as an ASIC (application specific integrated circuit), an FPGA (field programmable gate array), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, processing circuitry, or the like. As a non-limiting example, the central processing unit 302 may be configured to execute instructions stored in the memory unit 318 or to execute hard coded functionality in response to receiving data or control signals from other modules via the data bus 310. As such, whether configured by hardware or software methods, or by a combination thereof, the central processing unit 302 may represent an entity (for example, physically embodied in circuitry) capable of performing algorithms and/or operations described herein when corresponding instructions are executed.

The graphics processing unit 316 is used for processing specific computer instructions. It will be appreciated that a memory unit 318 is operatively connected to the graphics processing unit 316.

The input/output devices 304 are used for inputting/outputting data into the processing device 300.

The display device 306 is used for displaying data to a user. The skilled addressee will appreciate that various types of display device 306 may be used. In one or more embodiments, the optional display device 306 is a standard liquid crystal display (LCD) monitor.

The communication port 308 is used for operatively connecting the processing device 300 to various processing devices.

The communication port 308 may comprise, for instance, universal serial bus (USB) ports for connecting a keyboard and a mouse to the processing device 300.

The communication port 308 may further comprise a data network communication port such as an IEEE 802.3 port for enabling a connection of the processing device 300 with another processing device.

The communication port 308 may enable connecting the processing device 300 to a communication network (not depicted) such as the internet for receiving and transmitting data.

In this regard, the communication port 308 may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a communication network (not depicted). In some examples, the communication port 308 may support wired communication using a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.

The skilled addressee will appreciate that various alternative embodiments of the communication port 308 may be provided.

The memory unit 312 is a non-transitory storage medium used for storing computer-executable instructions. The computer-executable instructions may be executed as a non-limiting example by the central processing unit 302 and/or the graphics processing unit 316.

The memory unit 312 may include, for example, one or more volatile and/or non-volatile memories and caches with, e.g., different data storage sizes and speeds. The memory unit 312 may be configured to store information, data, applications, instructions or the like for enabling the processing device 300 to perform various functions in accordance with example aspects of this application.

The memory unit 312 may comprise a system memory such as a high-speed random access memory (RAM) for storing a system control program (e.g., BIOS, operating system module, applications, etc.) and a read-only memory (ROM). As a non-limiting example, the memory unit 312 may have 128 GB of DDR4 RAM.

It will be appreciated that the memory unit 312 comprises, in one or more embodiments, an operating system module 314.

It will be appreciated that the operating system module 314 may be of various types.

In one or more embodiments, the memory unit 318 comprises an application 320 for building a robust machine learning system such as a neural network.

In one or more embodiments, the memory unit 318 is operatively connected to the graphics processing unit 316 and has a suitable size of VRAM. The skilled addressee will appreciate that various alternative embodiments may be possible.

The memory unit 318 is further used for storing data 322. The memory unit 318 may include, for example, one or more volatile and/or non-volatile memories and caches with, e.g., different data storage sizes and speeds. The memory unit 318 may be configured to store information, data, applications, instructions or the like for enabling the processing device 300 to perform various functions in accordance with example aspects of this application.

The skilled addressee will appreciate that the data 322 may be of various types.

Communication System

With reference to Fig. 4, a communication system 400 will be described in accordance with one or more non-limiting embodiments of the present technology.

It will be appreciated that the communication system 400 as shown is merely an illustrative implementation of the present technology. Thus, the description thereof that follows is intended to be only a description of illustrative examples of the present technology. In some cases, what are believed to be helpful examples of modifications to the communication system 400 may also be set forth below. This is done merely as an aid to understanding, and, again, not to define the scope or set forth the bounds of the present technology. These modifications are not an exhaustive list, and, as a person skilled in the art will understand, other modifications are likely possible. Further, where this has not been done (i.e., where no examples of modifications have been set forth), it should not be interpreted that no modifications are possible and/or that what is described is the sole manner of implementing that element of the present technology. As a person skilled in the art will appreciate, this is likely not the case. In addition, it will be appreciated that the communication system 400 may provide in certain instances simple implementations of one or more embodiments of the present technology, and that where such is the case they have been presented in this manner as an aid to understanding.

The communication system 400 comprises inter alia the processing device 300, and a database 430, communicatively coupled over a communications network 440 via respective communication links 445.

The processing device 300 is configured to: (i) access the set of MLAs 450; (ii) receive training datasets, validation datasets and testing datasets; (iii) train one or more of the set of MLAs 450 to perform segmentation on one or more of the training datasets, validation datasets and testing datasets; (iv) estimate uncertainty and segmentation errors of one or more of the set of MLAs 450; (v) train the set of MLAs 450 based on the estimated uncertainty and segmentation errors; (vi) output a trained set of MLAs 450; and (vii) use the set of MLAs 450 to perform segmentation. How the processing device 300 is configured to do so will be explained in more detail herein below.

It will be appreciated that in the embodiment illustrated in Fig. 4, the processing device 300 can be implemented as a conventional computer server. In a non-limiting example of one or more embodiments of the present technology, the processing device 300 is implemented as a server running an operating system (OS). Needless to say, the processing device 300 may be implemented in any suitable hardware and/or software and/or firmware or a combination thereof. In the disclosed non-limiting embodiment of the present technology, the processing device 300 is a single server. In one or more alternative non-limiting embodiments of the present technology, the functionality of the processing device 300 may be distributed and may be implemented via multiple servers (not shown).

Machine Learning Algorithms (MLAs)

The set of MLAs 450 comprises inter alia one or more neural networks 460, and an attentive network 470.

Neural Network

The one or more neural networks 460, which will be referred to as the neural network 460, is configured to: (i) obtain multimodal data; (ii) extract features from the multimodal data; and (iii) generate, based on the extracted features, a segmentation prediction.

In one or more embodiments, the multimodal data comprises image data, and the neural network 460 is configured to perform segmentation on the image data by predicting class labels for one or more pixels in the image data, such that each of the one or more pixels is labelled with the predicted class of its enclosing object; this labelling will be referred to as a segmentation prediction. It will be appreciated that in one or more alternative embodiments, the multimodal data comprises at least one of image data, video data, audio data, and signal data.

In one or more alternative embodiments, the neural network 460 is configured to provide a prediction score for segmentation predictions, where the prediction score is indicative of a confidence of the neural network 460 in performing the prediction.

The neural network 460 comprises a plurality of layers. The plurality of layers may comprise inter alia one or more fully connected layers, pooling layers, and convolutional layers.

It will be appreciated that the neural network 460 may be of various types and that the neural network 460 may have different layers. In one or more embodiments, the neural network 460 is a deep neural network.

In one or more embodiments, the neural network 460 is implemented as a semantic segmentation network. Non-limiting examples of semantic segmentation networks include DeepLabv3+, Tiramisu, SegNet, U-Net, Feature Pyramid Network (FPN), Pyramid Scene Parsing Network (PSPNet), Mask R-CNN, path aggregation network (PPANet), and Context Encoding Network (EncNet).

In one or more embodiments, the neural network 460 comprises or is parametrized by model parameters and hyperparameters.

The model parameters are configuration variables of the neural network used to perform predictions and which are estimated or learned from training data, i.e. the coefficients are chosen during learning based on an optimization strategy for outputting the prediction.

In one or more embodiments, the model parameters of the neural network 460 include weights and biases. It will be appreciated that the type and number of model parameters depend on the architecture and on how the neural network 460 is implemented. The hyperparameters are configuration variables of the neural network 460 which determine the structure of the neural network 460 and how the neural network 460 is trained.

In one or more embodiments, the hyperparameters include one or more of: a number of hidden layers and units, an optimization algorithm, a learning rate, momentum, an activation function, a minibatch size, a number of epochs, and dropout.

In one or more embodiments, the hyperparameters may be initialized using one or more of a manual search, a grid search, a random search and Bayesian optimization.
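As a non-limiting illustration of the grid search option, the sketch below enumerates every combination of a small search space and keeps the best-scoring one. The search space values and the function `toy_validation_score` are hypothetical stand-ins, introduced here only for illustration, for training the neural network 460 and scoring it on a validation dataset.

```python
import itertools

def grid_search(search_space, score_fn):
    """Return the hyperparameter combination with the highest score."""
    names = sorted(search_space)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space (assumption for illustration only).
search_space = {"learning_rate": [1e-2, 1e-3, 1e-4], "minibatch_size": [2, 4]}

def toy_validation_score(params):
    # Stand-in for "train the network, then evaluate it on the validation set".
    return (-abs(params["learning_rate"] - 1e-3)
            - 0.01 * abs(params["minibatch_size"] - 4))

best_params, best_score = grid_search(search_space, toy_validation_score)
```

A random search would follow the same pattern, drawing a fixed number of combinations at random instead of enumerating all of them.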

During training (also known as learning), the neural network 460 is provided with a training dataset which comprises data as well as labels (also known as annotations or targets). The neural network 460 processes the input data and outputs a prediction, which is compared against the label using a loss function; the errors are propagated back and the model parameters are adjusted until convergence, i.e. until the model reaches a statistically desired point or accuracy.

It will be appreciated that the neural network 460 may be trained using standard techniques as well as the techniques which will be described in more detail below.

The uncertainty of the neural network 460 may be determined. Different types of uncertainty will now be described.

Types of Uncertainty

It will be appreciated that the uncertainty of a classification process may be obtained by several methods [10, 12, 13].

Epistemic Uncertainty: This type of uncertainty (also known as systematic uncertainty) is associated with the weights ŵ learned during training on the training data D_train. The uncertainty comes from a misrepresentation of the information hidden in the training data. This uncertainty may be reduced to an acceptable level given enough data. One technique to approximate the epistemic uncertainty is the variational dropout method introduced by [7] as an approximation of Bayesian inference. Dropout may be viewed as employing the Bernoulli distribution as the approximating distribution over w. At test time, the prediction is estimated by sampling the model T times, which is referred to as Monte Carlo (MC) dropout. The epistemic uncertainty is obtained by calculating the variance of the T sampled models, which is expressed using equation (1):

where x is a given test input and y^* is the output variable, and w_t are the model parameters on the t^th Monte Carlo sample.
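As a non-limiting illustration of the variance over T sampled models, the sketch below computes the per-class mean and variance of the softmax outputs obtained for one pixel over several stochastic forward passes. It assumes the T softmax vectors have already been produced with dropout kept active; the function name and the sample values are hypothetical.

```python
def mc_dropout_variance(sampled_probs):
    """Mean and per-class variance over T Monte Carlo dropout samples.

    sampled_probs: list of T softmax vectors for one pixel, one per
    stochastic forward pass (dropout kept active at test time).
    """
    num_samples = len(sampled_probs)
    num_classes = len(sampled_probs[0])
    mean = [sum(p[c] for p in sampled_probs) / num_samples
            for c in range(num_classes)]
    var = [sum((p[c] - mean[c]) ** 2 for p in sampled_probs) / num_samples
           for c in range(num_classes)]
    return mean, var

# Two hypothetical forward passes that disagree slightly on a 2-class pixel.
mean, var = mc_dropout_variance([[0.9, 0.1], [0.7, 0.3]])
```

The larger the disagreement between the sampled models, the larger the variance, and hence the larger the estimated epistemic uncertainty for that pixel.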

Aleatoric Uncertainty: Aleatoric uncertainty captures the uncertainty with respect to information which the data cannot explain. For example, aleatoric uncertainty in images can be attributed to occlusions, lack of visual features or over-exposed regions of an image. Providing more training examples does not decrease this type of uncertainty. Kendall and Gal [15] model aleatoric uncertainty by placing a distribution over the logit space, i.e. the network outputs, and corrupting the logits with Gaussian noise. The corrupted vector is then squashed with the softmax function to obtain the probability vector for each pixel. The uncertainty is approximated through Monte Carlo integration, by passing sampled logits through the softmax function.
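As a non-limiting illustration of the logit corruption described above, the sketch below adds Gaussian noise to the logits of one pixel, passes each noisy sample through the softmax function and averages the results. It assumes a single scalar noise standard deviation `sigma` per pixel; the function names, the seed and the sample values are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax of a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]

def sample_noisy_softmax(logits, sigma, num_samples, seed=0):
    """Monte Carlo estimate of the expected softmax under Gaussian logit noise.

    logits: predicted logit vector for one pixel.
    sigma: noise standard deviation for that pixel (assumed scalar).
    """
    rng = random.Random(seed)
    acc = [0.0] * len(logits)
    for _ in range(num_samples):
        noisy = [z + rng.gauss(0.0, sigma) for z in logits]
        for c, p in enumerate(softmax(noisy)):
            acc[c] += p
    return [v / num_samples for v in acc]

clean = sample_noisy_softmax([2.0, 0.0], sigma=0.0, num_samples=10)
noisy = sample_noisy_softmax([2.0, 0.0], sigma=3.0, num_samples=2000)
```

With `sigma` set to zero the result reduces to the ordinary softmax; a larger `sigma` spreads the averaged probability more evenly across classes, which reflects a noisier pixel.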

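Predictive Entropy: Aleatoric uncertainty and epistemic uncertainty may be used to estimate predictive uncertainty [24], i.e. the confidence in a prediction. The predictive entropy given a test input x and the training data D_train may be approximated as:

$$\hat{H}[y \mid x, D_{train}] = -\sum_{c} \left( \frac{1}{T} \sum_{t} p(y = c \mid x, w_t) \right) \log \left( \frac{1}{T} \sum_{t} p(y = c \mid x, w_t) \right)$$

where c ranges over all the classes, T is the number of Monte Carlo samples, and p(y = c | x, w_t) is the softmax probability of input x being in class c.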
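As a non-limiting illustration, the sketch below computes the predictive entropy of one pixel as the entropy of the softmax probabilities averaged over T Monte Carlo samples. The function name, the small constant `eps` and the sample values are assumptions made for the example.

```python
import math

def predictive_entropy(sampled_probs, eps=1e-12):
    """Entropy of the mean softmax vector over T Monte Carlo samples.

    sampled_probs: list of T softmax vectors for one pixel.
    eps: small constant (an assumption) guarding against log(0).
    """
    num_samples = len(sampled_probs)
    num_classes = len(sampled_probs[0])
    mean = [sum(p[c] for p in sampled_probs) / num_samples
            for c in range(num_classes)]
    return -sum(p * math.log(p + eps) for p in mean)

# Samples that agree give an entropy near zero; samples that disagree
# completely on two classes give the maximum entropy for two classes.
confident = predictive_entropy([[1.0, 0.0], [1.0, 0.0]])
uncertain = predictive_entropy([[1.0, 0.0], [0.0, 1.0]])
```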

In the context of the present technology, the uncertainty of the neural network 460 in a segmentation prediction may be obtained. It will be appreciated that the uncertainty may be obtained using different techniques. The uncertainty is indicative of a confidence of the model in the prediction, which in turn is indicative of a performance of the model in the prediction.

In one or more embodiments, the processing device 300 may determine and output an uncertainty map based on the segmentation prediction of the neural network 460. An uncertainty map is a spatial map which enables determining when a model is likely to make an incorrect prediction and/or enables determining when input data is out of distribution.

The uncertainty map provides, for at least a portion of pixels in the segmentation prediction, a respective uncertainty. In one or more embodiments, the uncertainty map provides the respective uncertainty for each pixel in the segmentation prediction. It will be appreciated that the uncertainty may be determined for one or more pixels, groups of pixels, and images.

In one or more embodiments, the uncertainty map is generated using Bayesian predictive entropy, which captures both epistemic and aleatoric uncertainties, and which is indicative of uncertainty in the model and/or in the data. In one or more alternative embodiments, the uncertainty map may be generated using entropy of the softmax distribution.
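As a non-limiting illustration of the alternative based on the entropy of the softmax distribution, the sketch below builds a per-pixel uncertainty map from a single softmax output. Nested Python lists stand in for image tensors, at least two classes are assumed, and the division by log(C) that maps each value to the range [0, 1] is an assumption made for readability.

```python
import math

def softmax_entropy_map(prob_map, normalize=True):
    """Per-pixel entropy of a softmax output of shape H x W x C (nested lists).

    With normalize=True the entropy is divided by log(C) so that each value
    lies in [0, 1]; this normalization is an assumption of the sketch.
    """
    u_map = []
    for row in prob_map:
        u_row = []
        for probs in row:
            h = -sum(p * math.log(p) for p in probs if p > 0.0)
            if normalize:
                h /= math.log(len(probs))
            u_row.append(h)
        u_map.append(u_row)
    return u_map

# Hypothetical 1 x 2 image with 2 classes: a confident and an ambiguous pixel.
u_map = softmax_entropy_map([[[1.0, 0.0], [0.5, 0.5]]])
```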

Attentive Network

In one or more embodiments, the set of MLAs 450 includes the attentive network 470. In one or more alternative embodiments, the attentive network 470 may be part of the neural network 460.

The attentive network 470 is configured to inter alia: (i) obtain a segmentation prediction of the neural network 460, an uncertainty, and a target or label; and (ii) generate, using the segmentation prediction, the uncertainty, and the target, an attentive uncertainty. The attentive uncertainty may also be referred to as refined uncertainty.

It will be appreciated that in one or more embodiments of the present technology, the attentive uncertainty may be obtained in the form of an attentive uncertainty map.

In one or more embodiments, the processing device 300 determines a prediction error map (or error map), which indicates a difference between the segmentation prediction of the neural network 460, and the target. It will be appreciated that the prediction error may be determined for one or more pixels, groups of pixels, and images.
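As a non-limiting illustration, the sketch below derives a binary error map by comparing the most probable class of each pixel with its target label. Treating the error as a binary disagreement is an assumption of the example (a continuous distance between the softmax output and the target could be used instead), and the function name and values are hypothetical.

```python
def error_map(prob_map, target_map):
    """Binary error map: 1 where the predicted class differs from the target.

    prob_map: H x W x C nested lists of class probabilities.
    target_map: H x W nested lists of integer class labels.
    """
    e_map = []
    for prob_row, target_row in zip(prob_map, target_map):
        e_row = []
        for probs, target in zip(prob_row, target_row):
            # Index of the most probable class for this pixel.
            predicted = max(range(len(probs)), key=probs.__getitem__)
            e_row.append(int(predicted != target))
        e_map.append(e_row)
    return e_map

probs = [[[0.8, 0.2], [0.3, 0.7]],
         [[0.6, 0.4], [0.1, 0.9]]]
targets = [[0, 1],
           [1, 1]]
e_map = error_map(probs, targets)
```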

In one or more embodiments, the attentive network 470 determines an attentive uncertainty using the uncertainty map and the error map. In one or more embodiments, the attentive uncertainty is in the form of an attentive uncertainty map.

The attentive network 470 comprises or is parametrized by model parameters and hyperparameters.

The attentive network 470 comprises inter alia an attention layer and one or more fully connected layers, where the output of the final layer is passed through a sigmoid activation function to obtain the attentive uncertainty. In one or more embodiments, the attentive uncertainty has a value between zero (0) and one (1).

It will be appreciated that activation functions other than a sigmoid activation function may be used. The attentive uncertainty may be obtained for each predicted pixel in the segmentation prediction in the form of an attentive uncertainty map. In one or more alternative embodiments, the attentive uncertainty may be obtained for one or more pixels or groups of pixels in an image.

The attentive uncertainty is simultaneously indicative of a confidence of the neural network 460 in the segmentation prediction and an error of the neural network 460 in the segmentation prediction.

In one or more embodiments, the attentive uncertainty is expressed using equation (3):

where a is the attentive network 470 employed on top of the neural network 460, denoted s; y is the one-hot encoded ground-truth or target label that the attentive network 470 receives as input; p is the segmentation prediction output by the neural network 460; and u is the uncertainty of the neural network 460.

If the neural network 460 is confident in the segmentation prediction, i.e. the uncertainty is low, but fails in the prediction, i.e. if there is a prediction error for the input, the value of the attentive uncertainty for that input should be close to 1.

The attention layer of the attentive network 470 enables finding problematic regions by attending to all pixels where deviation of the prediction error from the uncertainty is above a given threshold.

In one or more embodiments, the output of the attention layer is computed using equation (4):

where i and j are the indices of pixels in the input image and e_ij denotes the distance between softmax outputs and ground-truth or target labels.

Different techniques may be used to determine the threshold th. In one or more embodiments, the threshold may be determined based on the average deviation of e from u over a validation set.
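As a non-limiting illustration, the sketch below sets the threshold th to the average deviation of e from u over a validation set and then marks the attended pixels. Measuring the deviation as the absolute difference |e - u| is an assumption of the example, and the function names and validation values are hypothetical.

```python
def mean_abs_deviation(e_maps, u_maps):
    """Average |e - u| over all pixels of a set of validation maps."""
    total, count = 0.0, 0
    for e_map, u_map in zip(e_maps, u_maps):
        for e_row, u_row in zip(e_map, u_map):
            for e, u in zip(e_row, u_row):
                total += abs(e - u)
                count += 1
    return total / count

def attention_mask(e_map, u_map, th):
    """1 where the error deviates from the uncertainty by more than th."""
    return [[int(abs(e - u) > th) for e, u in zip(e_row, u_row)]
            for e_row, u_row in zip(e_map, u_map)]

# Hypothetical validation set with one 2 x 2 image (values in [0, 1]).
val_e = [[[1, 0], [0, 0]]]
val_u = [[[0.1, 0.1], [0.2, 0.8]]]
th = mean_abs_deviation(val_e, val_u)
mask = attention_mask(val_e[0], val_u[0], th)
```

In this example the attended pixels are the confident error (e = 1, u = 0.1) and the uncertain correct prediction (e = 0, u = 0.8), i.e. the two pixels where the uncertainty disagrees with the error.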

The attentive network 470 is trained to generate the attentive uncertainty.

In one or more embodiments, the attentive network 470 is trained to generate an attentive uncertainty using an attentional loss function.

The attentional loss function is based on the assumption that the case where the prediction error e is true and the uncertainty u is high is the ideal case. If the prediction error e is true and the uncertainty u is low, the value of the uncertainty u has to be increased. If e is false and the uncertainty u is low, it is also an ideal case. Conversely, when the prediction error e is false and the uncertainty u is high, the uncertainty u should be decreased. The loss function used for training the attentive network 470 should take these factors into account.

In one or more embodiments, these factors may be interpreted as a cross-entropy loss, where the target value u_gt is the prediction error e and the output of the attentive network 470 is the predicted attentional uncertainty u_a.

The attentional loss function L_a is used to train the attentional network a to produce u_a.

In one or more embodiments, the attentional loss function may be expressed using equation (5):

where attended pixels are integrated in the loss function so that they have a greater effect in the summation function, and where the cross-entropy penalizes the deviation of the attentional uncertainty u_a from the prediction error e.
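As a non-limiting illustration, the sketch below implements one possible loss of this kind: a binary cross-entropy between the attentional uncertainty u_a and the prediction error e, in which attended pixels receive a larger weight. The weighting factor `attended_weight`, the clipping constant `eps` and the normalization by the sum of weights are assumptions of the example and are not taken from equation (5).

```python
import math

def attentional_loss(u_a_map, e_map, mask, attended_weight=2.0, eps=1e-7):
    """Mask-weighted binary cross-entropy between u_a and the error e.

    attended_weight (> 1) is a hypothetical factor giving attended pixels a
    greater effect in the sum; eps clips u_a away from 0 and 1.
    """
    total, weight_sum = 0.0, 0.0
    for ua_row, e_row, m_row in zip(u_a_map, e_map, mask):
        for u_a, e, m in zip(ua_row, e_row, m_row):
            u_a = min(max(u_a, eps), 1.0 - eps)
            bce = -(e * math.log(u_a) + (1 - e) * math.log(1.0 - u_a))
            w = attended_weight if m else 1.0
            total += w * bce
            weight_sum += w
    return total / weight_sum

e_map = [[1, 0]]   # first pixel is misclassified
mask = [[1, 0]]    # first pixel is attended
good = attentional_loss([[0.9, 0.1]], e_map, mask)  # u_a follows the error
bad = attentional_loss([[0.1, 0.1]], e_map, mask)   # u_a misses the error
```

The loss is lower when u_a is high on the misclassified pixel and low on the correct one, which matches the ideal cases described above.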

Uncertainty-Error Neural Network training

In one or more embodiments, an attentional uncertainty-error based mechanism is introduced during training of the neural network 460 in order to leverage the uncertainty in the prediction of the neural network 460. The mechanism acts as a kernel or filter that highlights confident misclassified pixels (i.e., the prediction error e is true and the uncertainty u is low).

In one or more embodiments, the training of the neural network 460 is conducted in two steps: a first training procedure and a second training procedure.

In a first training procedure, the neural network 460 is trained to the best performance possible with standard training. The first training procedure starts with the best performing weights W^t0 obtained before over-fitting on each training example X_k in the training dataset presented to the neural network 460. In one or more embodiments, the uncertainty u is estimated using Bayesian predictive entropy, and the associated label y_k allows the prediction error e to be computed.

In a second training procedure, an attentional kernel identifies the regions where errors occur during training of the neural network 460. Each location where the uncertainty u is high is penalized. The attentional kernel can be described as a weight map that is introduced to give some pixels more importance during the training of the neural network 460. In one or more embodiments, the attentional kernel for a given example k may be expressed using equation (6):

where k stands for the index of the example X_k, and i and j are the two indices required to locate a pixel in an image. The function strictly increases the weight of each pixel: at minimum, the kernel K^k_ij is equal to one when the uncertainty u^k_ij equals zero and there is no error, e^k_ij = 0. The hyper-parameter alpha (α) can be set in order to increase the uncertainty contribution in the penalization properties of the kernel K^k_ij. The parameter zeta (ζ) is introduced in order to modulate the contribution of a misclassification in conjunction with a high level of confidence, i.e. when u^k_ij is small. The parameter epsilon (ε) is used to avoid a possible divergence of the former term (the u-e, or uncertainty-error, term). The parameter beta (β) is designed to modulate the contribution of a low level of uncertainty on misclassified pixels: the lower beta is set, the steeper the behavior of the u-e term.

As a non-limiting example, in one or more embodiments, the parameters may be set as ε = 0.001 and β = 0.1.

The impact of uncertainty on the dynamics of the attentional kernel may be determined. In cases where the uncertainty u ≪ 1 (i.e., the neural network 460 is very confident in a prediction), a misclassification is penalized all the more strongly as the neural network 460 is confident in its decision. In the opposite cases (u ≫ 1), a high uncertainty at the location of a misclassification decreases the uncertainty-error (u-e) contribution, while the kernel still increases linearly with the level of uncertainty.

To simplify the reading, some indices are removed from the terms. A pseudo code 1 for the training with the attentional kernel for a pixel in a given image m is provided below. During training, an uncertainty map (U-map) is determined on each example of a given batch, before back-propagation is applied at the end of the batch. For each example, forward passes are done with perturbed (dropout) copies of the initial weights.

These inferences are used to compute the uncertainty map (U-map). The error map (E-map) is computed with the predictions obtained with the unperturbed initial weights. The uncertainty map (U-map) and the error map (E-map) are combined to obtain a kernel K (see equation (6)), and K is used to modify the contribution of each pixel to the loss function. This is done for each example of the batch; back-propagation then follows and the initial weights are modified to obtain the updated weights.
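As a non-limiting illustration of how a kernel may modify the contribution of each pixel to the loss function, the sketch below scales a pixel-wise cross-entropy by a weight map. The kernel is assumed to be precomputed from the U-map and the E-map; the function name, the optional class weights and the toy values are hypothetical and do not reproduce equation (6).

```python
import math

def kernel_weighted_loss(prob_map, target_map, kernel, class_weights=None):
    """Pixel-wise cross-entropy where each pixel is scaled by kernel[i][j].

    kernel: H x W weight map assumed to be precomputed from the U-map and
    the E-map; class_weights optionally rebalances the classes.
    """
    total, count = 0.0, 0
    for i, row in enumerate(prob_map):
        for j, probs in enumerate(row):
            target = target_map[i][j]
            ce = -math.log(max(probs[target], 1e-12))
            if class_weights is not None:
                ce *= class_weights[target]
            total += kernel[i][j] * ce
            count += 1
    return total / count

probs = [[[0.9, 0.1], [0.9, 0.1]]]  # the second pixel is a confident error
targets = [[0, 1]]
uniform = kernel_weighted_loss(probs, targets, [[1.0, 1.0]])
weighted = kernel_weighted_loss(probs, targets, [[1.0, 3.0]])
```

Raising the kernel value on the confidently misclassified pixel increases its share of the loss, and therefore of the gradient, without changing the contribution of the correctly classified pixel.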

Pseudo code 1: Training with attentional kernels

Definitions:

(t), (b), (e) and (d) are indices running respectively from 1 to the number of training epochs, from 1 to the number of batches per epoch, from 1 to the number of examples per batch, and from 1 to the number of perturbed (dropout) copies of a given example.

(i) and (j), as defined previously, are indices used to locate a pixel in an image; they run respectively from 1 to the number of pixels along the height and along the width of the image.

Unperturbed learning weights at the beginning of the batch (b).

State of the learning weights, not perturbed with a dropout mask, at the beginning of the batch (b) for a given example (e); identical to the unperturbed learning weights at the beginning of the batch (b).

Copy (d) of the learning weights, perturbed with a dropout mask, for the batch (b) and the example (e).

Class weights, used to balance the unequal numbers of pixels belonging to each class.

Predicted probabilities output by the network for the example (e), obtained with the unperturbed learning weights.

Predicted probabilities output by the network, obtained with the perturbed copy (d) of the learning weights; the dropout mask is different for each example (e) and each copy (d).

Implementation

As a non-limiting example, during experiments, the CamVid test set was used during the training, and the neural network 460 was implemented with Bayesian Tiramisu [15] as its backbone. It will be appreciated that any other semantic segmentation architecture may be used. It will be further appreciated that Tiramisu was chosen since it reported state-of-the-art results on the CamVid dataset. CamVid is a road scene segmentation dataset in which each pixel is associated with one of 11 classes, e.g., sky, building, etc. All frames are scaled down to 480 × 360 pixels in the experiments. Results were generated using a machine equipped with an NVIDIA GTX 1060 GPU.

For simplicity, only one fully connected layer is considered in the attentional network a. The approach described above requires two hyper-parameters, α and ζ, which are set to 3.1 and 2.1, respectively. A preliminary experiment was conducted to select the optimal values for these parameters during training to make it robust across different classes.

In one or more embodiments, the processing device 300 may execute one or more of the set of MLAs 450. In one or more alternative embodiments, one or more of the set of MLAs 450 may be executed by another server (not depicted), and the processing device 300 may access the one or more of the set of MLAs 450 for training or for use by connecting to the server (not shown) via an API (not depicted), and specify parameters of the one or more of the set of MLAs 450, transmit data to and/or receive data from the MLAs 450, without directly executing the one or more of the set of MLAs 450.

As a non-limiting example, one or more MLAs of the set of MLAs 450 may be hosted on a cloud service providing a machine learning API.

Database

A database 430 is communicatively coupled to the processing device 300 via the communications network 440 but, in one or more alternative implementations, the database 430 may be directly coupled to the processing device 300 without departing from the teachings of the present technology. Although the database 430 is illustrated schematically herein as a single entity, it will be appreciated that the database 430 may be configured in a distributed manner, for example, the database 430 may have different components, each component being configured for a particular kind of retrieval therefrom or storage therein. The database 430 may be a structured collection of data, irrespective of its particular structure or the computer hardware on which data is stored, implemented or otherwise rendered available for use. The database 430 may reside on the same hardware as a process that stores or makes use of the information stored in the database 430 or it may reside on separate hardware, such as on the processing device 300. The database 430 may receive data from the processing device 300 for storage thereof and may provide stored data to the processing device 300 for use thereof.

In one or more embodiments of the present technology, the database 430 is configured to inter alia: (i) store training datasets, validation datasets and testing datasets of multimodal data; (ii) store parameters of the set of MLAs 450; (iii) store predictions, uncertainty maps and error maps; and (iv) store multimodal data on which the set of MLAs 450 performs predictions after training.

Communication Network

In one or more embodiments of the present technology, the communications network 440 is the Internet. In one or more alternative non-limiting embodiments, the communications network 440 may be implemented as any suitable local area network (LAN), wide area network (WAN), a private communication network or the like. It will be appreciated that implementations for the communications network 440 are for illustration purposes only. How a communication link 445 (not separately numbered) between the processing device 300, the database 430, and/or another electronic device (not shown) and the communications network 440 is implemented will depend inter alia on how each electronic device is implemented.

Training Method

Now referring to Fig. 1, there is shown an embodiment of a method 100 for building a robust machine learning system. In one or more embodiments, the processing device 300 stores in at least one of the memory unit 312 and the memory unit 318, computer-executable instructions which, when executed, cause one of the central processing unit (CPU) 302 and the graphic processing unit (GPU) 316 to execute the method 100.

According to processing step 102, a neural network 460 is obtained.

It will be appreciated that the neural network to train may be of various types. In one or more embodiments, the neural network to train is a standard semantic segmentation network, e.g. DeepLabv3+, Tiramisu, SegNet. The skilled addressee will appreciate that various embodiments may be provided for the neural network to train.

Non-limiting examples of semantic segmentation networks include U-Net, Feature Pyramid Network (FPN), Pyramid Scene Parsing Network (PSPNet), Mask R-CNN, path aggregation network (PPANet), and Context Encoding Network (EncNet).

Moreover, it will be appreciated that the neural network 460 may be obtained according to various embodiments.

In one or more embodiments, the neural network 460 is a neural network having been trained according to standard training techniques to the best performance possible.

In one or more embodiments, the neural network 460 is obtained from the memory unit 318 of the processing device 300.

In another embodiment, the neural network 460 is obtained from a user interacting with the processing device 300.

In one or more embodiments, the neural network 460 is obtained by initializing model parameters and hyperparameters of the neural network 460. In one or more alternative embodiments, the neural network 460 is obtained with learned model parameters.

In one or more other embodiments, the neural network 460 to train is obtained from another processing device, not shown, operatively connected to the processing device 300. It will be appreciated that the other processing device may be connected to the processing device 300 using the communications network 440. In one or more alternative embodiments, the neural network 460 is obtained from the database 430 over the communications network 440.

Still referring to Fig. 1 and according to processing step 104, a training dataset is generated.

Dataset Generation Method

Now referring to Fig. 2, there is shown an embodiment of a method 200 for generating the training dataset. The method is executed by the processing device 300. In one or more embodiments, the processing device 300 stores in at least one of the memory unit 312 and the memory unit 318, computer-executable instructions which, when executed, cause one of the central processing unit (CPU) 302 and the graphic processing unit (GPU) 316 to execute the method 200.

According to processing step 202, a segmented dataset comprising a plurality of multimodal data is obtained.

In one or more embodiments, the multimodal data comprises image data. In such embodiments, the segmented dataset comprises a plurality of segmented image data. In one or more embodiments, the image data is 2D image data.

In one or more other embodiments, the multimodal data comprises sound data. In such embodiments, the segmented dataset comprises a plurality of segmented sound data.

In one or more alternative embodiments, the multimodal data comprises at least one of image data, video data, audio data, and signal data.

In one or more embodiments, the segmented dataset is a dataset on which the neural network 460 has performed segmentation predictions. As a non-limiting example, in one or more embodiments where each segmented data comprises image data, the neural network 460 may have performed semantic segmentation and predicted a mask for each recognized object in the image data and assigned a respective class to each of the recognized objects in the image data.

According to processing step 204, an uncertainty map is provided for each segmented multimodal data.

It will be appreciated that an uncertainty map indicates when the neural network 460 is likely to make an incorrect prediction or when an input may be out of distribution. It will be appreciated that in one or more alternative embodiments, the uncertainty map is generated by taking into account the correlation between the uncertainty and the prediction error.

In one or more embodiments, the uncertainty map may be provided by being generated using deterministic approaches and Bayesian techniques.

In one or more embodiments, the uncertainty map is generated using Bayesian predictive entropy, which is indicative of uncertainty in the model and/or the data. In one or more alternative embodiments, the uncertainty map may be generated using the entropy of the softmax distribution.

In the case where the multimodal data comprises image data, the uncertainty map comprises for each pixel of the image data an indication of a corresponding reliability of the corresponding segmentation performed by the neural network 460 for that particular image data.

According to processing step 206, each segmented multimodal data is combined with a corresponding uncertainty map.

It will be appreciated that the purpose of the combination is to use the information of each uncertainty map in each corresponding segmented multimodal data. For instance, in the case where the multimodal data comprises segmented image data, the purpose of the combination is to use the information of each uncertainty map in each corresponding segmented image data. Accordingly, for each pixel of an image data, the corresponding data from the uncertainty map is used.

In one or more embodiments, the method further comprises obtaining a generalized version dataset. The obtaining of the generalized version dataset comprises obtaining a compressed version of the training dataset, applying a segmentation to the compressed version of the training dataset, providing an uncertainty map corresponding to the compressed version of the training dataset, combining the segmented compressed version of the training dataset and the uncertainty map corresponding to the compressed version of the training dataset, and uncompressing the combination of the segmented compressed version of the training dataset and the uncertainty map.

It will also be appreciated that in one or more embodiments, the combining is performed using a refined uncertainty map generated using the uncertainty map. The refined uncertainty map may be generated by first obtaining a set of uncertainty maps and deriving a unique uncertainty map through a majority vote on the set of uncertainty maps.
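As a non-limiting illustration of deriving a unique uncertainty map through a majority vote, the sketch below binarizes each map of the set and marks a pixel as uncertain when more than half of the maps agree. The binarization level `level`, the function name and the sample maps are assumptions of the example; the present description does not specify how the vote is carried out.

```python
def majority_vote(uncertainty_maps, level=0.5):
    """Mark a pixel uncertain (1) when more than half of the maps exceed level.

    uncertainty_maps: list of H x W maps with values in [0, 1].
    level: hypothetical binarization level applied before voting.
    """
    n_maps = len(uncertainty_maps)
    height = len(uncertainty_maps[0])
    width = len(uncertainty_maps[0][0])
    voted = []
    for i in range(height):
        row = []
        for j in range(width):
            votes = sum(1 for m in uncertainty_maps if m[i][j] > level)
            row.append(int(2 * votes > n_maps))
        voted.append(row)
    return voted

# Three hypothetical 1 x 2 uncertainty maps for the same image.
maps = [[[0.9, 0.2]], [[0.7, 0.6]], [[0.1, 0.3]]]
refined = majority_vote(maps)
```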

It will be appreciated that a threshold may be used for the purpose of the combining. The threshold may depend on the data of the uncertainty map. For instance and in accordance with an embodiment, any pixel having an uncertainty level above 80% may be considered as uncertain and the result of the combination may be an indication that such pixel cannot be used for the purpose of a training.
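As a non-limiting illustration of the thresholding described above, the sketch below combines a label map with an uncertainty map by flagging every pixel whose uncertainty exceeds 80% so that it can be excluded from training. The marker `IGNORE_LABEL`, the function name and the assumption that the uncertainty is expressed in the range [0, 1] are hypothetical.

```python
IGNORE_LABEL = -1  # hypothetical marker for pixels excluded from training

def mask_uncertain_pixels(label_map, uncertainty_map, threshold=0.8):
    """Replace the label of every pixel whose uncertainty exceeds threshold."""
    return [[IGNORE_LABEL if u > threshold else label
             for label, u in zip(label_row, u_row)]
            for label_row, u_row in zip(label_map, uncertainty_map)]

labels = [[3, 3], [5, 7]]
u_map = [[0.10, 0.85], [0.80, 0.95]]
combined = mask_uncertain_pixels(labels, u_map)
```

A training loss can then skip the pixels carrying the marker, so that only pixels considered reliable contribute to the training.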

In one or more embodiments, the refined uncertainty map comprises an attentive uncertainty map generated using the uncertainty map, the target labels, and the segmentation predictions performed by the neural network 460. In one or more embodiments, the attentive uncertainty map is generated using the uncertainty map, and an error map. The error map is obtained using the segmentation prediction and the target or label of the segmented multimodal data.

In one or more embodiments, the attentive uncertainty map is obtained by being generated by an attentive network 470 having been trained to generate attentive uncertainty maps. The attentive network 470 comprises an attention layer and one or more fully connected layers.

The attention layer of the attentive network 470 generates as an output attended pixels by masking out values of the prediction error that do not match values of the uncertainty.

The attentive network 470 may have been trained to generate attentive uncertainty maps using an attentional loss function which is similar to a cross-entropy loss function where the target value is a prediction error and the predicted value is the attentive uncertainty, and which is weighted by the output of the attention layer, i.e. the attended pixels.

It will be appreciated that the combination may be performed according to various embodiments as shown herein below.

Now referring to Fig. 1 and according to processing step 106, the neural network is trained. It will be appreciated that in one or more embodiments, the neural network is trained with the generated training dataset using the processing device 300. The skilled addressee will appreciate that the learning may be performed according to various embodiments described above.

In one or more embodiments, the neural network 460 is first trained according to standard training techniques using the training dataset. As a second step, the neural network 460 is trained using the refined uncertainty map, which comprises an attentional kernel that identifies regions in which errors occur in each segmented data, such that regions where the uncertainty is high or above a threshold are penalized, which gives some pixels more importance in the training. According to processing step 108, the trained neural network 460 is provided. In one or more embodiments, the trained neural network 460 is provided according to its final model parameters, i.e. biases and weights, obtained during training.
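By way of a non-limiting illustration, the second training step may be sketched as a per-pixel cross-entropy in which the attentional kernel increases the weight of selected pixels. The function name `weighted_segmentation_loss`, the weighting rule `1 + boost * k`, and the parameter `boost` are hypothetical choices made for this sketch; the disclosure does not specify the exact form of the weighting.

```python
import math


def weighted_segmentation_loss(probs, targets, kernel, boost=1.0, eps=1e-7):
    """Weighted cross-entropy over a non-empty list of pixels.

    probs:   per-pixel lists of class probabilities
    targets: per-pixel index of the correct class
    kernel:  per-pixel attentional weights in [0, 1] (assumed range)
    """
    total = 0.0
    weight_sum = 0.0
    for p, t, k in zip(probs, targets, kernel):
        w = 1.0 + boost * k  # pixels flagged by the kernel count more
        total += w * -math.log(max(p[t], eps))
        weight_sum += w
    return total / weight_sum


# Two pixels, two classes; the second pixel is poorly predicted.
probs = [[0.9, 0.1], [0.2, 0.8]]
targets = [0, 0]
plain = weighted_segmentation_loss(probs, targets, kernel=[0.0, 0.0])
emphasized = weighted_segmentation_loss(probs, targets, kernel=[0.0, 1.0])
print(emphasized > plain)  # True
```

Under these assumptions, placing the kernel on the poorly predicted pixel raises the loss, so that gradient updates are driven more strongly by that pixel.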

It will be appreciated that the trained neural network 460 may be provided according to various embodiments.

More precisely and in accordance with one or more embodiments, the trained neural network 460 is stored in the memory unit of the processing device 300.

In one or more other embodiments, the trained neural network 460 is provided to the remote processing device operatively coupled to the processing device 300.

In one or more alternative embodiments, the trained neural network 460 is stored in the database.

The skilled addressee will appreciate that various alternative embodiments may be provided for providing the trained neural network 460.

It will be appreciated that the neural network 460 may be provided by providing the model parameters resulting from the training for the specific application.

It will be appreciated that the application 320 for building a robust machine learning system comprises instructions for obtaining a neural network 460 to train. The application 320 for building a robust machine learning system further comprises instructions for generating a training dataset, the generating comprising obtaining a segmented dataset comprising a plurality of multimodal data, providing an uncertainty map for each segmented multimodal data of the segmented dataset, the uncertainty map providing an indication of a performance of a corresponding segmentation, and combining each segmented multimodal data of the segmented dataset with a corresponding uncertainty map so as to provide the training dataset, the training dataset comprising a plurality of multimodal data wherein each multimodal data is segmented using the uncertainty map. The application 320 for building a robust machine learning system further comprises instructions for training the neural network using the training dataset to thereby build a robust machine learning system; and instructions for providing the trained neural network.

It will be appreciated that there is also disclosed a non-transitory computer readable storage medium for storing computer-executable instructions which, when executed, cause a computer to perform a method for building a robust machine learning system, the method comprising obtaining a neural network to train; generating a training dataset, the generating comprising obtaining a segmented dataset comprising a plurality of multimodal data, providing an uncertainty map for each segmented multimodal data of the segmented dataset, the uncertainty map providing an indication of a performance of a corresponding segmentation, and combining each segmented multimodal data of the segmented dataset with a corresponding uncertainty map so as to provide the training dataset, the training dataset comprising a plurality of multimodal data wherein each multimodal data is segmented using the uncertainty map; training the neural network using the training dataset to thereby build a robust machine learning system; and providing the trained neural network.

Uncertainty is a natural part of any predictive system. Uncertainty estimation helps to produce spatial maps, or uncertainty maps, from which it can be observed where and why a system might fail. It also quantifies an image-level prediction of failure, which is useful for isolating specific cases and removing them from automated pipelines. An uncertainty map detects when a model is likely to make an incorrect prediction, or when an input may be out-of-distribution [5]. Researchers use approaches in which the neural network is manipulated during training to produce uncertainty that reflects the ability of models to produce a correct prediction for given inputs. In the literature, uncertainty is divided into two main types, i.e., epistemic and aleatoric [15].
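By way of a non-limiting illustration, one of the options named in the clauses herein below for producing a per-pixel uncertainty value is the entropy of a softmax distribution. The sketch below computes that entropy for a single pixel; the function names, the normalization by the logarithm of the number of classes, and the example logits are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [x / s for x in exps]


def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1]; requires at least two classes."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


# A pixel with equal class scores is maximally uncertain;
# a pixel with one dominant class score has an uncertainty close to zero.
undecided = normalized_entropy(softmax([0.0, 0.0, 0.0]))
decided = normalized_entropy(softmax([10.0, 0.0, 0.0]))
print(undecided > decided)  # True
```

Applying `normalized_entropy` to every pixel of a segmentation output would yield one possible uncertainty map under these assumptions.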

It will be appreciated by the skilled addressee that the rationale behind using uncertainty as a robustness metric in semantic segmentation is that if a model is confident about its prediction, it should be accurate on that prediction [20]. If an algorithm is trained and tested with a high level of accuracy only because the score of the right decision is slightly higher than other options, the level of uncertainty can indicate that the decision process is unstable and susceptible to shift to a completely different conclusion. If the decision is right and the uncertainty is low, small changes in inputs should not change the decision. In other words, an algorithm is not robust if its level of uncertainty and prediction error are not correlated.
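By way of a non-limiting illustration, the relation between uncertainty and prediction error may be checked with a Pearson correlation coefficient computed over the pixels of an image. The choice of the Pearson coefficient, the function name `pearson`, and the toy values are assumptions of this sketch; the disclosure does not prescribe a particular correlation measure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-empty lists.

    Returns 0.0 when either list is constant (correlation undefined).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


# Binary error per pixel: 1 means the pixel was misclassified.
error = [1, 1, 0, 0]
# Uncertainty that is high where errors occur, and the opposite case.
aligned = pearson([0.9, 0.8, 0.1, 0.2], error)
misaligned = pearson([0.1, 0.2, 0.9, 0.8], error)
print(aligned > 0 > misaligned)  # True
```

Under this reading, a strongly positive coefficient corresponds to a model that is uncertain where it errs, whereas a low or negative coefficient corresponds to confident misclassifications.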

The correlation between the uncertainty and prediction error is not considered in the prior art uncertainty estimation methods. There are regions where the algorithm is confident but it fails in prediction (Fig. 5). Therefore, the prior art uncertainty estimation methods are not suitable to assess the robustness of a semantic segmentation algorithm. In order to resolve this issue, two complementary attention-based approaches are disclosed to improve both the estimated uncertainty and the prediction. More precisely, a method called attentional uncertainty is disclosed to estimate the uncertainty. A neural network-based semantic segmentation model is trained, a post-hoc uncertainty is estimated, and an attention mechanism is applied to find regions in the space of uncertainty for which the model output is not accurate, as shown in Fig. 5. A smaller network is then trained to correct the estimated uncertainty in the attended regions. The attended regions and their uncertainty values are then employed as weight kernels in the loss function in order to improve the performance of the semantic segmentation algorithm.
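By way of a non-limiting illustration, the regions in which the model is confident but wrong may be located by comparing a binary error map with a thresholded uncertainty map. The function name `mismatch_kernel`, the threshold of 0.5, and the weight `1 - u` assigned to each confident misclassified pixel are hypothetical choices for this sketch and are not taken from the disclosure.

```python
def mismatch_kernel(uncertainty, error, threshold=0.5):
    """Per-pixel weights that are non-zero only for confident misclassifications.

    uncertainty: per-pixel values in [0, 1] (assumed normalization)
    error:       per-pixel 1 if misclassified, else 0
    """
    kernel = []
    for u, e in zip(uncertainty, error):
        confident = u < threshold
        if e == 1 and confident:
            kernel.append(1.0 - u)  # the more confident the mistake, the larger the weight
        else:
            kernel.append(0.0)
    return kernel


# Pixel 0 is wrong with low uncertainty, so it is the only attended pixel.
kernel = mismatch_kernel([0.1, 0.9, 0.2, 0.7], [1, 1, 0, 0])
print([k > 0 for k in kernel])  # [True, False, False, False]
```

A kernel of this kind could then serve as the per-pixel weight in the loss function, under the stated assumptions.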

There is therefore disclosed a framework which discloses a novel measure to attribute the robustness of semantic segmentation algorithms to the uncertainty. It will be appreciated that there is also disclosed attentional uncertainty estimation which is more suitable to evaluate the robustness. The estimated uncertainty is integrated through the network to improve the prediction results. The uncertainty integration leads to attentional kernels, which can be interpreted as weights to penalize the confident misclassified pixels.

It will be appreciated that in order to keep the uncertainty tightly related to the definition of robustness, there should be a correlation between prediction error and uncertainty estimates. In other words, if a model is confident about its prediction, it should be accurate on that prediction. However, this correlation does not seem to be true in the prior art uncertainty estimation methods. Two complementary approaches are therefore disclosed to overcome this limitation. A first approach seeks to correct the uncertainty estimates where the error and uncertainty are not correlated. A second approach leverages the uncertainty during the training to improve the prediction output. These two approaches are explained in more detail in the following sections.

Qualitative results

Now referring to Fig. 5, there are shown some qualitative results on CamVid test images. In the case of the baseline uncertainty (predictive uncertainty), high uncertainty is observed inside the boundaries of objects which the model is confused about. However, when the baseline uncertainty map is compared with the prediction error map, regions can be seen where the uncertainty is low (dark blue) but an error is present (red color), such as the tree/building area in the second row or some parts of the road in the third row. In attentional uncertainty, the uncertainty of problematic regions has been corrected and the uncertainty maps and error maps are more correlated. This observation supports that attentional uncertainty improves uncertainty estimates over the baseline.

It will be appreciated that two complementary approaches are presented to augment the robustness for the task of semantic segmentation. It is shown that understanding what a model does not know can be a metric to evaluate the performance of a neural network used as a segmentation model. The robustness is attributed to the uncertainty by proposing an attention mechanism to produce attentional uncertainty and to integrate the uncertainty throughout the segmentation network. The attentional mechanism seeks to determine regions where the uncertainty and the prediction error are not correlated. To improve the prediction results, an uncertainty-error based mechanism is introduced during training and an attentional kernel is employed to put more weight on regions where the uncertainty and error are not correlated. For experiments, the Cityscapes dataset has been used to show the performance of the methods disclosed on a safety critical application.

It will be appreciated by those skilled in the art that at least one or more embodiments of the present technology aim to expand a range of technical solutions for addressing a particular technical problem, namely improving the robustness of a machine learning algorithm in the context of semantic segmentation of images, which may be used in applications such as, but not limited to, autonomous driving, medical imaging, and object recognition in general, which may in turn save computational resources by improving the accuracy of the detection.

From a certain perspective, one or more embodiments of the present technology can be summarized as follows, structured in numbered clauses:

Clause 1. A method (100,200) for training a neural network (460), the method (100,200) being executed by a processor (302, 316), the method (100,200) comprising: obtaining a neural network (460); generating a training dataset, the generating comprising: obtaining a segmented dataset comprising a plurality of multimodal data, obtaining an uncertainty map for each segmented multimodal data of the segmented dataset, the uncertainty map being indicative of a performance of a corresponding segmentation of the neural network (460), obtaining a segmentation error map for each segmented multimodal data of the segmented dataset, and combining each segmented multimodal data of the segmented dataset with a corresponding uncertainty map and a corresponding segmentation error map so as to provide the training dataset, the training dataset comprising a plurality of multimodal data wherein each multimodal data is segmented using the corresponding uncertainty map and the corresponding segmentation error map; training the neural network (460) using the training dataset to thereby obtain a trained neural network (460); and providing the trained neural network (460).

Clause 2. The method (100,200) as claimed in clause 1, wherein said training of the neural network (460) using the training dataset to thereby obtain the trained neural network (460) is a second training; and wherein the method (100,200) further comprises prior to said obtaining of the segmented dataset: obtaining the plurality of multimodal data, each multimodal data comprising a respective segmentation target; and first training the neural network (460) to perform segmentation on the plurality of multimodal data using the respective segmentation targets to obtain the segmented dataset comprising the plurality of segmented multimodal data.

Clause 3. The method (100,200) as claimed in clause 2, further comprising: generating the uncertainty map using the respective segmentation targets.

Clause 4. The method (100,200) as claimed in clause 2 or 3, wherein said first training comprises obtaining a first set of weights of the neural network (460); and wherein said second training comprises obtaining a second set of weights of the neural network (460) to obtain the trained neural network (460).

Clause 5. The method (100,200) as claimed in clause 4, further comprising generating the segmentation error map for each segmented multimodal data using the performed segmentation with the first set of weights and the respective segmentation target.

Clause 6. The method (100,200) as claimed in any one of clauses 1 to 5, wherein said training of the neural network (460) using the training dataset to thereby obtain the trained neural network (460) comprises: using an attentional kernel on each multimodal data, the attentional kernel being indicative of a mismatch between the corresponding uncertainty map and the corresponding segmentation error map.

Clause 7. The method (100,200) as claimed in clause 6, wherein said using the attentional kernel comprises using a mismatch region between the corresponding uncertainty map and the corresponding segmentation error map of the multimodal data as a weight.

Clause 8. The method (100,200) as claimed in clause 7, wherein said using of the mismatch region is in response to determining the mismatch region between the corresponding uncertainty map and the corresponding segmentation error map based on a threshold.

Clause 9. The method (100,200) as claimed in any one of clauses 1 to 8, wherein said generating of the uncertainty map comprises using Bayesian predictive entropy.

Clause 10. The method (100,200) as claimed in any one of clauses 1 to 9, wherein said uncertainty map is a calibrated uncertainty map.

Clause 11. The method (100,200) as claimed in any one of clauses 1 to 10, wherein the multimodal data comprises image data; and wherein the neural network (460) comprises a semantic segmentation network.

Clause 12. The method (100,200) as claimed in any one of clauses 1 to 11, further comprising: obtaining further multimodal data, the neural network (460) not having been trained on the further multimodal data; and providing, using the trained neural network (460), a segmentation prediction for the further multimodal data.

Clause 13. A method (100,200) for estimating a robustness of a neural network (460) in generating segmentation predictions, the method (100,200) being executed by a processor (302, 316), the method (100,200) comprising: obtaining the neural network (460); obtaining multimodal data associated with a target label; generating, by the neural network (460), a segmentation prediction for the multimodal data; determining, using the target label and the segmentation prediction, a prediction error; obtaining an uncertainty associated with the neural network (460); obtaining an attentive network; determining, by the attentive network, using the prediction error and the uncertainty parameter, an attentive uncertainty indicative of a robustness of the neural network (460) in generating segmentation predictions.

Clause 14. The method (100,200) as claimed in clause 13, wherein the obtaining of the uncertainty comprises obtaining an uncertainty map; and wherein the obtaining of the prediction error comprises obtaining of a prediction error map.

Clause 15. The method (100,200) as claimed in clause 14, wherein said obtaining of the uncertainty map comprises generating the uncertainty map using Bayesian predictive entropy.

Clause 16. The method (100,200) as claimed in clause 14, wherein said obtaining of the uncertainty map comprises generating the uncertainty map using an entropy of a softmax distribution.

Clause 17. The method (100,200) as claimed in any one of clauses 13 to 16, wherein the attentive uncertainty is simultaneously indicative of an uncertainty of the neural network (460) in generating predictions and an error of the neural network (460) in generating predictions.

Clause 18. The method (100,200) as claimed in any one of clauses 13 to 17, further comprising, prior to said determining of the attentive uncertainty: training the attentive network to determine attentive uncertainties using a cross-entropy loss function.

Clause 19. The method (100,200) as claimed in any one of clauses 13 to 18, wherein the attentive network comprises an attention layer, and at least one fully-connected layer for processing the uncertainty map and the prediction error map to obtain the attentive uncertainty map.

Clause 20. The method (100,200) as claimed in clause 19, wherein the determining of the attentive uncertainty map comprises determining, by the attention layer, an attention map by masking pixels having a prediction error above zero.

Clause 21. The method (100,200) as claimed in clause 20, wherein the determining of the attentive uncertainty comprises using a sigmoid activation function on the attention map to obtain the attentive uncertainty.

Clause 22. A processing device (300) comprising: a processor (302, 316); and a non-transitory storage medium (312, 318) connected to the processor (302, 316), the non-transitory storage medium (312, 318) comprising computer-executable instructions, the processor (302, 316), upon executing the computer-executable instructions, being configured for: obtaining a neural network (460); generating a training dataset, the generating comprising: obtaining a segmented dataset comprising a plurality of multimodal data, obtaining an uncertainty map for each segmented multimodal data of the segmented dataset, the uncertainty map being indicative of a performance of a corresponding segmentation of the neural network (460), obtaining a segmentation error map for each segmented multimodal data of the segmented dataset, and combining each segmented multimodal data of the segmented dataset with a corresponding uncertainty map and a corresponding segmentation error map so as to provide the training dataset, the training dataset comprising a plurality of multimodal data wherein each multimodal data is segmented using the corresponding uncertainty map and the corresponding segmentation error map; training the neural network (460) using the training dataset to thereby obtain a trained neural network (460); and providing the trained neural network (460).

Clause 23. The processing device (300) as claimed in clause 22, wherein said training of the neural network (460) using the training dataset to thereby obtain the trained neural network (460) is a second training; and wherein the processing device (300) is further configured for, prior to said obtaining of the segmented dataset: obtaining the plurality of multimodal data, each multimodal data comprising a respective segmentation target; and first training the neural network (460) to perform segmentation on the plurality of multimodal data using the respective segmentation targets to obtain the segmented dataset comprising the plurality of segmented multimodal data.

Clause 24. The processing device (300) as claimed in clause 23, wherein the processor (302, 316) is further configured for: generating the uncertainty map using the respective segmentation targets.

Clause 25. The processing device (300) as claimed in clause 23 or 24, wherein said first training comprises obtaining a first set of weights of the neural network (460); and wherein said second training comprises obtaining a second set of weights of the neural network (460) to obtain the trained neural network (460).

Clause 26. The processing device (300) as claimed in clause 25, wherein the processor (302, 316) is further configured for generating the segmentation error map for each segmented multimodal data using the performed segmentation with the first set of weights and the respective segmentation target.

Clause 27. The processing device (300) as claimed in any one of clauses 22 to 26, wherein said second training of the neural network (460) using the training dataset to thereby obtain the trained neural network (460) comprises: using an attentional kernel on each multimodal data, the attentional kernel being indicative of a mismatch between the corresponding uncertainty map and the corresponding segmentation error map.

Clause 28. The processing device (300) as claimed in clause 27, wherein said using the attentional kernel comprises using a mismatch region between the corresponding uncertainty map and the corresponding segmentation error map of the multimodal data as a weight.

Clause 29. The processing device (300) as claimed in clause 28, wherein said using of the mismatch region is in response to determining the mismatch region between the corresponding uncertainty map and the corresponding segmentation error map based on a threshold.

Clause 30. The processing device (300) as claimed in any one of clauses 22 to 29, wherein said generating of the uncertainty map comprises using Bayesian predictive entropy.

Clause 31. The processing device (300) as claimed in any one of clauses 22 to 30, wherein said uncertainty map is a calibrated uncertainty map.

Clause 32. The processing device (300) as claimed in any one of clauses 22 to 31, wherein the multimodal data comprises image data; and wherein the neural network (460) comprises a semantic segmentation network.

Clause 33. The processing device (300) as claimed in any one of clauses 22 to 32, wherein the processor (302, 316) is further configured for: obtaining further multimodal data, the neural network (460) not having been trained on the further multimodal data; and providing, using the trained neural network (460), a segmentation prediction for the further multimodal data.

Clause 34. A processing device (300) comprising: a processor (302, 316); and a non-transitory storage medium (312, 318) connected to the processor (302, 316), the non-transitory storage medium (312, 318) comprising computer-executable instructions, the processor (302, 316), upon executing the computer-executable instructions, being configured for: obtaining the neural network (460); obtaining multimodal data associated with a target label; generating, by the neural network (460), a segmentation prediction for the multimodal data; determining, using the target label and the segmentation prediction, a prediction error; obtaining an uncertainty associated with the neural network (460); obtaining an attentive network; determining, by the attentive network, using the prediction error and the uncertainty parameter, an attentive uncertainty indicative of a robustness of the neural network (460) in generating segmentation predictions.

Clause 35. The processing device (300) as claimed in clause 34, wherein the obtaining of the uncertainty comprises obtaining an uncertainty map; and wherein the obtaining of the prediction error comprises obtaining of a prediction error map.

Clause 36. The processing device (300) as claimed in clause 35, wherein said obtaining of the uncertainty map comprises generating the uncertainty map using Bayesian predictive entropy.

Clause 37. The processing device (300) as claimed in clause 35, wherein said obtaining of the uncertainty map comprises generating the uncertainty map using an entropy of a softmax distribution.

Clause 38. The processing device (300) as claimed in any one of clauses 34 to 37, wherein the attentive uncertainty is simultaneously indicative of an uncertainty of the neural network (460) in generating predictions and an error of the neural network (460) in generating predictions.

Clause 39. The processing device (300) as claimed in any one of clauses 34 to 38, wherein the processor (302, 316) is further configured for, prior to said determining of the attentive uncertainty: training the attentive network to determine attentive uncertainties using a cross-entropy loss function.

Clause 40. The processing device (300) as claimed in any one of clauses 34 to 39, wherein the attentive network comprises an attention layer, and at least one fully-connected layer for processing the uncertainty map and the prediction error map to obtain the attentive uncertainty map.

Clause 41. The processing device (300) as claimed in clause 40, wherein the determining of the attentive uncertainty map comprises determining, by the attention layer, an attention map by masking pixels having a prediction error above zero.

Clause 42. The processing device (300) as claimed in clause 41, wherein the determining of the attentive uncertainty comprises using a sigmoid activation function on the attention map to obtain the attentive uncertainty.

Clause 43. A use of the trained neural network (460) as claimed in any one of clauses 1 to 21 for segmenting images obtained by a camera of an autonomous vehicle, and determining, using the segmented images, an appropriate action for the autonomous vehicle.

Clause 44. A use of the trained neural network (460) as claimed in any one of clauses 1 to 21, for segmenting images obtained from a medical imaging apparatus, and providing the segmented images for displaying on a display device.

Clause 45. A use of the trained neural network (460) as claimed in any one of clauses 22 to 42, for segmenting images obtained by a camera of an autonomous vehicle, and determining, using the segmented images, an appropriate action for the autonomous vehicle.

Clause 46. A use of the trained neural network (460) as claimed in any one of clauses 22 to 42, for segmenting images obtained from a medical imaging apparatus, and providing the segmented images for displaying on a display device.

Clause 47. A non-transitory computer readable storage medium (312, 318) for storing computer-executable instructions which, when executed, cause a computer (300) to perform: obtaining a neural network (460); generating a training dataset, the generating comprising: obtaining a segmented dataset comprising a plurality of multimodal data, obtaining an uncertainty map for each segmented multimodal data of the segmented dataset, the uncertainty map being indicative of a performance of a corresponding segmentation of the neural network (460), obtaining a segmentation error map for each segmented multimodal data of the segmented dataset, and combining each segmented multimodal data of the segmented dataset with a corresponding uncertainty map and a corresponding segmentation error map so as to provide the training dataset, the training dataset comprising a plurality of multimodal data wherein each multimodal data is segmented using the corresponding uncertainty map and the corresponding segmentation error map; training the neural network (460) using the training dataset to thereby obtain a trained neural network (460); and providing the trained neural network (460).

References

[1] Anurag Arnab, Ondrej Miksik, and Philip H.S. Torr. On the robustness of semantic segmentation models to adversarial attacks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[2] Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV (1), pages 44-57, 2008.

[3] Edward Choi, Cao Xiao, Walter F. Stewart, and Jimeng Sun. Mime: Multilevel medical embedding of electronic health records for predictive healthcare. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 4552-4562, 2018.

[4] Robert L. Devaney. An introduction to chaotic dynamical systems. 1948.

[5] Terrance DeVries and Graham W. Taylor. Learning confidence for out-of-distribution detection in neural networks. CoRR, abs/1802.04865, 2018.

[6] Terrance DeVries and Graham W. Taylor. Leveraging uncertainty estimates for predicting segmentation quality. CoRR, 2018.

[7] Y. Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.

[8] Jochen Gast and Stefan Roth. Lightweight probabilistic deep networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, June 18-22, 2018, pages 3369-3378.

[9] Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. In 3rd International Conference on Learning Representation ICLR, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings.

[10] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. CoRR, 2011.

[11] Po-Yu Huang, Wan Ting Hsu, Chun-Yueh Chiu, Ting-Fan Wu, and Min Sun. Efficient uncertainty estimation for semantic segmentation in videos. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, pages 536-552.

[12] Elmer H. Johnson. Elementary applied statistics: For students in behavioral science. Social Forces, 44(3):455-456, 1966.

[13] Michael Kampffmeyer, Arnt-Børre Salberg, and Robert Jenssen. Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. 07 2016.

[14] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. In British Machine Vision Conference BMVC, London, UK, September 4-7, 2017.

[15] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems 30, pages 5574-5584. 2017.

[16] Shu Kong and Charless C. Fowlkes. Pixel-wise attentional gating for parsimonious pixel labeling. CoRR, abs/1805.01556, 2018.

[17] Ivan Kreso, Sinisa Segvic, and Josip Krapac. Ladder-style densenets for semantic segmentation of large natural images. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017.

[18] Yuri A. Kuznetsov. Elements of Applied Bifurcation Theory (2nd Ed.). 1998.

[19] Panagiotis Meletis and Gijs Dubbelman. Training of convolutional networks on multiple heterogeneous datasets for street scene semantic segmentation. CoRR, abs/1803.05675, 2018.

[20] J. Mukhoti and Y. Gal. Evaluating Bayesian Deep Learning Methods for Semantic Segmentation. arXiv e-prints, 2018.

[21] Yu-Shao Peng, Kai-Fu Tang, Hsuan-Tien Lin, and Edward Y. Chang. REFUEL: exploring sparse features in deep reinforcement learning for fast disease diagnosis. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 7333-7342, 2018.

[22] Samuel Rota Bulo, Lorenzo Porzi, and Peter Kontschieder. In-place activated batchnorm for memory-optimized training of dnns. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[23] Andras Rozsa, Manuel Gunther, and Terrance E. Boult. Towards robust deep neural networks with BANG. In IEEE Winter Conference on Applications of Computer Vision WACV, Lake Tahoe, NV, USA, March 12-15, 2018, pages 803-811, 2018.

[24] Claude Elwood Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379-423, 1948.

[25] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640-651, 2017.

[26] S.H. Strogatz. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering. 2014.

[27] Ke Sun, Zhanxing Zhu, and Zhouchen Lin. Enhancing the robustness of deep neural networks by boundary conditional GAN. CoRR, abs/1902.11029, 2019.

Claims

CLAIMS:

1. A method for training a neural network, the method being executed by a processor, the method comprising: obtaining a neural network; generating a training dataset, the generating comprising: obtaining a segmented dataset comprising a plurality of multimodal data, obtaining an uncertainty map for each segmented multimodal data of the segmented dataset, the uncertainty map being indicative of a performance of a corresponding segmentation of the neural network, obtaining a segmentation error map for each segmented multimodal data of the segmented dataset, and combining each segmented multimodal data of the segmented dataset with a corresponding uncertainty map and a corresponding segmentation error map so as to provide the training dataset, the training dataset comprising a plurality of multimodal data wherein each multimodal data is segmented using the corresponding uncertainty map and the corresponding segmentation error map; training the neural network using the training dataset to thereby obtain a trained neural network; and providing the trained neural network.

2. The method of claim 1, wherein said training of the neural network using the training dataset to thereby obtain the trained neural network is a second training; and wherein the method further comprises prior to said obtaining of the segmented dataset: obtaining the plurality of multimodal data, each multimodal data comprising a respective segmentation target; and first training the neural network to perform segmentation on the plurality of multimodal data using the respective segmentation targets to obtain the segmented dataset comprising the plurality of segmented multimodal data.

3. The method of claim 2, further comprising: generating the uncertainty map using the respective segmentation targets.

4. The method of claim 2 or 3, wherein said first training comprises obtaining a first set of weights of the neural network; and wherein said second training comprises obtaining a second set of weights of the neural network to obtain the trained neural network.

5. The method of claim 4, further comprising generating the segmentation error map for each segmented multimodal data using the performed segmentation with the first set of weights and the respective segmentation target.

6. The method of any one of claims 1 to 5, wherein said training of the neural network using the training dataset to thereby obtain the trained neural network comprises: using an attentional kernel on each multimodal data, the attentional kernel being indicative of a mismatch between the corresponding uncertainty map and the corresponding segmentation error map.

7. The method of claim 6, wherein said using the attentional kernel comprises using a mismatch region between the corresponding uncertainty map and the corresponding segmentation error map of the multimodal data as a weight.

8. The method of claim 7, wherein said using of the mismatch region is in response to determining the mismatch region between the corresponding uncertainty map and the corresponding segmentation error map based on a threshold.

9. The method of any one of claims 1 to 8, wherein said generating of the uncertainty map comprises using Bayesian predictive entropy.

10. The method of any one of claims 1 to 9, wherein said uncertainty map is a calibrated uncertainty map.

11. The method of any one of claims 1 to 10, wherein the multimodal data comprises image data; and wherein the neural network comprises a semantic segmentation network.

12. The method of any one of claims 1 to 11, further comprising: obtaining further multimodal data, the neural network not having been trained on the further multimodal data; and providing, using the trained neural network, a segmentation prediction for the further multimodal data.
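By way of non-limiting illustration only, the uncertainty map, segmentation error map and thresholded mismatch region recited in claims 5 to 9 may be sketched in Python as follows. This sketch assumes that the Bayesian predictive entropy is estimated from several stochastic forward passes (for example, Monte Carlo dropout). The function names, the normalisation by the logarithm of the number of classes, the default threshold of 0.5, and the definition of a mismatch as a pixel that is either confident but wrong or uncertain but correct are all assumptions made for the example and are not taken from the claims.

```python
import math


def predictive_entropy(samples):
    """Entropy of the mean class distribution over T stochastic passes for one pixel.

    `samples` is a list of T probability vectors. The result is divided by
    log(number of classes) so that it lies in [0, 1] (an assumed normalisation).
    """
    num_classes = len(samples[0])
    mean = [sum(s[c] for s in samples) / len(samples) for c in range(num_classes)]
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    return entropy / math.log(num_classes)


def uncertainty_map(sample_maps):
    """sample_maps[t][i][j] is the class-probability vector of pixel (i, j) in pass t."""
    height, width = len(sample_maps[0]), len(sample_maps[0][0])
    return [
        [predictive_entropy([m[i][j] for m in sample_maps]) for j in range(width)]
        for i in range(height)
    ]


def error_map(predicted, target):
    """Binary map: 1 where the predicted label differs from the segmentation target."""
    return [
        [1 if p != t else 0 for p, t in zip(pred_row, target_row)]
        for pred_row, target_row in zip(predicted, target)
    ]


def mismatch_weights(uncertainty, error, threshold=0.5):
    """Weight 1.0 where thresholded uncertainty and error disagree (assumed definition)."""
    return [
        [1.0 if (u > threshold) != (e == 1) else 0.0 for u, e in zip(u_row, e_row)]
        for u_row, e_row in zip(uncertainty, error)
    ]
```

In such a sketch, the resulting weights could multiply a per-pixel loss during the second training so that the mismatch region contributes to the update; the exact form of the weighting is left open here.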

13. A method for estimating a robustness of a neural network in generating segmentation predictions, the method being executed by a processor, the method comprising: obtaining the neural network; obtaining multimodal data associated with a target label; generating, by the neural network, a segmentation prediction for the multimodal data; determining, using the target label and the segmentation prediction, a prediction error; obtaining an uncertainty associated with the neural network; obtaining an attentive network; and determining, by the attentive network, using the prediction error and the uncertainty, an attentive uncertainty indicative of a robustness of the neural network in generating segmentation predictions.

14. The method of claim 13, wherein the obtaining of the uncertainty comprises obtaining an uncertainty map; and wherein the obtaining of the prediction error comprises obtaining of a prediction error map.

15. The method of claim 14, wherein said obtaining of the uncertainty comprises generating the uncertainty map using Bayesian predictive entropy.

16. The method of claim 14, wherein said obtaining of the uncertainty map comprises generating the uncertainty map using an entropy of a softmax distribution.

17. The method of any one of claims 13 to 16, wherein the attentive uncertainty is simultaneously indicative of an uncertainty of the neural network in generating predictions and an error of the neural network in generating predictions.

18. The method of any one of claims 13 to 17, further comprising, prior to said determining of the attentive uncertainty: training the attentive network to determine attentive uncertainties using a cross-entropy loss function.

19. The method of any one of claims 13 to 18, wherein the attentive network comprises an attention layer, and at least one fully-connected layer for processing the uncertainty map and the prediction error map to obtain the attentive uncertainty map.

20. The method of claim 19, wherein the determining of the attentive uncertainty map comprises determining, by the attention layer, an attention map by masking pixels having a prediction error above zero.

21. The method of claim 20, wherein the determining of the attentive uncertainty comprises using a sigmoid activation function on the attention map to obtain the attentive uncertainty.
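By way of non-limiting illustration only, the softmax entropy of claim 16 and the masking and sigmoid steps of claims 20 and 21 may be sketched in Python as follows. The sketch assumes that "masking" retains the uncertainty at pixels whose prediction error is above zero and zeroes it elsewhere, and it stands in for the fully-connected layer with a single hypothetical scalar `weight` and `bias`. These choices, and all function names, are assumptions made for the example and are not taken from the claims.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_entropy(logits):
    """Entropy (in nats) of the softmax distribution for one pixel."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def attention_map(uncertainty, error):
    """Keep uncertainty where the prediction error is above zero (assumed reading)."""
    return [
        [u if e > 0 else 0.0 for u, e in zip(u_row, e_row)]
        for u_row, e_row in zip(uncertainty, error)
    ]


def attentive_uncertainty(attention, weight=1.0, bias=0.0):
    """Apply a hypothetical scalar affine map, then a sigmoid, to each pixel."""
    return [
        [1.0 / (1.0 + math.exp(-(weight * a + bias))) for a in row]
        for row in attention
    ]
```

In a trained attentive network the scalar `weight` and `bias` would be replaced by learned parameters, for example fitted with the cross-entropy loss of claim 18.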

22. A processing device comprising: a processor; and a non-transitory storage medium connected to the processor, the non-transitory storage medium comprising computer-executable instructions, the processor, upon executing the computer-executable instructions, being configured for obtaining a neural network; generating a training dataset, the generating comprising: obtaining a segmented dataset comprising a plurality of multimodal data, obtaining an uncertainty map for each segmented multimodal data of the segmented dataset, the uncertainty map being indicative of a performance of a corresponding segmentation of the neural network, obtaining a segmentation error map for each segmented multimodal data of the segmented dataset, and combining each segmented multimodal data of the segmented dataset with a corresponding uncertainty map and a corresponding segmentation error map so as to provide the training dataset, the training dataset comprising a plurality of multimodal data wherein each multimodal data is segmented using the corresponding uncertainty map and the corresponding segmentation error map; training the neural network using the training dataset to thereby obtain a trained neural network; and providing the trained neural network.

23. The processing device of claim 22, wherein said training of the neural network using the training dataset to thereby obtain the trained neural network is a second training; and wherein the processing device is further configured for, prior to said obtaining of the segmented dataset: obtaining the plurality of multimodal data, each multimodal data comprising a respective segmentation target; and first training the neural network to perform segmentation on the plurality of multimodal data using the respective segmentation targets to obtain the segmented dataset comprising the plurality of segmented multimodal data.

24. The processing device of claim 23, wherein the processor is further configured for generating the uncertainty map using the respective segmentation targets.

25. The processing device of claim 23 or 24, wherein said first training comprises obtaining a first set of weights of the neural network; and wherein said second training comprises obtaining a second set of weights of the neural network to obtain the trained neural network.

26. The processing device of claim 25, wherein the processor is further configured for generating the segmentation error map for each segmented multimodal data using the performed segmentation with the first set of weights and the respective segmentation target.

27. The processing device of any one of claims 23 to 26, wherein said second training of the neural network using the training dataset to thereby obtain the trained neural network comprises: using an attentional kernel on each multimodal data, the attentional kernel being indicative of a mismatch between the corresponding uncertainty map and the corresponding segmentation error map.

28. The processing device of claim 27, wherein said using the attentional kernel comprises using a mismatch region between the corresponding uncertainty map and the corresponding segmentation error map of the multimodal data as a weight.

29. The processing device of claim 28, wherein said using of the mismatch region is in response to determining the mismatch region between the corresponding uncertainty map and the corresponding segmentation error map based on a threshold.

30. The processing device of any one of claims 22 to 29, wherein said generating of the uncertainty map comprises using Bayesian predictive entropy.

31. The processing device of any one of claims 22 to 30, wherein said uncertainty map is a calibrated uncertainty map.

32. The processing device of any one of claims 22 to 31, wherein the multimodal data comprises image data; and wherein the neural network comprises a semantic segmentation network.

33. The processing device of any one of claims 22 to 32, wherein the processor is further configured for obtaining further multimodal data, the neural network not having been trained on the further multimodal data; and providing, using the trained neural network, a segmentation prediction for the further multimodal data.

34. A processing device comprising: a processor; and a non-transitory storage medium connected to the processor, the non-transitory storage medium comprising computer-executable instructions, the processor, upon executing the computer-executable instructions, being configured for obtaining the neural network; obtaining multimodal data associated with a target label; generating, by the neural network, a segmentation prediction for the multimodal data; determining, using the target label and the segmentation prediction, a prediction error; obtaining an uncertainty associated with the neural network; obtaining an attentive network; and determining, by the attentive network, using the prediction error and the uncertainty, an attentive uncertainty indicative of a robustness of the neural network in generating segmentation predictions.

35. The processing device of claim 34, wherein the obtaining of the uncertainty comprises obtaining an uncertainty map; and wherein the obtaining of the prediction error comprises obtaining of a prediction error map.

36. The processing device of claim 35, wherein said obtaining of the uncertainty map comprises generating the uncertainty map using Bayesian predictive entropy.

37. The processing device of claim 35, wherein said obtaining of the uncertainty map comprises generating the uncertainty map using an entropy of a softmax distribution.

38. The processing device of any one of claims 34 to 37, wherein the attentive uncertainty is simultaneously indicative of an uncertainty of the neural network in generating predictions and an error of the neural network in generating predictions.

39. The processing device of any one of claims 34 to 38, wherein the processor is further configured for, prior to said determining of the attentive uncertainty: training the attentive network to determine attentive uncertainties using a cross-entropy loss function.

40. The processing device of any one of claims 34 to 39, wherein the attentive network comprises an attention layer, and at least one fully-connected layer for processing the uncertainty map and the prediction error map to obtain the attentive uncertainty map.

41. The processing device of claim 40, wherein the determining of the attentive uncertainty map comprises determining, by the attention layer, an attention map by masking pixels having a prediction error above zero.

42. The processing device of claim 41, wherein the determining of the attentive uncertainty comprises using a sigmoid activation function on the attention map to obtain the attentive uncertainty.

43. A non-transitory computer readable storage medium for storing computer-executable instructions which, when executed, cause a computer to perform: obtaining a neural network; generating a training dataset, the generating comprising: obtaining a segmented dataset comprising a plurality of multimodal data, obtaining an uncertainty map for each segmented multimodal data of the segmented dataset, the uncertainty map being indicative of a performance of a corresponding segmentation of the neural network, obtaining a segmentation error map for each segmented multimodal data of the segmented dataset, and combining each segmented multimodal data of the segmented dataset with a corresponding uncertainty map and a corresponding segmentation error map so as to provide the training dataset, the training dataset comprising a plurality of multimodal data wherein each multimodal data is segmented using the corresponding uncertainty map and the corresponding segmentation error map; training the neural network using the training dataset to thereby obtain a trained neural network; and providing the trained neural network.