EP4341913A2 - System for detection and management of uncertainty in perception systems, for new object detection and for situation anticipation - Google Patents

System for detection and management of uncertainty in perception systems, for new object detection and for situation anticipation

Info

Publication number
EP4341913A2
Authority
EP
European Patent Office
Prior art keywords
uncertainty
neural network
map
class
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP22730378.1A
Other languages
German (de)
French (fr)
Inventor
Ralph Meyfarth
Sven Fuelster
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Deep Safety GmbH
Original Assignee
Deep Safety GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Deep Safety GmbH filed Critical Deep Safety GmbH
Publication of EP4341913A2

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/50 - Context or environment of the image
    • G06V 20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Definitions

  • the invention relates to a system for detection and management of uncertainty in perception systems. In further embodiments, the invention relates to a system for new object detection and/or for situation anticipation based on detected uncertainty.
  • Computer vision can, for instance, be used in perception systems.
  • deep neural networks learn to recognize classes of objects in images.
  • Computer vision includes simple classification, in which a whole image is classified with the object class of its most dominant object; object detection, in which a neural network predicts bounding boxes around the objects in the image; semantic segmentation, in which every pixel of an image is classified (labeled) with the object class of the object to which it belongs; optical flow, which predicts a field of vectors of movement for the objects shown; as well as derived methods like instance segmentation, panoptic segmentation, video segmentation and more.
  • the neural network needs to be trained by means of a training dataset in order to learn classes of objects or features to predict.
  • this training dataset comprises pairs of an input image and a corresponding label with the desired result (i.e. the object class represented in the input image).
  • a label can be a simple class name for the whole image in the case of classification; or a pixel-wise hand-labelled image in the case of semantic segmentation.
  • in the following, we will most often refer to the case of semantic segmentation.
  • the methods and techniques claimed shall be applicable to all the various computer vision techniques and all the various architectures of neural networks associated with these techniques.
  • the invention relates in particular to semantic segmentation of images by means of a segmenting neural network.
  • Images are typically composed of pixels that represent a picture taken by a camera with a lens that projects an image on an image sensor.
  • the image sensor converts the projected picture into a pixel matrix that represents the picture.
  • a picture or image can for instance be a frame of a video stream.
  • the pixel matrix representing an image can be fed to an input layer of a segmenting neural network and thus be used as an input image pixel matrix.
  • Each input image pixel matrix represents a sample to be processed by the segmenting neural network.
  • Semantic segmentation of an image represented by a pixel matrix serves to assign regions of the image - or, more precisely, each pixel in a respective segment of an image - to recognized objects.
  • For semantic segmentation, typically convolutional neural networks (CNN), in particular fully convolutional neural networks (FCN), are used.
  • Those convolutional neural networks are trained as multi-class classifiers that can detect objects they are trained for in an image.
  • a fully convolutional neural network used for semantic segmentation typically comprises a plurality of convolutional layers for detecting features (i.e. occurrence of objects the CNN is trained for) and pooling layers for down-sampling the output of the convolutional layers in a certain stage. Layers are composed of nodes. Nodes have an input and an output.
  • the input of a node can be connected to some or all outputs of nodes in an anteceding layer, thus receiving output values from the outputs of nodes of the anteceding layer.
  • the values a node receives via its inputs are weighted and the weighted inputs are summed up to thus form a weighted sum.
  • the weighted sum is transformed by an activation function of the node into the output of that node.
  • the weights in the nodes of the layers of the neural network are modified until the neural network provides the desired or expected prediction.
  • the neural network thus learns to predict (e.g. recognize) classes of objects or features it is trained for.
  • a trained neural network implements a model.
  • the convolution is performed by means of filter kernel arrays that have a much smaller dimension than the pixel matrix representing the image.
  • the filter kernel arrays are composed of array elements having weight values.
  • a filter kernel array is moved stepwise over the image pixel matrix and each value of a pixel of the image pixel matrix is element-wise multiplied with the weight value of the respective element of the filter kernel array while the filter kernel array "moves over" the image pixel matrix, thus convolving the image pixel matrix.
  • a plurality of filter kernel arrays are used to extract different low level features.
  • the convoluted output from such convolution in a convolutional layer may be fed as input to a next convolutional layer and again be convoluted by means of filter kernel arrays.
  • the convoluted output is called a feature map. For instance, for each color channel of a color image, a feature map can be created.
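As an illustration of the convolution step described above, the following is a minimal sketch assuming PyTorch (the patent does not prescribe a framework); shapes and channel counts are chosen only for illustration.

```python
# Minimal sketch (assumes PyTorch): filter kernel arrays slid over an input pixel matrix
# produce one feature map per kernel; all sizes below are illustrative only.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)   # input image pixel matrix: batch, RGB channels, H, W

# 16 filter kernel arrays of size 3x3, each holding learnable weight values
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

feature_maps = conv(image)            # convolved output: one feature map per filter kernel
print(feature_maps.shape)             # torch.Size([1, 16, 224, 224])
```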
  • the convoluted output of a convolution layer can be rectified by means of a nonlinear function, for instance a ReLU (Rectified Linear Unit) operator.
  • the ReLU operator is a preferred activation function of the nodes of the convolutional layer, which eliminates all negative values from the output of the convolutional layer.
  • the non-linear function enables the network to learn non-linear relations.
  • ReLU(z) = max(0, z); cf. https://www.deeplearningbook.org/contents/mlp.html, Eq. 6.37.
  • the ReLU operators are part of the respective convolutional layer. In order to reduce the dimension of the feature map by way of down-sampling, pooling layers are used. One way of down-sampling is called Max-Pooling.
  • each 2*2 sub-array of the feature map is replaced by a single value that corresponds to the maximum value of the four elements of the 2*2 sub-array.
  • the down-sampled feature map can again be processed in a convolutional layer.
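The rectification and 2x2 max-pooling described above can be sketched as follows (a minimal example assuming PyTorch; the shapes are illustrative).

```python
# Minimal sketch (assumes PyTorch): ReLU rectification and 2x2 max-pooling.
import torch
import torch.nn as nn

feature_maps = torch.randn(1, 16, 224, 224)   # output of a convolutional layer (illustrative)

relu = nn.ReLU()                              # ReLU(z) = max(0, z): removes negative activations
pool = nn.MaxPool2d(kernel_size=2)            # replaces each 2x2 sub-array by its maximum value

rectified = relu(feature_maps)
downsampled = pool(rectified)
print(downsampled.shape)                      # torch.Size([1, 16, 112, 112]) - halved resolution
```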
  • the feature map from the final convolutional layer or Pooling-layer is a score map.
  • For each object class the neural network is trained for, a feature score map is generated and therefrom a segment map. If the neural network is for instance trained for five object classes (e.g. cars, traffic signs, human beings, street markings, street borders), then five feature score maps are generated.
  • the scores in one feature score map represent, for one of the object classes the fully convolutional network is trained for, the likelihood that an object of this object class is represented in the input image matrix.
  • the objects represented in an input image pixel matrix are "recognized" and a feature score map is generated for each object class wherein for each (down-sampled) pixel a score is formed that indicates the likelihood that a pixel represents an object of the object class.
  • the scores correspond to activation levels of elements of the feature score map.
  • the scores of the elements of the feature score maps can be compared with each other in order to assign each element or pixel, respectively to one of the known object classes.
  • a segment map can be generated wherein the elements are labeled with labels indicating to which of the object classes the segmenting neural network is trained for an individual pixel may belong.
  • each feature score map can be normalized on a scale between 0 and 1; cf. https://www.deeplearningbook.org/contents/mlp.html, Eq. 6.29.
  • the normalized scores for corresponding elements of the feature score maps can be compared and the element can be assigned to that object class that corresponds to the score map having the highest score for the respective element. This is done by means of a maximum likelihood estimator using an Argmax function; cf. https://www.deeplearningbook.org/contents/ml.html, Eq. 5.56.
  • a segment map can be generated from the feature score maps by way of comparing the scores for corresponding elements in the feature score maps.
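A minimal sketch of this normalization-and-comparison step, assuming PyTorch; the five classes mirror the example above and are otherwise arbitrary.

```python
# Minimal sketch (assumes PyTorch): turning per-class feature score maps into a segment map.
import torch
import torch.nn.functional as F

# five feature score maps, e.g. cars, traffic signs, human beings, street markings, street borders
scores = torch.randn(1, 5, 128, 128)   # (batch, object classes, H, W)

probs = F.softmax(scores, dim=1)       # normalize the scores of each element onto a 0..1 scale
segment_map = probs.argmax(dim=1)      # maximum likelihood estimate: winning class per element
print(segment_map.shape)               # torch.Size([1, 128, 128]), values in 0..4
```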
  • the score for each element of a feature score map for a respective object class represents an activation level of that element.
  • the final feature score maps are up-sampled to assign to each pixel of the input pixel matrix a score that indicates the likelihood that a pixel represents an object of a known object class.
  • Such up-sampling can be achieved by means of bilinear interpolation.
  • up-sampling can be achieved by a decoder.
  • all architectures are encoder-decoder architectures, where down-sampling and up-sampling mean the process of learning simple, more abstract features from pixels in the first convolutional layer, learning complex features from simple features in the next layer, and so on, and learning the same process vice versa in the decoder.
  • the up-sampling by means of bilinear interpolation is done because the size of the feature score maps of the last layer is not necessarily the same as the size of the input image, and - at least in training - the size of the output needs to match the size of the labels which have the same size as the input.
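A minimal sketch of bilinear up-sampling of the final feature score maps to the input resolution, assuming PyTorch; the sizes are illustrative.

```python
# Minimal sketch (assumes PyTorch): bilinear up-sampling so that each pixel of the
# input image pixel matrix receives a score.
import torch
import torch.nn.functional as F

scores = torch.randn(1, 5, 32, 64)     # low-resolution feature score maps from the encoder
upsampled = F.interpolate(scores, size=(256, 512), mode="bilinear", align_corners=False)
print(upsampled.shape)                 # torch.Size([1, 5, 256, 512]) - matches the input size
```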
  • a segment map is generated from the feature score maps. To each element of the segment map a label is assigned that indicates for which of the known object classes the highest activation level is found in the feature score maps.
  • the output of a final convolutional layer, ReLU-layer or pooling-layer can be fed into a fully connected layer that produces an output vector wherein the values of the vector elements represent a score that indicates the likelihood that a respective object is present in the analyzed image.
  • the output vector of such classifying CNN is thus a feature vector.
  • a neural network for semantic segmentation does not have such a fully connected layer because this would destroy the information about the location of objects in the input image matrix.
  • the neural network needs to be trained by means of training data sets comprising image data and labels that indicate what is represented by the image data (ground truth).
  • the image data represent the input image matrix and the labels represent the desired output.
  • the weights and the filter kernel arrays are iteratively adapted until the difference between the actual output of the CNN and the desired output is minimized.
  • the difference between the actual output of the CNN and the desired output is calculated by means of a loss function. From a training dataset, containing pairs of an input image and a ground truth image that consists of the correct class labels, the neural network computes the class predictions for each pixel in the output image.
  • a loss function compares the input class labels with the predictions made by the neural network and then pushes the parameters - i.e. the weights in the nodes of the layers - of the neural network in a direction that would have resulted in a better prediction.
  • the neural network will learn the abstract concepts of the given classes.
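As an illustration of the training step described above, a minimal sketch assuming PyTorch; `model` is a placeholder for any segmenting neural network, not the architecture of the invention.

```python
# Minimal sketch (assumes PyTorch): one training step with a pixel-wise cross-entropy loss.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 5, kernel_size=1)        # placeholder for a real segmenting neural network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()               # compares predictions with ground-truth labels

images = torch.randn(4, 3, 128, 128)          # input image pixel matrices
labels = torch.randint(0, 5, (4, 128, 128))   # pixel-wise ground-truth class labels

logits = model(images)                        # per-pixel class scores
loss = loss_fn(logits, labels)                # difference between prediction and desired output
loss.backward()                               # back-propagation: push weights towards a better prediction
optimizer.step()
optimizer.zero_grad()
```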
  • semantic segmentation results in assigning pixels of an image to known objects, i.e. objects the neural network was trained for.
  • Mere semantic segmentation cannot discriminate between several instances of the same object in an image. If, for instance, in an image several cars are visible, all pixels belonging to a car are assigned to the object "car". The individual cars - i.e. instances of the object "car" - are not discriminated. Discriminating instances of objects of an object class requires instance segmentation. Pixels that do not belong to an object that a neural network is trained for will nevertheless be assigned to one of the object classes the neural network was trained for, and the score might even be high. Even with low scores, the object class with the highest relative score would "win", i.e. the pixel will be assigned to the object class with the highest score.
  • a perception system comprising a segmenting neural network and an uncertainty detector.
  • the perception system can be part of a vehicle, for instance a car, to enable autonomous driving.
  • the perception system can also be part of various autonomous machines, for instance robots, cargo systems or other machines that are designed to operate at least in part autonomously.
  • the segmenting neural network is configured and trained for segmentation of an input image pixel matrix to thus generate a segment map composed of elements that correspond to the pixels of the input image pixel matrix.
  • each element of the segment map is assigned to one of a plurality of object classes the segmenting neural network is trained for by way of class prediction.
  • Elements being assigned to the same object class form a segment of the segment map.
  • the uncertainty detector is configured to generate an uncertainty score map (also called "uncertainty map") that is composed of elements that correspond to the pixels of the input image pixel matrix.
  • Each element of the uncertainty map has an uncertainty score that is determined by the uncertainty detector and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map.
  • the uncertainty detector is configured to access feature score maps generated by the segmenting neural network prior to generating the segment map from said feature score maps.
  • the uncertainty detector is configured to determine the amount of uncertainty and thus the uncertainty score for each element of the uncertainty score map by determining a variance of the activation levels of elements of the feature score maps.
  • the uncertainty detector is configured to determine the amount of uncertainty and thus the uncertainty score for each element of the uncertainty score map by determining an inter-sample variance of the activation levels of an element of a feature score map in different samples of the feature score map.
  • the uncertainty detector is configured to generate inter-sample variances of the activation levels for an element of a feature score map in different samples of the feature score map by processing an input image pixel matrix with the segmenting neural network in multiple passes while for each pass the segmenting neural network is randomly modified.
  • the random modification of the segmenting neural network includes at least one of dropping nodes from a convolutional layer of the segmenting neural network, altering activation functions in at least some nodes of at least some layers of the segmenting neural net-work and/or introducing noise in at least some nodes of at least some layers of the segmenting neural network.
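The following is a minimal sketch of one of the random modifications listed above, Monte-Carlo dropout, assuming PyTorch; the network, dropout rate, and number of passes are illustrative, and the per-element variance over the passes is the inter-sample variance used as uncertainty score.

```python
# Minimal sketch (assumes PyTorch): multiple passes with randomly dropped nodes (MC dropout);
# the variance of the activation levels over the passes is the inter-sample variance.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Dropout2d(p=0.2),                      # nodes dropped at random in every forward pass
    nn.Conv2d(16, 5, 1),
)
model.train()                                 # keep dropout active during inference (MC dropout)

image = torch.randn(1, 3, 128, 128)
with torch.no_grad():
    samples = torch.stack([model(image).softmax(dim=1) for _ in range(20)])  # 20 passes

inter_sample_variance = samples.var(dim=0)           # variance per class and element
uncertainty_map = inter_sample_variance.mean(dim=1)  # aggregate over classes -> (1, H, W)
```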
  • the uncertainty detector is configured to determine an inter-sample variance of the activation levels of corresponding elements of feature score maps that are generated from consecutive image pixel matrices corresponding to frames of a video sequence. Since in this case the samples are frames of a video sequence, the inter-sample variances correspond to variances between frames and thus are also called "inter-frame variance" hereinafter.
  • when determining inter-frame variance, one pass of determining inter-sample variance from n consecutive images will process frames a...a+n and then restart with frames a+n+1...a+2n.
  • the uncertainty detector is configured to determine the amount of uncertainty and thus the uncertainty score for each element of the uncertainty score map by determining inter-class variances between activation levels of corresponding elements of feature score maps for different object classes as provided by the segmenting neural network.
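A minimal sketch of the inter-class variant, assuming PyTorch: the variance is taken across the class dimension of a single forward pass, so no repeated sampling is needed; the softmax-confidence alternative is shown as well.

```python
# Minimal sketch (assumes PyTorch): inter-class variance of the activation levels of
# corresponding elements of the feature score maps, from a single forward pass.
import torch

scores = torch.randn(1, 5, 128, 128)             # feature score maps for 5 object classes
probs = scores.softmax(dim=1)

inter_class_variance = probs.var(dim=1)          # low variance: no class clearly "wins" -> high uncertainty
uncertainty_map = 1.0 - probs.max(dim=1).values  # alternative: softmax confidence turned into uncertainty
```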
  • the uncertainty detector comprises a generative neural network, in particular a variational autoencoder that is trained for the same classes or objects, the segmenting neural network is trained for.
  • the uncertainty detector is configured for detecting a region in the uncertainty score map that is composed of elements having a high uncertainty score and labeling this region as a candidate for representing an object of a yet unknown object class.
  • High uncertainty scores can be uncertainty scores that are higher than an average uncertainty score of all elements of the uncertainty score map.
  • High uncertainty scores can be uncertainty scores that are higher than a median of the uncertainty scores of all elements of the uncertainty score map.
  • High uncertainty scores can be uncertainty scores that exceed an average uncertainty score or a median of the uncertainty scores of all elements of the uncertainty score map by a predetermined amount.
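A minimal sketch of such a threshold, assuming NumPy; the margin value is an assumption for illustration, not a prescribed parameter.

```python
# Minimal sketch (assumes NumPy): marking elements whose uncertainty score exceeds the mean
# (or median) of all elements by a predetermined amount as candidates for an unknown object.
import numpy as np

uncertainty_map = np.random.rand(128, 128)    # per-element uncertainty scores (illustrative)

margin = 0.2                                  # predetermined amount; tuning is application specific
threshold = uncertainty_map.mean() + margin   # alternatively: np.median(uncertainty_map) + margin
candidate_mask = uncertainty_map > threshold  # True where elements may represent an unknown object
```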
  • the uncertainty detector is configured for verifying that the region that is labeled as a candidate for representing an object of a yet unknown object class indeed represents an object of a yet unknown object class by determining the plausibility that a segment of the input image pixel matrix corresponding to the labeled region represents an unknown object.
  • the perception system is configured to reconfigure the segmenting neural network by adding a model for a new object class for instance by way of introducing or configuring further layers with weights that represent the new object class, wherein the new object class includes the so far unknown object represented in the segment found to represent a so far unknown object.
  • creating a new segmenting neural network that is configured for predicting new object classes (i.e. it incorporates a model for a new object class, said model being created by training of the neural network) or reconfiguring the existing segmenting neural network so as to enable the existing segmenting neural network to predict objects of an additional new object class can be part of a domain adaptation that extends an operation domain of the perception system.
  • Reconfiguring the existing segmenting neural network preferably includes the training of the segmenting neural network with input image pixel matrices representing the so far unknown object in combination with a label for the then new object class (ground truth for the so far unknown and then new object class). Labels can be generated automatically for a segment representing an unknown object.
  • the existing segmenting neural network or, preferably, a second, similar companion neural network is trained with the newly determined, yet unknown object class, either in the cloud (by uploading the newly detected object class and, after training, downloading the trained neural network or the trained similar companion neural network) or, for instance, right on the perception system of an autonomous vehicle.
  • the newly detected object class can for instance be directly shared with other autonomous vehicles, so each vehicle can train their segmenting neural network or similar companion neural network on a larger number of yet unknown object classes without having to depend on an update from the cloud.
  • Such training over multiple autonomous vehicles can be parallelized by means of implementing a distributed data parallel training (as explained in: PyTorch Distributed: Experiences on Accelerating Data Parallel Training, https://arxiv.org/pdf/2006.15704.pdf). Any such training will preferably be conducted by means of few-shot learning (as explained in: An Overview of Deep Learning Architectures in Few-Shot Learning Domain, https://arxiv.org/pdf/2008.06365.pdf).
  • the assignment of individual yet unknown objects to new object classes can be automated by their similarity, preferably determined by means of unsupervised segmentation (as explained in: Unsupervised Learning of Image Segmentation Based on Differentiable Feature Clustering, https://arxiv.org/pdf/2007.09990.pdf).
  • a newly created model for a new object class can be associated automatically to already existing models for known object classes by similarity, by way of determining the similarity by means of one-shot Learning (as explained in: Fully Convolutional One-Shot Object Segmentation for Industrial Robotics, https://arxiv.org/pdf/1903.00683.pdf).
  • in one-shot learning, the elements of the abstract feature vector of the neural network which conducts the one-shot learning have a semantic structure, e.g. the feature vector might consist of textual descriptions of the objects, in which e.g. the description of a rickshaw would be similar to the description of a car, and thus the newly determined object class for a rickshaw can be treated as car-like by the autonomous vehicle without the need of manual association.
  • a method for semantic segmentation of input image pixel matrices by way of class prediction performed by a segmenting neural network and for determining an amount of uncertainty involved in the class predictions for each pixel of an input image pixel matrix comprises segmenting an input image pixel matrix by means of a segmenting neural network that is trained for a plurality of object classes and that generates for each object class a feature score map and therefrom a segment map for the input image pixel matrix by assigning elements of the segment map to one of a plurality of object classes the segmenting neural network is trained for by way of class prediction, elements being assigned to the same object class forming a segment of the segment map, and generating an uncertainty score map (short: uncertainty map) composed of elements that correspond to the pixels of the input image pixel matrix, each element of the uncertainty map having an uncertainty score that is determined by the uncertainty detector and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map.
  • the method further comprises detecting a region in the uncertainty score map that is composed of elements having a high uncertainty score and labeling the region that is composed of elements having a high uncertainty score as a candidate for representing an object of a yet unknown object class, a high uncertainty score being an uncertainty score that is higher than an average uncertainty score of all elements of the uncertainty score map.
  • the method further comprises creating a new object class if a region that is composed of elements having a high uncertainty score is detected, said new object class representing objects as shown in a region of the input image pixel matrix corresponding to the regions in the uncertainty score map that are composed of elements having a high uncertainty score.
  • the method for generating new object classes thus may include: detecting a region in the uncertainty score map that is composed of elements having a high uncertainty score; generating a tentative new object class and automatically generating a label; and recognizing an existing object class or another new object class being similar to the tentative new object class, for instance by means of few-shot learning.
  • the method may comprise generating a tentative new object class in case an unknown object is detected. If a further unknown object is detected, the method comprises assigning the further unknown object to the existing tentative new object class or to a further new object class, depending on the similarity of the unknown objects. In case a further unknown object can be assigned to a previous unknown object (based on the similarity of the objects), a new object class (that is not tentative anymore) can be generated by one-shot learning or few-shot learning using samples that represent new objects that are similar to each other.
  • the method for generating new object classes may include: generating feature score maps from input image pixel matrices (samples) captured at various instants in time; detecting a region in the uncertainty score map that is composed of elements having a high uncertainty score, said region representing a first unknown object; generating a tentative new object class for said unknown object and automatically generating a label; detecting a further region in a further uncertainty score map that is composed of elements having a high uncertainty score, said region representing a further unknown object; determining a similarity between the first and the further unknown object; and, if the similarity exceeds a predetermined threshold, generating a non-tentative new object class.
  • generating a non-tentative new object class includes one-shot learning or few-shot learning using samples (i.e. input image pixel matrices) that represent the first unknown object and the further unknown object.
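One possible way to decide whether two unknown objects are similar enough to share a new object class is to compare feature embeddings of the corresponding regions. The sketch below assumes such embeddings are available (e.g. from the encoder of the segmenting network); the function name and threshold are hypothetical, not prescribed by the patent.

```python
# Minimal sketch (assumptions: 1-D embeddings of the cropped unknown-object regions exist;
# the similarity threshold of 0.8 is illustrative only).
import torch
import torch.nn.functional as F

def same_new_class(embedding_a: torch.Tensor, embedding_b: torch.Tensor,
                   threshold: float = 0.8) -> bool:
    """Return True if two unknown objects are similar enough to share one new object class."""
    similarity = F.cosine_similarity(embedding_a, embedding_b, dim=0)
    return similarity.item() > threshold

# if same_new_class(emb_first, emb_further): promote the tentative class to a non-tentative one
```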
  • No explicit labels are needed; only automatically generated labels are used for generating new object classes (for instance by means of a new object detector) and for transfer learning of new object classes (for instance as generated by the new object detector).
  • a relevance of a new object class can be determined based on a user reaction in the context of encountering an unknown, potentially new object. This is explained in more detail further below.
  • User reaction in the context of encountering an unknown, potentially new object can be determined based on optical flow, an inertia sensor (gyroscope) or from signals on the CAN bus of a vehicle.
  • New object classes thus generated fully automatically can be used for training and thus updating the semantic segmentation model implemented by the segmenting neural network of the perception system.
  • a region that is composed of elements having a high uncertainty score can correspond to a part of a segment of the segment map. For instance, if a part of an input image pixel matrix represents a yet unknown object, the pixels of this part will be assigned to a known object class by the segmenting neural network. However, the pixels representing the unknown object (which does not correspond to any of the object classes the segmenting neural network was trained for) will typically exhibit a higher uncertainty score and thus the yet unknown object can be found by means of the uncertainty score map.
  • the uncertainty score is determined by determining an inter-class variance of the activation levels of elements of the feature score maps for different object classes.
  • the uncertainty score is determined by determining an inter-sample variance of the activation levels of elements of different samples of a feature score map for one object class.
  • a further aspect of the invention is the use of the method and its embodiments as illustrated herein in an autonomous machine, in particular in an autonomous vehicle.
  • a system for detection and management of uncertainty in perception systems comprises a data processing unit implementing a segmentation machine comprising a trained segmenting neural network for semantic segmentation of input image matrices.
  • the neural network comprises convolutional layers and is configured to perform a semantic segmentation of an input image pixel matrix.
  • the semantic segmentation is based on multiple object classes the neural network is trained for.
  • the system further comprises an uncertainty detector that is configured to determine an uncertainty score for each pixel, a group of pixels or a segment.
  • the uncertainty score reflects how (un-) certain it is that a pixel of a segment represents the object the segment is assigned to by the segmenting neural network.
  • in case of a low level of certainty, i.e. a high level of uncertainty, the reason may be that the segment represents an object (i.e. an object of a yet unknown object class) the neural network of the segmentation machine was not trained for, i.e. an unknown object or object class, respectively.
  • the uncertainty detector can be configured to generate an uncertainty score map wherein for each pixel of the segment map (and thus, typically, for each pixel of the input image pixel matrix) an uncertainty score value is assigned.
  • the dimension of the uncertainty score map matches the dimension of the segment map generated by the segmenting neural network.
  • each element of the uncertainty score map directly corresponds to an element of the segment map and - preferably - also to one pixel of the input image pixel matrix.
  • the uncertainty detector is connected to the segmenting neural network for semantic segmentation of input image matrices.
  • the system comprises a segmenting neural network that generates a segment map for each input image pixel matrix and an uncertainty detector that generates an uncertainty score for each pixel of an input image pixel matrix and/or for each element of the segment map.
  • the uncertainty detector can be configured to determine uncertainty from the feature score maps provided by the segmenting neural network for different object classes by way of analyzing the inter-class variance of the activation levels (feature scores) of the pixels of a segment.
  • the uncertainty detector can be configured to determine a softmax confidence.
  • An input image pixel matrix is processed by the segmenting neural network multiple times and each time the segmenting neural network is randomly modified thus providing varying feature score maps. Thus, a number of varying samples of the feature score map are generated.
  • the inter-sample variations of the feature score map samples depend on the influence of the variation of the nodes on the scores in the score maps.
  • Processing an input image pixel matrix multiple times with the segmenting neural network while randomly modifying the segmenting neural network renders the segmenting neural network into a Bayesian neural network implementing variational inference.
  • multiple frames of a video sequence can be used as input image pixel matrices. Each frame of a video sequence is an input image pixel matrix.
  • the segmenting neural network is randomly modified.
  • the system is further configured to: use the segmenting neural network to generate samples of semantically segmented image pixel matrices; generate inter-frame variances from a sensor data stream (i.e. a video stream or a video sequence) consisting of frames, i.e. mapping corresponding pixels of consecutive frames and taking each or every n-th consecutive frame as a sample instead of sampling each frame multiple times; analyze inter-class and/or inter-frame variances in the per-pixel activation levels with respect to the individual object classes the segmenting neural network was trained for, i.e. analyze the per-pixel scores for the individual object classes; determine uncertainty scores by assessing a level or amount of uncertainty based on the analysis of the inter-class and/or inter-frame variances in the per-pixel activation levels in the semantically segmented image pixel matrix; and identify segments showing unknown objects from the determined uncertainty.
  • Mapping of corresponding pixels of consecutive frames of a video sequence can include a determination of a displacement of corresponding pixels between two frames, e.g. based on position sensor data that can be gathered for instance by means of an inertia sensor.
  • a movement of a video camera providing a video sequence can be determined, enabling a determination of displacement vectors for the pixels between two frames that shall be used for determining inter-frame variance.
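A minimal sketch of inter-frame variance, assuming PyTorch; for brevity it assumes the frames have already been aligned (in practice corresponding pixels must first be mapped, e.g. using displacement vectors derived from camera motion), and the number of frames and classes is illustrative.

```python
# Minimal sketch (assumes PyTorch): inter-frame variance over n consecutive, already aligned
# frames of a video sequence; the pixel-correspondence mapping step is omitted here.
import torch

# per-frame feature score maps produced by the segmenting network for frames a .. a+n
frame_scores = [torch.randn(1, 5, 128, 128) for _ in range(8)]

samples = torch.stack([s.softmax(dim=1) for s in frame_scores])  # (n, 1, classes, H, W)
inter_frame_variance = samples.var(dim=0)                        # variance per class and element
uncertainty_map = inter_frame_variance.mean(dim=1)               # per-element uncertainty (1, H, W)
```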
  • the system is configured to use more than one source of uncertainty at the same time, e.g. the inter-class uncertainty and the inter-frame uncertainty, and to merge these uncertainties, e.g. by means of a weighted sum where the weights depend on the number of samples already taken. For instance, using the first 2 or 3 frames for determining inter-frame variance might not yield a good result yet, so for steps 1...m in the sequence only the inter-class variance might be used, and for frames m+1...n the weight on the inter-frame variance can be increased and the weight on the inter-class variance in the weighted sum of both can be decreased.
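A minimal sketch of such a sample-dependent weighted sum, assuming NumPy; the function name and the values of m and n are assumptions for illustration.

```python
# Minimal sketch (assumes NumPy): merging inter-class and inter-frame uncertainty by a weighted
# sum whose weights depend on the number of frames already sampled, as outlined above.
import numpy as np

def merge_uncertainty(inter_class: np.ndarray, inter_frame: np.ndarray,
                      frames_sampled: int, m: int = 3, n: int = 10) -> np.ndarray:
    """For the first m frames rely on inter-class uncertainty only; afterwards shift the
    weight gradually towards the inter-frame uncertainty."""
    if frames_sampled <= m:
        w_frame = 0.0
    else:
        w_frame = min(1.0, (frames_sampled - m) / float(n - m))
    return (1.0 - w_frame) * inter_class + w_frame * inter_frame
```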
  • alternatively, a Bayesian filter can be used, e.g. a particle filter, to merge these uncertainties; instead of a weight, a confidence score in the uncertainty scores for each of the two types of uncertainty (inter-class uncertainty and inter-frame uncertainty) can be provided, and the Bayesian filter would yield a confidence score in the resulting uncertainty.
  • the system further is configured to discriminate different unknown objects through analyzing uncertainty.
  • the system further is configured to create new object classes based on an analysis of the uncertainty of pixels in segments that are determined to show unknown objects.
  • the segmenting neural network preferably is a fully convolutional network (FCN).
  • the uncertainty can be aleatoric or epistemic (systemic).
  • Aleatoric uncertainty arises from statistical variations in the environment. Aleatoric uncertainty often occurs at the border of segments in a segment map.
  • Epistemic uncertainty, also known as systemic uncertainty, results from a mismatch in a model, for instance a model as implemented by a neural network.
  • Uncertainty can be determined and quantified by analyzing the amount of variance of activation levels (i.e. scores in the feature score maps) over multiple samples of the feature score maps that are generated by the segmenting neural network from one input image pixel matrix or a number of similar input image pixel matrices. For each segmentation (i.e. each pass) the segmenting neural network is modified to create variance in the segmenting neural network, so the segmenting neural network becomes a Bayesian neural network that implements variational inference. In a Bayesian neural network the input image matrix is processed in a plurality of passes wherein for each pass some nodes of the CNN are dropped out. This technique is known as variational inference.
  • the activation levels vary from pass to pass.
  • this variance of the scores, i.e. of the activation levels, is used as a measure of uncertainty.
  • Gaussian noise can be applied to the signals at the input nodes, the weights, or the activation functions, for variational inference; see the thesis of Yarin Gal, http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf.
  • Variance of activation levels can also arise from using a sequence of pictures of the same scene taken from different locations or at different points of time.
  • Such sequence can be frames of a video sequence recorded by a video camera of a moving vehicle. This aspect is referred to as "inter-frame variance" as mentioned earlier in this text.
  • if the neural network is not a Bayesian neural network, uncertainty can be determined by determining the amount of variance or a pattern of variance of the activation levels. Uncertainty can also be determined via a reconstruction error of a generative neural network.
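A minimal sketch of the reconstruction-error variant, assuming PyTorch and some already trained autoencoder passed in as `autoencoder` (a placeholder, not a specific model of the invention).

```python
# Minimal sketch (assumes PyTorch; `autoencoder` is an assumed, already trained generative model):
# the per-pixel reconstruction error serves as an alternative uncertainty signal; regions the
# autoencoder cannot reconstruct well are likely outside the training distribution.
import torch

def reconstruction_uncertainty(autoencoder, image: torch.Tensor) -> torch.Tensor:
    """image: (1, C, H, W); returns a per-pixel uncertainty map of shape (1, H, W)."""
    with torch.no_grad():
        reconstruction = autoencoder(image)
    return (image - reconstruction).abs().mean(dim=1)   # mean absolute error over channels
```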
  • Unknown objects are objects the neural network is not trained for or that the neural network does not recognize.
  • level or "amount of uncertainty of a pixel” a degree of uncertainty is meant that is determined by analyzing the variance of the activation levels (i.e. scores) of the elements of the score maps for the object classes.
  • the elements (i.e. each single element) of the feature score maps (arrays) for the object classes each relate to one or more pixel of the input image matrix. Accordingly, the amount of uncertainty - and thus the uncertainty scores - that can be determined from the activation levels (i.e. scores) of the elements of the score maps also is the amount of uncertainty of the image pixel(s) that correspond to the respective elements of the score maps.
  • Preferred ways of determination of uncertainty are:
  • the uncertainty scores can be determined based on the amount of the variance or of a pattern of the variance, or of the kind of variance (i.e. epistemic or aleatoric) or on the temporal variation of the variance of the activation levels (i.e. scores of the feature score maps) or of a combination of these parameters.
  • the uncertainty determined from the activation levels can be mapped on the input image pixel matrix according to the size of the inputs or the size of the segment map.
  • the existence of a segment that represents an unknown object can be determined based on the distribution of the amount of uncertainty (i.e. the distribution of the uncertainty scores in the uncertainty score map) assigned to pixels of the input image matrix wherein the amount of uncertainty (i.e. the uncertainty score of each element/pixel) is determined based on the variance of the activation levels of elements of the feature score maps that relate to the pixels of the input image matrix.
  • An image segment representing an unknown object can be determined by determining a cluster of pixels with a high uncertainty score (i.e. for which a high amount of uncertainty is found) or by determining contiguous pixels having a similar amount of uncertainty (and thus a similar uncertainty score).
  • the amount of uncertainty and thus the uncertainty score of a pixel is determined by determining the variance of the activation levels of those elements of the feature score maps that correspond to the pixel in the input image pixel matrix.
  • the size of a segment representing an unknown object can be determined from the size of a cluster of pixels with a high uncertainty score or by the length of the outline of a field of contiguous pixels having a similar uncertainty score.
  • the position of a segment representing an unknown object can be determined from the position of a cluster of pixels having a high uncertainty score or by determining the position of the outline of contiguous pixels exhibiting a similar uncertainty score.
  • the relative movement of a segment representing an unknown object can be determined from the temporal change of the position of a cluster of pixels having a high uncertainty score or of the position of an outline of contiguous pixels exhibiting a similar uncertainty score.
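A minimal sketch of locating such a cluster of contiguous high-uncertainty pixels and deriving its size and position, assuming NumPy and SciPy; the threshold margin is illustrative.

```python
# Minimal sketch (assumes NumPy and SciPy): a segment that may represent an unknown object is
# found as a connected cluster of high-uncertainty pixels, together with its size and position.
import numpy as np
from scipy import ndimage

uncertainty_map = np.random.rand(128, 128)             # illustrative per-pixel uncertainty scores
candidate_mask = uncertainty_map > uncertainty_map.mean() + 0.2

labeled, num_clusters = ndimage.label(candidate_mask)  # connected components of high uncertainty
for cluster_id in range(1, num_clusters + 1):
    ys, xs = np.nonzero(labeled == cluster_id)
    size = ys.size                                      # size of the cluster in pixels
    position = (ys.mean(), xs.mean())                   # centroid as position of the candidate segment
```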
  • the relative distance between a camera generating the input pixel matrix and an unknown object represented in a segment can be determined from the temporal change of the variance of the level of activations.
  • a 3D position of the unknown object from a sequence of 2D input image pixel matrices can be computed.
  • the relative distance between a camera generating the input pixel matrix and an unknown object represented in a segment can be determined by an additional depth estimation model.
  • Since the segmenting neural network will assign every pixel to a known object class, unknown objects are not recognized by the segmenting neural network. However, pixels that represent a yet unknown object can be determined by means of analyzing uncertainty.
  • Ways to make it plausible that a segment of the input image matrix represents an unknown object - and thus ways to find a plausible confirmation that segments of pixels with a high amount of uncertainty represent an unknown object - are the following:
  • the plausibility of a segment representing an unknown object can be derived from the determined outline of the segment and its correlation with the segment prediction, i.e. the semantic segmentation, produced by the neural network. If an entire segment is comprised of pixels with a high uncertainty score (i.e. the majority of pixels within a segment outline) it is likely that the segment represents an unknown object.
  • the plausibility of a segment representing an unknown object can be determined from the detected outline of the segment and its correlation with an alternative segmentation of the input image matrix, for instance by means of an auto-encoder.
  • the plausibility of a segment representing an unknown object can be determined from the temporal variation of the variance of the activation levels.
  • the plausibility of a segment representing an unknown object can be determined by means of a comparison of that segment with another segment representing an unknown object found by a coexisting, redundant subsystem for semantic segmentation.
  • the plausibility of a segment representing an unknown object can be determined based on a comparison with another segment representing an unknown object found by a different system for semantic segmentation that uses an input image matrix representing the same scenery taken from a different perspective.
  • Such comparison includes a transformation of the two segments representing an unknown object in a common, e.g. global coordinate system.
  • a system comprising a primary segmentation system and a separate autonomous device.
  • the primary segmentation system comprises a primary perception system comprising a segmenting neural network that is configured and trained for segmentation of an input image pixel matrix to thus generate a segment map.
  • Each element of the segment map is assigned to one of a plurality of object classes the segmenting neural network is trained for by way of class prediction. Elements that are assigned to the same object class form a segment of the segment map.
  • the primary segmentation system may be part of an autonomous driving system (ADS) and typically does not comprise an uncertainty detector.
  • the separate, autonomous device comprises a sensor for generating an input image pixel matrix comprised of matrix elements, a segmenting neural network and an uncertainty detector.
  • the uncertainty detector is configured to generate an uncertainty score map composed of matrix elements that correspond to the pixels of the input image pixel matrix. Each matrix element of the uncertainty map has an uncertainty score that is determined by the uncertainty detector and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map.
  • the segmenting neural network of the autonomous device preferably is trained for the same object classes as the segmenting neural network of the primary system.
  • the autonomous device preferably further comprises signaling means that are configured to generate a user-perceivable signal in case the uncertainty map generated by the uncertainty detector of the autonomous device comprises regions comprised of matrix elements that exhibit an uncertainty score above a threshold and thus represent edge cases for the object classification and image segmentation. Edge cases may represent yet unknown objects and thus are candidates for new object classes.
  • the autonomous device can determine whether or not the segmentation achieved by the primary perception system is reliable. In particular, the autonomous device can determine whether the segmentation score map generated by the segmenting neural network of the primary segmentation system contains regions that represent edge cases, for instance objects the primary segmentation system was not trained for.
  • the autonomous device preferably comprises a new object detector that is operatively connected to the uncertainty detector of the autonomous device.
  • the new object detector is configured to find a region in the uncertainty score map that is composed of elements having a high uncertainty score and generate a new object class for such found region.
  • the autonomous device can be configured to exchange labeled data sets comprising segments assigned to a newly generated object class with other autonomous devices to thus increase the number of known object classes available for semantic segmentation by the autonomous devices.
  • the label for a newly generated object class may be generated automatically.
  • the autonomous device may be configured to record a reaction in response to a warning signal emitted by the autonomous device.
  • the type of reaction can be used as input data for discriminating relevant unknown objects from less relevant unknown objects. A significant reaction indicates a high relevance of an unknown object.
  • the autonomous device can be configured for learning to discriminate unknown objects by automatically generating new object classes. Furthermore, the autonomous device can be configured for learning a relevance level for each (new) object class and thus can adopt the warning signal level to the object class. If the autonomous device determines the presence of an object (by way of image segmentation) in a region of interest of the input image pixel matrix, the warning signal can be generated in dependence of the relevance of the recognized object.
  • New object classes may be labeled automatically such that the label represents the relevance level. For instance, the relevance level may be used as a label for each new object class.
  • the autonomous device can be configured to exchange data sets comprising data representing the relevance level of a known object class or an observed user reaction (observed behavior) with other autonomous devices.
  • the observed user reaction can be used for automatically generating a label for a new object class.
  • the autonomous device may comprise a data interface, in particular a wireless data interface that allows data exchange over e.g. the internet.
  • the autonomous device preferably is a pocketable, mobile device that a user can easily carry and that can easily be mounted, for instance, to a windshield of a vehicle.
  • the autonomous device preferably is mounted in a position where the autonomous device's viewing angle at least in part corresponds to the viewing angle of the sensor 12 or sensors of the autonomous driving system.
  • the autonomous device may comprise a segmenting neural network that implements a semantic segmentation model, which is preferably continuously trained with the output of the new object detector as input to thus enable the neural network to predict new objects encountered before.
  • Fig. 1 is a schematic overview of a perception system providing output for other systems, e.g. a sensor fusion system;
  • Fig. 2 is a schematic representation of a neural network suitable for semantic segmentation
  • Fig. 3 is a schematic illustration of how an object class is assigned to a pixel of the input image pixel matrix depending on the scores in the score maps of the respective object classes
  • Fig. 4 is a schematic overview of a data processing unit for use in a perception system according to figure 1 including an uncertainty detector according to the invention
  • Fig. 5 illustrates training of a neural network based on input data sets comprising images and labels (i.e. desired output, ground truth);
  • Fig. 6 illustrates prediction of a neural network based on an input data set
  • Fig. 7 shows an image as represented by an input data set for the semantic segmentation neural network
  • Fig. 8 shows the prediction, i.e. the semantic segmentation provided by the neural network for the image in figure 5;
  • Fig. 9 shows regions and segments wherein the pixels have a high amount of uncertainty for the image of figure 5;
  • Fig. 10 is a schematic illustration of a system comprising a perception subsystem, a planning subsystem and a control subsystem;
  • Fig. 11 illustrates an unknown object plausibility check based on two independent perception subsystems of one vehicle
  • Fig. 12 illustrates an unknown object plausibility check based on two independent perception subsystems of two vehicles
  • Fig. 13 shows the correlation of the shape of uncertainty with an unknown object
  • Fig. 14 illustrates the determination of a segment with an unknown object
  • Fig. 15 illustrates sourcing of a domain adaptation dataset
  • Fig. 16 illustrates uploading and integration of a domain adaptation dataset
  • Fig. 17 illustrates the determination of classes of unknown objects
  • Fig. 18 illustrates training of a neural network with a domain adaptation dataset to generate a new model that can process new object classes
  • Fig. 19 illustrates training and download of a model for a neural network
  • Fig. 20 illustrates an autonomous device that serves as a safety companion
  • Fig. 21 illustrates generation of a training data set for a new object
  • Fig. 22 illustrates a training data set for a new object
  • Fig. 23 illustrates sharing of a training data set for a new object
  • Fig. 24 illustrates training a segmenting neural network with a training data set for a new object
  • Fig. 25 illustrates the identification of false positive new objects
  • Fig. 26 illustrates a perception system with a secondary system and a primary system for object detection
  • Fig. 27 illustrates transfer learning from a segmenting neural network with an unknown structure
  • Fig. 28 illustrates the training of a variational autoencoder
  • Fig. 29 illustrates uncertainty detection by means of variational autoencoder
  • Fig. 30 illustrates one-shot learning of an uncertainty detector
  • Fig. 31 illustrates the use of multiple parallel redundant uncertainty detectors
  • Fig. 32 illustrates an alternative device for object recognition
  • Fig. 33 illustrates a segmentation model that can be part of an object recognition system
  • Fig. 34 illustrates the use case of action recognition
  • Fig. 35 illustrates a video action recognition model
  • Fig. 36 illustrates a temporal shift module;
  • Fig. 37 illustrates a depth estimation model;
  • Fig. 38 illustrates the use case of risk estimation;
  • Fig. 39 illustrates the use case of action anticipation
  • Fig. 40 illustrates an action forecast model
  • Fig. 41 illustrates the use case of edge case recognition
  • Fig. 42 illustrates cascaded monitoring concept
  • Fig. 43 illustrates a situation monitor
  • Fig. 44 illustrates an autoencoder model
  • Fig. 45 illustrates a convolutional submodule
  • Fig. 46 illustrates a convolutional submodule with dropout
  • Fig. 47 illustrates a Bayesian sampling module
  • Fig. 48 illustrates a Lipschitz submodule
  • Fig. 49 illustrates integration of the situation monitor with a Kalman filter
  • Fig. 50 illustrates the use case of similarity prediction
  • Fig. 51 is an overview over a multi-functional model
  • Fig. 52 illustrates an encoder of a multi-functional model
  • Fig. 53 illustrates segmentation by means of a multi-functional model
  • Fig. 54 illustrates video action recognition by means of a multi-functional model
  • Fig. 55 illustrates depth estimation by means of a multi-functional model
  • Fig. 56 illustrates an autoencoder realized with a multi-functional model
  • Fig. 57 illustrates action forecast by means of a multi-functional model
  • Fig. 58 illustrates how the system can be implemented following the Sense-Plan-Act concept
  • Fig. 59 illustrates a preferred sensor configuration.
  • a perception system 10 as shown in figure 1 comprises a camera 12 that records images by means of an image sensor (not shown).
  • the recorded images can be still images or sequences of images that form frames of a video stream.
  • the input data stream can also be from a LiDAR (stream of 3D point clouds) or from a radar.
  • the camera 12 generates an image pixel matrix that is fed to a data processing unit 14.
  • the data processing unit 14 implements a segmenting neural network 40 (see figures 2 and 3) for semantic segmentation of images.
  • the perception system can be integrated in various devices or machines, in particular in vehicles, for instance autonomous vehicles, e.g. cars.
  • One aspect is an integration of a perception system with an autonomous vehicle.
  • the system that implements the segmenting neural network is a perception system 10.
  • the output of the perception system 10 is provided to a sensor fusion system 24.
  • the neural network as implemented by the data processing unit 14 is defined by a structure of layers comprising nodes and connections between the nodes.
  • the neural network comprises an encoder part formed by convolution layers and pooling layers.
  • the convolutional layers generate output arrays that are called feature maps.
  • the elements of these feature maps (i.e. arrays) represent activation levels that correspond to certain features in the input image pixel matrix.
  • Features generated by one layer are fed to a next convolutional layer generating a further feature map corresponding to more complex features.
  • the activation levels of a feature map correspond to objects belonging to an object class the neural network was trained for.
  • the effect of the convolution in the convolutional layers is achieved by convoluting the input array with filter kernel arrays having elements that represent weights that are applied in the process of convolution. These weights are generated during training of the neural network for one or more specific object classes. Training of the neural network is done by means of training data sets comprising image data (input image pixel matrix 66, see figure 5) and labels 68 that indicate what is represented by the image data (ground truth). The image data represent the input image matrix 66 and the labels 68 represent the desired output. In a back propagation process, the weights and the filter kernel arrays are iteratively adapted until the difference between the actual output of the CNN and the desired output is minimized. The difference between the actual output (segment map 72) of the CNN and the desired output (labeled input image matrix 66) is calculated by means of a loss function 70.
  • the neural network 40 From a training dataset, containing pairs of an input image and a ground truth image that consists of the correct class labels, the neural network 40 computes the class predictions for each pixel in the output image, i.e. the segment map 72.
  • a loss function 68 compares the input class labels 68 with the predictions (i.e. the labels 68' in the segment map 72) made by the neural network 40 and then pushes the parameters - i.e. the weights in the nodes of the layers - of the neural network in a direction that would have resulted in a better prediction.
  • the neural network will learn the abstract concepts of the given classes; cf. figure 5.
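  • As an illustration only, the training procedure described above can be sketched in a few lines of PyTorch-style Python; the tiny stand-in network, the dummy data and all hyperparameters are assumptions and not part of the disclosed system:
```python
import torch
import torch.nn as nn

# Minimal stand-in for the segmenting neural network 40 (three object classes).
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 1),                        # per-pixel scores for 3 classes
)
loss_fn = nn.CrossEntropyLoss()                 # plays the role of loss function 70
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

image = torch.rand(1, 3, 64, 64)                # dummy input image pixel matrix 66
labels = torch.randint(0, 3, (1, 64, 64))       # dummy ground-truth labels 68

for _ in range(10):                             # iterative adaptation of the weights
    scores = model(image)                       # (1, 3, H, W) score maps
    loss = loss_fn(scores, labels)              # difference to the desired output
    optimizer.zero_grad()
    loss.backward()                             # back propagation
    optimizer.step()
```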
  • a trained neural network is thus defined by its topology of layers (i.e. the structure of the neural network), by the activation functions of the neural network's nodes and by the weights of the filter kernel arrays and potentially the weights in summing-up nodes of layers such as fully connected layers (fully connected layers are used in classifiers).
  • the topology and the activation functions of a neural network - and thus the structure of a neural network - are defined by a structure data set.
  • the weights that represent the specific model a neural network is trained for are stored in a model data set.
  • the model data set must fit the structure of the neural network as defined by the structure data set.
  • At least the model data, in particular the weights determined during training, are stored in a file that is called "checkpoint".
  • the model data set and the structure data set are stored in a memory 16 that is part of or is accessible by the data processing unit 14.
  • the data processing unit 14 is further connected to a data communication interface 18 for exchanging data that control the behavior of the neural network, e.g. model data as stored in a model data set or training data as stored in a training data set used for training the neural network.
  • a display unit 20 may be connected to the data processing unit 14.
  • the segmentation will not be shown on a display unit but post-processed into an object list and then provided as input to a sensor fusion system, e.g. a Kalman filter.
  • Information of the absence or presence and - if present - the position of an image representation of an object in the input image pixel matrix can be derived from the segmented input image pixel matrix. This information can be encoded in data and can be used for control or planning purposes of further system components. Accordingly, an output data interface 22 is provided that is configured to pass on data indicating the absence or presence and the position of a recognized object in the input image pixel matrix. To the output data interface 22, a sensor fusion system 24 can be connected. The sensor fusion system 24 typically is connected to further perception systems not shown in the figure.
  • the sensor fusion system 24 receives input from various perception systems 10, each within or associated with one particular sensor such as a camera, a radar or a LiDAR.
  • the sensor fusion system 24 is implemented by a Bayesian filter, e.g. an extended Kalman filter, which processes the inputs of the perception systems 10 as measurements.
  • these inputs are not the segment maps, but post-processed object lists, derived from the segment maps.
  • the extended Kalman filter will associate each measurement with a measurement uncertainty value, which is usually configured statically, e.g. a sensor uncertainty value, configured according to a model of the particular sensor.
  • the uncertainty scores generated by the uncertainty detector 60 can be easily integrated by adding a model uncertainty to the measurements. This is how an uncertainty detector can be integrated with an autonomous vehicle.
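  • One possible way to add such a model uncertainty to a measurement before it enters the extended Kalman filter is sketched below; the patent only states that the uncertainty scores are added to the measurement uncertainty, so the function name, the linear scaling and the diagonal form of the added covariance are assumptions:
```python
import numpy as np

def inflate_measurement_noise(R_sensor: np.ndarray,
                              uncertainty_score: float,
                              scale: float = 1.0) -> np.ndarray:
    """Add a model uncertainty, derived from the per-object uncertainty
    score (0..1), to the statically configured sensor noise covariance."""
    R_model = scale * uncertainty_score * np.eye(R_sensor.shape[0])
    return R_sensor + R_model

# Example: 2D position measurement with a statically calibrated sensor model.
R_sensor = np.diag([0.25, 0.25])
R = inflate_measurement_noise(R_sensor, uncertainty_score=0.8)
```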
Image segmentation
  • the trained segmenting neural network 40 has an encoder - decoder structure as schematically shown in figure 2.
  • the encoder part 42 is a fully convolutional network (FCN), for example a ResNet 101.
  • Alternative implementations could be VGG-16 or VGG-19.
  • the decoder part 44 can be implemented as a DeepLab ASPP (atrous spatial pyramid pooling).
  • the segmenting neural network 40 is a fully convolutional network composed of convolutional layers 46, pooling layers 48, feature maps 50, score maps 52 (i.e. up-sampled feature maps) and one segment map 54.
  • the segmenting neural network 40 is configured and trained for three object classes. Accordingly, on each level, three convolutional layers 46 are provided that each generate a feature map for one object class. Each convolutional layer implements ReLU activation functions in the nodes. Pooling layers 48 reduce the size of the feature map to thus generate an input for a next convolutional layer 46.
  • Feature maps 50 of the trained segmenting neural network each reflect the likelihood that an object of one object class is represented by according elements of the input image pixel matrix. Typically, the feature maps have a smaller dimension than the input image pixel matrix.
  • the feature maps are up-sampled and up-sampled feature maps 52 (herein also called score maps) are generated wherein the elements have score values that reflect the likelihood that a certain pixel represents an object of the object class the segmenting neural network was trained for.
  • by means of the argmax function (a maximum likelihood estimator), each element, i.e. each pixel, is assigned to that object class showing the highest score for that pixel or element, respectively.
  • one segment map 54 is generated from three feature score maps 52.
  • the ReLU activation function applied is a customized ReLU function that allows introducing random dropping of nodes so the segmenting neural network 40 can act as a Bayesian neural network.
  • Figure 3 illustrates that the feature maps 50 and 52 for each object class A, B or C are acting in parallel (and not in series as figure 2 might suggest). If, for instance, the neural network is trained for three objects, three score maps are generated. By means of a SoftMax function, the scores in each feature score map are normalized on a scale between 0 and 1. The normalized scores for corresponding elements of the feature maps can be compared and the element can be assigned to that object class that corresponds to the feature score map having the highest score for the respective element. This is done by means of a maximum likelihood estimator using an argmax function.
  • segment 56 is a segment, where the activation levels of the array elements are higher than the activation levels of corresponding array elements of the score maps for object classes A and B.
  • Segment 58 is a segment, where the activation levels of the array elements are higher than the activation levels of corresponding array elements of the score maps for object classes B and C.
  • each pixel in the segment map 54 is assigned to one object class of all the object classes the segmenting neural network 40 is trained for. Accordingly, there are no non-assigned elements or pixels.
  • Figure 3 might be misleading in that respect because it does not show that all pixels are assigned to a known object class and thus are part of a segment.
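  • A minimal sketch of how one segment map is derived from per-class score maps by softmax normalization and an argmax maximum likelihood estimator; the array shapes and random example data are assumptions:
```python
import numpy as np

def segment_map(score_maps: np.ndarray) -> np.ndarray:
    """score_maps: (num_classes, H, W) raw activation levels.
    Returns an (H, W) map assigning exactly one class index to every pixel."""
    # softmax normalizes the per-pixel scores to the range 0..1
    shifted = score_maps - score_maps.max(axis=0, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=0, keepdims=True)
    # argmax assigns every pixel to the class with the highest score
    return probs.argmax(axis=0)

scores = np.random.rand(3, 4, 4)   # three object classes A, B, C
print(segment_map(scores))         # no pixel remains unassigned
```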
  • Figure 6 illustrates the generation of a segment map 72 from an input image 66 by means of a prediction performed by the segmenting neural network 40.
  • the segmenting neural network 40 is part of the data processing unit 14.
  • Figures 7 and 8 illustrate that all pixels in an input image are assigned to known objects.
  • if the input image 66' - and thus the input image pixel matrix 66' - comprises unknown objects like boxes 90 on the street, these unknown objects 90 are assigned to known objects such as street markings 92.
  • uncertainty scores are generated on the level of pixels (per pixel) of the input image pixel matrix. This can be achieved by provoking or determining forms of variance in the activation levels (scores) of the individual pixels with respect to the object classes.
  • the activation levels (i.e. the score) of the elements in the feature score maps for the individual object classes the neural network is trained for may vary, if frames of a video sequence are regarded or if the semantic segmentation is repetitively performed in multiple passes where varying nodes are dropped.
  • the variance can be temporal - i.e. between different passes or between different frames of a video sequence - or spatial, i.e. within an image pixel matrix, for instance at the edges of segments.
  • the variance can be achieved by setting the activation to zero with a random probability.
  • a dropout operation is inserted into the layers to which dropout shall be applied, so the inner architecture of the layer looks like convolution/dropout/non-linearity.
  • amounts of uncertainty can be determined and quantified by analyzing the amount of variance of activation levels (i.e. scores in the feature score maps) over multiple samples.
  • in a Bayesian neural network, i.e. for instance a convolutional neural network that is randomly modified, e.g. by randomly dropping nodes, the input image matrix is processed in a plurality of passes wherein for each pass some nodes of the convolutional neural network are dropped out.
  • This technique is known as variational inference. Due to the dropping of nodes in some passes, the activation levels vary from pass to pass resulting in an inter-sample per pixel variance. The higher this variance is, the higher is the amount of uncertainty and thus the uncertainty score.
  • Inter-sample variance of activation levels can also arise from using a sequence of pictures of the same scene taken at different points of time. Such sequence can be a frame of a video sequence recorded by a video camera of a moving vehicle. If the neural network is not a Bayesian neural network, per pixel uncertainty can be determined by determining the amount of inter-class variance or a pattern of variance of the activation levels (spatial variance).
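  • The variational-inference variant with Monte Carlo dropout can be sketched as follows; the toy network, the dropout probability and the number of samples are assumptions, and the aggregation of per-class variances into one per-pixel value follows the sum/mean options mentioned further below:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBayesianSegNet(nn.Module):
    """Toy segmenting network with dropout kept active at prediction time."""
    def __init__(self, num_classes: int = 3, p: float = 0.2):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, num_classes, 1)
        self.p = p

    def forward(self, x):
        # inner layer architecture: convolution / dropout / non-linearity
        x = F.relu(F.dropout(self.conv1(x), p=self.p, training=True))
        return self.conv2(x)

model = ToyBayesianSegNet()
image = torch.rand(1, 3, 64, 64)

with torch.no_grad():
    samples = torch.stack([model(image).softmax(dim=1) for _ in range(20)])

var_per_class = samples.var(dim=0)           # inter-sample variance per pixel, per class
uncertainty_map = var_per_class.mean(dim=1)  # aggregated per-pixel uncertainty score
```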
  • Uncertainty can also be determined via a reconstruction error of a generative neural network, for instance by means of a variational autoencoder.
  • For finding unknown objects, i.e. objects the neural network is not trained for or the neural network does not recognize, an uncertainty map is generated.
  • the uncertainty map represents the uncertainty score for each pixel.
  • the uncertainty map is created by analyzing the activation levels of the matrix elements (corresponding to the image pixels) in the feature score maps for the different object classes the segmenting neural network is trained for.
  • the uncertainty score map can be created from analyzing the variance of the activation levels over a plurality of passes when applying e.g. Monte Carlo dropout to the segmenting neural network to thus cause variance between the samples generated with each pass (inter-sample variance).
  • an uncertainty detector may be provided for tracking the variance over the passes and generating an uncertainty map.
  • if unknown objects are represented by pixels of an input image pixel matrix, these pixels - and thus the unknown objects represented by these pixels - can be "found" by determining the uncertainty scores.
  • This can be used to assign segments of the input image pixel matrix to unknown objects and even further to discriminate between different unknown objects in such segment of the input image pixel matrix.
  • the basic steps the system according to the invention performs are: (1) using a segmenting neural network to generate samples of semantically segmented image pixel matrices; (2) analyzing inter-class and/or inter-frame variances in the per-pixel activation levels with respect to the individual object classes the neural network was trained for, i.e. analyzing the per-pixel scores for the individual object classes; (3) determining uncertainty by assessing a level or amount of uncertainty based on the analysis of the inter-class and/or inter-frame variances in the per-pixel activation levels in the semantically segmented image pixel matrix; and (4) identifying segments showing unknown objects from the determined uncertainty.
  • the system further discriminates different unknown objects through analyzing uncertainty and optionally creates new object classes based on analysis of uncertainty of pixels in segments that are determined as to show unknown objects.
  • the data processing unit 14 comprises an uncertainty detector 60, see figure 4.
  • the uncertainty detector 60 can be configured to determine uncertainty from the feature score map 52 provided by the segmenting neural network 40 by way of analyzing the interclass variance of the values of the activation levels before the maximum likelihood estimator, i.e. before the argmax function is applied and the segment map 54 is generated. Before the argmax function is applied, each pixel has activation levels for each known class. By way of the argmax function, the pixel typically is assigned to the class with the highest activation level. Prior to applying the argmax function, interclass variances for each pixel can be determined.
  • the uncertainty detector 60 can comprise a generative neural network 62 implementing a generative model that is based on the same training dataset as the segmenting neural network 40 to reproduce the input image.
  • the generative neural network 62 preferably is a variational autoencoder.
  • the per pixel reconstruction loss between the input image and the image reconstructed by the generative neural network 62 corresponds to the uncertainty.
  • a higher reconstruction loss reflects a higher amount of uncertainty.
  • the uncertainty detector 60 is configured to implement a Bayesian neural network as means for generating inter-sample variance and thus an uncertainty score.
  • the uncertainty detector 60 is preferably configured to apply a customized ReLU function that allows introducing random dropping of nodes so the segmenting neural network 40 can act as a Bayesian neural network. Uncertainty can thus be determined by means of repeating the prediction several times (sampling) under variation of the segmenting neural network 40 (e.g. dropout of nodes) or insertion of noise into the weights or activation functions.
  • a tensor library 64 comprises the customized ReLU used for rendering the segmenting neural network 40 into a Bayesian neural network.
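  • A customized ReLU of this kind could look like the following drop-in module; the drop probability and the rescaling are assumptions made for illustration:
```python
import torch
import torch.nn as nn

class DroppingReLU(nn.Module):
    """ReLU replacement that randomly zeroes activations so an existing
    segmenting network can act as a Bayesian neural network when sampling."""
    def __init__(self, p: float = 0.1):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = (torch.rand_like(x) > self.p).float()   # random dropping of nodes
        return torch.relu(x) * mask / (1.0 - self.p)   # rescale as in dropout

# Replacing every nn.ReLU of a trained network by DroppingReLU enables sampling.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), DroppingReLU(0.1),
                    nn.Conv2d(8, 3, 1))
```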
  • When in use, the data processing unit 14 receives input image pixel matrices 66 as an input. The data processing unit 14 generates a segment map 72 and an uncertainty score map 74 as outputs.
  • Figure 9 illustrates an uncertainty score map 74 generated from the input image shown in figure 7.
  • Figure 8 illustrates the segment map generated from the input picture shown in figure 7.
  • in the segment map shown in figure 8, the pixels that represent the unknown objects 90 (i.e. the boxes on the street) are assigned to various known objects. While this is correct in terms of the function of the segmenting neural network, from a user's perspective the segmenting result is wrong as far as the unknown objects are concerned.
  • Such "wrong" assignment can be detected by means of the uncertainty score of the pixels in the segments (prior to assigning the pixels to the known object classes, e.g. prior to applying the function for classifying each pixel).
  • the amount of variance can be determined by provoking and/or analyzing forms of variance in the activation levels (scores) of the individual elements of the feature score maps for the object classes the segmenting neural network is trained for.
  • the amount of variance can be determined between different samples of a feature score map for one object class, thus determining an inter-sample variance.
  • the amount of variance can be determined between feature score maps for different object classes, thus determining an inter-class variance.
  • the uncertainty detector can be configured to determine prediction uncertainty in various ways.
  • the per-pixel uncertainty scores that indicate prediction uncertainty can for instance be determined:
  • by means of the values of the activation levels before the maximum likelihood estimator, where the uncertainty detector compares the relative activation levels of respective pixels in the feature score maps, e.g. by means of a class-agnostic threshold on the activation levels, by means of class-specific thresholds on the activation levels, or by respective thresholds on the difference between the activation levels of classes, e.g. the top-2 activation levels, where a higher difference indicates lower uncertainty and vice versa;
  • by means of training a generative model on the same training dataset, e.g. a variational autoencoder, to reproduce the input image, taking advantage of the fact that the variational autoencoder will fail to reproduce unknown object classes; from the reconstruction loss a pixel-wise (per pixel) uncertainty score is determined, where a higher reconstruction loss indicates higher uncertainty;
  • by means of variational inference, e.g. realized by the application of Monte Carlo dropout in training mode and in prediction mode of the segmenting neural network in combination with sampling in prediction mode of the segmenting neural network, where the uncertainty detector takes advantage of the fact that unknown object classes (i.e. unknown objects in the input image pixel matrix) will result in higher pixel-wise inter-sample variance between the samples, and higher variance indicates higher uncertainty and vice versa; variational inference can also be realized by the application of Gaussian noise to the values of the weights or the activation functions, with the same effect as Monte Carlo dropout.
  • the uncertainty detector can take e.g. the values of the pixel-wise activation levels in the feature score maps, the reconstruction loss values of the pixel-wise reconstruction loss of a variational autoencoder, or the values of the pixel-wise variance in variational inference or a combination of these for determining uncertainty scores for the elements of the uncertainty score map.
  • Lower activation levels indicate higher uncertainty scores.
  • a lower difference between the activation levels of the top-2 classes or a higher variance over all object classes indicates higher uncertainty scores.
  • a higher variance between the samples in variational inference indicates higher uncertainty scores.
  • in variational inference there are variances per pixel per object class, which the uncertainty detector aggregates into one variance per pixel, e.g. by taking the sum or by taking the mean.
  • Determining segments that represent an object of an unknown object class is done to determine whether the perception system 10 operates within or outside its operation domain.
  • the domain is defined by the object classes the segmenting neural network 40 is trained for.
  • Determining segments that represent an object of an unknown object class thus is part of an out-of-domain detection.
  • the out-of-domain detection is simply marked as domain detection.
  • the detection of segments representing objects of a yet unknown, new object class can be used for creating models representing the new object class and thus for domain adaptation.
  • the uncertainty, derived from the variance or the activation levels is a pixel image, so for each pixel in the segment map exactly one corresponding uncertainty score is found in the uncertainty score map.
  • if the uncertainty detector comprises a Bayesian neural network as means for detecting inter-sample uncertainty and thus uncertainty scores for the pixels or elements, respectively, aleatoric uncertainty will show at the edges of segments (see figure 9, the light colored "frame" around a segment).
  • the origin of aleatoric uncertainty is the fact that in the labelling process, the segments in the training dataset have been labelled manually or semi-automatically by humans, who will label the edges of objects sometimes narrower and sometimes wider.
  • the neural network will learn this variety of labelling styles, and sometimes predict the edges of segments narrower and sometimes wider. This also happens when the samples for variational inference are taken in prediction, leading to uncertainty at the edges of segments.
  • the uncertainty detector preferably is configured to match the regions in the uncertainty map in which aleatoric uncertainty occurs to the edges of the predicted segments in the segment map by means of standard computer vision techniques for edge detection, corner detection, or region detection (aka area detection). Where the aleatoric uncertainty matches the edges of predicted segments, a correct prediction of the segment is indicated. If aleatoric uncertainty is missing at the edge of a predicted segment in the segment map or if aleatoric uncertainty exists within a segment, an incorrect prediction is indicated.
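  • A sketch of such a matching step using standard computer vision operations (here OpenCV morphology); the threshold, the band width and the interpretation of the returned mask are assumptions:
```python
import cv2
import numpy as np

def uncertainty_off_edges(segment_map: np.ndarray,
                          uncertainty_map: np.ndarray,
                          unc_thresh: float = 0.5,
                          band: int = 5) -> np.ndarray:
    """Return a boolean map of uncertain pixels that do NOT lie in a band
    around segment edges; such pixels hint at an incorrect prediction."""
    kernel = np.ones((3, 3), np.uint8)
    # edges of the predicted segments (morphological gradient of the class map)
    edges = cv2.morphologyEx(segment_map.astype(np.uint8),
                             cv2.MORPH_GRADIENT, kernel)
    # widen the edges into a band to tolerate varying labelling styles
    edge_band = cv2.dilate(edges, np.ones((band, band), np.uint8)) > 0
    uncertain = uncertainty_map > unc_thresh
    return uncertain & ~edge_band   # aleatoric uncertainty on the band is expected
```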
  • if the uncertainty detector comprises a Bayesian neural network as means for detecting inter-sample uncertainty and thus for generating uncertainty scores for the pixels or elements, respectively, epistemic uncertainty will show inside of segments (see figure 9, the light colored segments 90). The other approaches for detecting uncertainty will likewise show uncertainty inside of segments.
  • Epistemic uncertainty directly indicates an incorrect segment prediction, i.e. an incorrect segmentation in the segment map.
  • the uncertainty detector is configured to determine regions of uncertainty e.g. by means of standard computer vision techniques for edge detection, corner detection, or region detection (aka area detection). Besides the fact that there is uncertainty indicated by pixels in the uncertainty score map and that these pixels gather at the edges of segments of the segment map or within uncertain segments of the segment map, the uncertainty values of these per pixel uncertainties also vary within segments.
  • the uncertainty detector can comprise a small classifying neural network, i.e. a classifier, that is trained with the uncertainty score map in combination with a labelled training dataset.
  • object classes are used that do not belong to the original training domain (as defined by the object classes the segmenting neural network is originally trained for) of the original segmenting neural network.
  • the uncertainty detector optimizes the matching of the determined regions of uncertainty of the uncertainty score map where the elements of the uncertainty score map have high uncertainty scores to the real shapes of the incorrectly predicted segments in the segment map.
  • the uncertainty score map is segmented in accordance with the segment map.
  • the uncertainty detector plausibilizes the regions determined by this classifier with the regions determined by means of the standard computer vision techniques.
  • Fig. 14 illustrates the prediction by a neural network and the measurement of prediction uncertainty. From the input image on the left side, the segmenting neural network computes the class predictions for each pixel on the right side and the uncertainty detector computes uncertainty value predictions for each pixel. Fig. 13 shows the correlation of the shape of uncertainty with an unknown object.
  • the uncertainty detector can be implemented in different ways. For a Bayesian neural network, prediction uncertainty can be determined by means of repeating the prediction several times (sampling) under variation of the segmenting neural network (e.g. dropout of nodes) or insertion of noise into the weights or activation functions.
  • the activation levels in the score maps (after the softmax function but before the maximum likelihood estimator (argmax)) over all the samples will then have a high inter-sample variance for the parts of the input which do not correspond to the training data and a low variance for the parts of the input which correspond well to the training data.
  • the uncertainty detector interacts with the segmenting neural network and randomly modifies the segmenting neural network and in particular at least some nodes of the convolutional layers of the segmenting neural network.
  • the uncertainty detector can for instance be configured to modify at least some of the ReLU activation functions of the convolutional layers to thus cause dropping of nodes or to modify values of weights in some nodes.
  • the uncertainty values can be determined from only one sample by way of determining the inter-class variance between the softmax scores over the score maps (every pixel has a softmax score in every score map).
  • This inter-class variance can be determined e.g. as the difference between the top-2 softmax scores, i.e. the difference between the largest softmax score and the second largest softmax score, or as the variance between the softmax scores over all object classes.
  • a threshold could be applied to these inter-class variance values. This embodiment does not require that the uncertainty detector modifies the segmenting neural network.
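  • A sketch of this single-sample, inter-class approach using the top-2 softmax difference; the example data and the mapping of the margin to an uncertainty value are assumptions:
```python
import numpy as np

def top2_uncertainty(softmax_scores: np.ndarray) -> np.ndarray:
    """softmax_scores: (num_classes, H, W) per-pixel softmax scores.
    A small margin between the two largest scores yields a high uncertainty."""
    ordered = np.sort(softmax_scores, axis=0)      # ascending along the class axis
    margin = ordered[-1] - ordered[-2]             # top-2 difference per pixel
    return 1.0 - margin                            # lower margin -> higher uncertainty

scores = np.random.dirichlet(np.ones(3), size=(4, 4)).transpose(2, 0, 1)
uncertainty_map = top2_uncertainty(scores)         # (H, W), values in 0..1
```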
  • the uncertainty detector can comprise a generative neural network, e.g. a variational autoencoder, to recreate the input image and measure the uncertainty by exactly the same means as described above or by determining a reconstruction loss.
  • the generative neural network e.g. a variational autoencoder, implements the same model (i.e. is trained for the same objects or classes) as the segmenting neural network.
  • if the uncertainty detector is configured to implement a Bayesian neural network as means for generating inter-sample uncertainty, the process of sampling consumes a lot of time, as a statistically relevant number of samples is needed for a reliable result.
  • the uncertainty detector preferably is configured to compute the inter-sample variance over subsequent instances (for instance subsequent frames of a video sequence), while sampling each instance only one or a few times. If the instances are input image pixel matrices corresponding to frames of a video sequence recorded by a vehicle, this is possible because the pixels of these instances correspond to each other within a variable amount of shift, rotation, and scale due to the movement of the vehicle.
  • the uncertainty detector is configured to match single pixels or regions of pixels of one instance to corresponding pixels or regions of pixels of subsequent instances to determine the inter-sample variance between the feature score values of these pixels.
  • Regions in an uncertainty score map generated by the uncertainty detector that exhibit elements with a high uncertainty score are candidates for segments that represent an unknown object, i.e. not the object the segment is assigned to in the segment map and thus not an object belonging to an object class the segmenting neural network is trained for.
  • the uncertainty detector preferably is configured to further plausibilize a segment determined as representing an unknown object by means of communication with other vehicles in the vicinity of a scene.
  • the uncertainty detector will map the scene to a global coordinate system, which is e.g. the coordinate system of the HD map used by the vehicle for localization.
  • the locations of objects in the global coordinate system are detected that correspond to segments in the segment map corresponding to known or to unknown objects. If segment maps generated from input image pixel matrixes recorded from different locations are compared, it is possible to compare segments that are candidates for representing an unknown object.
  • Input image pixel matrices from different locations can originate for instance from two different cameras of one vehicle or from the cameras of two different vehicles, see figures 11, 12 and 13.
  • the uncertainty detector of a first vehicle will send the coordinates of the unknown object to other vehicles in the vicinity of the scene, so another uncertainty detector within a receiving vehicle can match the segment received to the corresponding segment of the first vehicle. A match will increase the probability that the object is indeed of an unknown object class. A missing match will decrease this probability.
  • the uncertainty detector uses this means of plausibilization only if the potential unknown object is sufficiently far away so the entire procedure of vehicle-to-vehicle plausibilization takes no more time than the fault-tolerant-time-interval until it would be too late to initiate the minimal risk maneuver.
  • this means of plausibilization can always be applied.
  • the 3D position of the potential unknown object is required, which the uncertainty detector determines e.g. by means of monocular depth estimation, or by inference from motion.
  • the uncertainty detector within one vehicle will send information identifying the segment of the unknown object to other vehicles which do not have to be in the vicinity of the scene.
  • the uncertainty detector in the receiving vehicle will patch the received segment into an unrelated scene observed by the other vehicle and compute the uncertainty. If the segment has high uncertainty in the unrelated scene, the uncertainty detector increases a probability value indicating the probability that the object is indeed of an unknown object class. If the identified segment has low uncertainty in the unrelated scene, the uncertainty detector will decrease the probability value reflecting the probability that the object is indeed of an unknown object class and instead assign the object class determined with low uncertainty to the object. In the case where this computation would take too much time, the uncertainty detector will perform this computation only locally within one vehicle, and the unrelated scene will be from an unrelated dataset, allowing for the further plausibilization with respect to the labels provided with this dataset.
  • the uncertainty detector preferably will create a dataset of new object classes from the unknown objects identified by the uncertainty detector.
  • the dataset will be comprised of the sensor input data - i.e. an input image pixel matrix, corresponding to the unknown object, together with a matrix of plausibilized uncertain pixels (in case of video input) or points (in case of LiDAR input) as labels, see figure 14.
  • the uncertainty detector is configured to group instances of yet unknown objects into candidate object classes by means of unsupervised segmentation.
  • the uncertainty detector is also configured to determine possible candidate object classes by means of one-shot learning where a visual input is mapped to a feature vector where each feature has an intrinsic meaning and an intrinsic relationship to other features, with features being e.g. descriptions in natural language; see figure 18.
  • a vehicle control system comprising a video camera, a semantic segmentation system connected to the video camera, a vehicle-to-vehicle (V2V) communication means allowing vehicle-to-vehicle communication for exchanging object class definitions/representations
  • the identification of yet unknown objects will be performed by the uncertainty detector on the device, e.g. a sensor where our technology is integrated with the neural network performing a computer vision task.
  • the uncertainty detector will record the sensor input data corresponding to the unknown object, together with the matrix of plausibilized uncertain pixels or points as labels.
  • the device implementing the uncertainty detector needs vehicle-to-vehicle connectivity, provided by other in-vehicle systems.
  • the creation of the dataset of new object classes can be performed either by the uncertainty detector on the device or in the cloud. Therefore, the device needs vehicle-to-infrastructure communication, see figure 13.
Detailed description of domain adaptation
  • Domain adaptation serves for extending the operation domain of the perception system.
  • Domain adaptation preferably includes out-of-domain detection, for instance by way of identification of objects belonging to a yet unknown object class.
  • Domain adaptation preferably includes an adaptation or creation of a segmenting neural network so objects of one or more new object classes can be predicted by the segmenting neural network.
  • Figures 15 to 19 illustrate domain adaptation and in particular the process of enabling a segmenting neural network to predict objects of a new object class.
  • Figures 18 and 19 illustrate a continuous domain adaptation.
  • Domain adaptation can be performed either by the uncertainty detector on the device or in the cloud, see figures 16 and 17. This can happen by means of various, well-known techniques, in the simplest case by means of re-training the trained segmenting neural network with the newly determined object classes from the dataset of new object classes.
  • re-training of the segmenting neural network means updating the file containing the values of the weights in a persistent memory. During re-training, this file is updated, and the updated file can be loaded by the device, e.g. in the next on-off-cycle of the device. If re-training happens in the cloud, the updated file will be downloaded from the cloud to the persistent memory of the device by the software-update function of the device.
  • the operational design domain is the domain in which the system and in particular the perception subsystem can operate reliably.
  • the operational design domain is inter alia defined by the object classes the perception subsystem can discriminate.
  • Domain adaptation means that for instance the perception subsystem is newly configured or updated to be capable of discriminating further object classes that occur in the environment the perception subsystem is used in.
  • Data configuring the neural network - i.e. data defining the weights and the activation functions of the nodes in the layers and of the filter kernel arrays - define a model that is represented by the neural network.
  • Downloading a model thus means downloading configuration data for the neural network so as to define a new model, e.g. a model that is capable of discriminating more object classes.
Updating a model with configuration data from another model
  • When the program instantiates a neural network, this neural network is usually uninitialized. There are then two modes of operation.
  • When the neural network shall be trained, the program will instantiate a neural network model from a software library; its weights will be initialized with random values, and the training procedure will gradually adapt the values of these weights.
  • When the training is over, the values of the weights of the neural network will be saved in a file that corresponds to the architecture (structure, topology) of the neural network and to a generic storage file format, depending on the software library used for implementing the neural network, such as TensorFlow (by Google), PyTorch (by Facebook), Apache MXNet (by Microsoft), or the ONNX format (Open Neural Network Exchange).
  • Such a checkpoint represents the model the neural network is trained for.
  • the program will again instantiate a neural network model from that software library, but for prediction, its weights will not be initialized with random values, but with the stored values for the weights from the checkpoint file.
  • This neural network can then be used for prediction.
  • the checkpoint file always comprises data, which will be loaded by the program on every start of an on-off-cycle of the system.
  • the checkpoint file comprises data configuring the neural network - i.e. data defining the weights and the activation functions of the nodes in the layers and of the filter kernel arrays - which define the model that is represented by the neural network.
  • Downloading a model thus means downloading the checkpoint file containing configuration data for the neural network so as to define a new model, e.g. a model that is capable of discriminating more object classes.
  • a program instantiates a neural network model, which - this time - is not initialized with random data for training, but with the values of weights from the checkpoint file. Then the training procedure is started with the new domain adaptation dataset. Accordingly, this training does not start from scratch but from a pre-trained state.
  • where the architecture of the neural network has to be extended for the new object classes, the new convolutional kernels will be initialized with random data. If, on the other hand, the new object classes are so similar to the object classes already known that it can be expected that they will generalize to the object classes already known, there is no need to change the architecture of the neural network. For training, in both cases, the values of the weights of most layers in the neural network would be frozen (i.e. not amended) and only the last few layers would be trained for the adaptation. To update the system with the new segmenting neural network, a new checkpoint file is provided; a minimal sketch of this procedure follows below.
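  • A minimal sketch of this domain adaptation step, assuming a PyTorch-style checkpoint file and a toy network; file names, layer sizes and the choice of trainable layers are assumptions:
```python
import torch
import torch.nn as nn

# Toy stand-in for the segmenting network trained on 3 known classes.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 1),                            # head for the 3 known classes
)
torch.save(model.state_dict(), "checkpoint_old.pt") # the pre-trained "checkpoint"

# Domain adaptation starts from the pre-trained state, not from scratch.
model.load_state_dict(torch.load("checkpoint_old.pt"))

# One additional object class: the new convolutional kernels start with random data.
model[2] = nn.Conv2d(16, 4, 1)

# Freeze most layers; only the last layer(s) are trained for the adaptation.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("2.")
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
# ... run the usual training loop with the domain adaptation dataset here ...

torch.save(model.state_dict(), "checkpoint_new.pt") # updated checkpoint file
```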
  • Additional training can be achieved by means of few shot learning (i.e. limited number of training data sets for a new object class).
  • In some applications, the segmenting neural network 40 is not directly accessible, e.g. in an autonomous driving system comprising a (primary) segmentation system.
  • In such a case, a separate autonomous device 10', for instance a mobile device such as a smartphone, is provided in addition to the primary segmentation system.
  • the autonomous device 10' comprises a sensor, for instance a camera 12', for generating an input image pixel matrix, a segmenting neural network 40' (for instance a segmenting Bayesian neural network) and an uncertainty detector 60' that is configured to determine regions with matrix elements (corresponding to pixels of the input image pixel matrix) that exhibit an uncertainty score above a threshold; see figure 20.
  • the uncertainty score map is composed of elements that correspond to the pixels of the input image pixel matrix (66), each element of the uncertainty map having an uncertainty score that is determined by the uncertainty detector 60' and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map 72' generated by the autonomous device 10'.
  • the uncertainty score map may comprise regions with matrix elements that exhibit an uncertainty score above a threshold and thus represent edge cases for the object classification and image segmentation.
  • the autonomous device 10' can be used stand alone or as a second segmentation system.
  • the autonomous device's 10' segmenting neural network 40' is trained with the same classes as the segmenting neural network 40 of the primary image segmentation system.
  • a (primary) image segmentation system of an autonomous driving system typically does not provide means for uncertainty detection while the autonomous device 10' comprises an uncertainty detector 60'. Accordingly, the autonomous device 10' can act as a safety companion that can generate a user perceivable warning signal when a region in the uncertainty score map that is composed of elements having a high uncertainty score is found in a region of interest of the input image pixel matrix that represents the street in front of a vehicle.
  • a region in the uncertainty score map that is composed of elements having a high uncertainty score typically represents an unknown object and thus an edge case with respect to object recognition.
  • the autonomous device may comprise a new object detector 80 that is operatively connected to the uncertainty detector 60'.
  • the new object detector 80 is configured to find a region in the uncertainty score map that is composed of elements having a high uncertainty score and generate a new object class for such found region.
  • the autonomous device 10' can be configured to exchange labeled data sets comprising segments assigned to a newly generated object class with other autonomous devices 10' to thus increase the number of known object classes available for semantic segmentation by the autonomous devices.
  • the autonomous device may even record a user's (e.g. a driver's) reaction in response to a warning signal of the autonomous device.
  • the type of reaction (or the absence of any reaction) can be used as input data for discriminating relevant unknown objects from less relevant unknown objects.
  • the type of reaction (or the absence of any reaction) can also be used for automatically generating a label for a new object class.
  • An emergency stop or an evasive maneuver as a driver's reaction indicate a high relevance of the unknown object.
  • An absence of a user's reaction indicates a low relevance.
  • An automatically generated label may reflect the degree of relevance of a new object class.
  • the autonomous device 10' can learn to discriminate unknown objects by automatically generating new object classes. Furthermore, the autonomous device 10' can learn a relevance level for each (new) object class and thus can adapt the warning signal level to the object class. If the autonomous device 10' determines the presence of an object (by way of image segmentation) in a region of interest of the input image pixel matrix, the warning signal can be generated in dependence of the relevance of the recognized object.
  • the autonomous device 10' can be configured to exchange data sets comprising data representing the relevance level of a known object class or an observed user reaction (observed behavior) with other autonomous devices.
  • the autonomous device may comprise a data interface 82, in particular a wireless data interface 82 that allows data exchange over e.g. the internet.
  • the autonomous device 10' preferably is a pocketable, mobile device a user easily can carry and that can easily be mounted for instance to a windshield of a vehicle.
  • the autonomous device 10' is mounted in a position where the autonomous device's viewing angle at least in part corresponds to the viewing angle of the sensor 12 or sensors of the autonomous driving system.
  • the autonomous device may comprise a segmenting neural network 40' that implements a semantic segmentation model, which is preferably continuously trained with the output of the new object detector 80 as input to thus enable the neural network to predict new objects encountered before.
  • the output of the new object detector 80 is saved as a label of a dataset that further comprises the corresponding input image pixel matrix and thus serves as a training data set, see figures 21, 22 and 23.
  • the training data set can be transmitted to other autonomous devices over the internet, using the mobile data interface 82 of the autonomous device 10', i.e. the mobile phone.
  • the autonomous device 10' preferably is adapted for training and thus updating the semantic segmentation model implemented by the segmenting neural network 40' using training data sets comprising new objects as received from other autonomous devices.
  • the autonomous device 10' can be enabled to predict edge cases encountered before by other, similar autonomous devices, see figure 24.
  • the segmenting neural network 40' of the autonomous device 10' implements a model which is trained with the output (i.e. the training dataset generated by the new object detector) of the new object detector 80 as input, and with the recorded user reaction as secondary input, in order to learn to predict the user's reaction when encountering the particular new object.
  • a source 84 provides the data representing the action performed by the user, i.e. the driver.
  • a user's reaction when encountering the particular new object can be used for generating a new object class and a label for the new object class.
  • the training data set together with data representing the action performed by the user can be transmitted to other autonomous devices over the internet, using the mobile data interface 82 of the autonomous device 10'.
  • the autonomous device 10' can be mounted behind the windshield of a vehicle with the autonomous device's camera facing in the direction of driving while executing a program that comprises an autonomous driving stack, connected to the output of the camera of the autonomous device as input, and providing a trajectory determined by the autonomous driving stack to the autonomous driving system of the vehicle over a connection between the autonomous device and the vehicle communications bus.
  • the autonomous device 10' can be mounted behind the windshield of a vehicle with the camera facing in the direction of driving, executing a program that implements a prediction system of an autonomous driving stack, connected to the output of the camera of the autonomous device as input, and providing an object list determined by a perception system to the autonomous driving system of the vehicle over a connection between the autonomous device and the vehicle communications bus.
  • the output of the new object detector 80 is a label with which the unknown object is labelled as a new object, and everything else is labelled as background and assigned to an ignore class.
  • the sensor input together with the label is a training dataset.
  • Elements assigned to the ignore class will not cause a loss when the loss-function is applied during training of the semantic segmentation model with the training data set comprising the new object.
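  • In PyTorch-style training code, such an ignore class can be realized with the ignore_index argument of the loss function; the label values and tensor shapes below are assumptions:
```python
import torch
import torch.nn as nn

IGNORE = 255          # label value of the ignore class (assumed convention)
NEW_OBJECT = 1        # label value of the newly found object class

# Elements labelled with the ignore class do not contribute to the loss.
loss_fn = nn.CrossEntropyLoss(ignore_index=IGNORE)

scores = torch.randn(1, 2, 8, 8, requires_grad=True)       # background + new object
labels = torch.full((1, 8, 8), IGNORE, dtype=torch.long)   # everything else: ignored
labels[0, 2:5, 2:5] = NEW_OBJECT                           # plausibilized uncertain pixels

loss = loss_fn(scores, labels)    # only the new-object pixels cause a loss
loss.backward()
```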
  • the training data set generated by the autonomous device 10' can be easily shared with other devices, without having to shift a massive amount of data since the training data set only comprises data that is relevant with respect to the new object class.
  • the autonomous device 10' can easily receive training data sets for new object classes from other autonomous devices as an input, see figure 25.
  • false positive new objects, i.e. objects that are not new but already known but may have caused a high uncertainty score on the pixel level due to other circumstances, can be avoided due to context, by means of determining the uncertainty of the new object when inserted into a known context by the autonomous driving system or another autonomous device.
  • a perception system 90 may be provided where a secondary system 92 and a primary system 94 of a vehicle are combined to form a system that can determine regions with elements exhibiting a high uncertainty level even in matrices provided by the primary system.
  • This is important because the primary system can be a proprietary system and thus a black box that is inaccessible from the outside. Accordingly, the primary system may not be modified for generating variance, e.g. by means of Monte Carlo dropout, to thus determine uncertainty scores.
  • the additional secondary system may, however, be configured similar to the autonomous system 10' disclosed herein before and thus is capable of identifying regions with elements exhibiting a high uncertainty score.
  • the primary system 94 comprises a segmenting neural network 40 and an object list generator 96 that generates a list of objects corresponding to the segments generated by the segmenting neural network 40.
  • the objects are aligned, associated, fused and managed in a sensor fusion system 98.
  • the sensor fusion system 98 is implemented by a Bayesian filter, which is a probabilistic robotics algorithm. This algorithm indicates its level of uncertainty to its successor, e.g. by means of the Kalman Gain. Based on this uncertainty, the successor will choose either the sensor input or the prediction made by the Bayesian filter.
  • the primary system 94 is not Bayesian, so the Bayesian filter in the sensor fusion system 98 assumes a static uncertainty, calibrated only once for each sensor. With the secondary system 92, the perception system 90 is enabled to provide uncertainty, see figure 26.
  • the secondary system 92 preferably comprises multiple parallel redundant uncertainty detectors 60, e.g. a Bayesian neural network, a variational autoencoder, an object detector 80 trained with edge cases already encountered, and a new object detector 80 trained with new objects encountered but to be suppressed.
  • the output of multiple parallel redundant uncertainty detectors in a parallel redundant architecture is matched on a per pixel basis, e.g. by the per pixel sum or by the per pixel maximum over the uncertainties or by means of a Bayesian filter, e.g. a particle filter.
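  • A per-pixel matching of the detector outputs by sum or maximum can be sketched as follows; the random example maps are assumptions:
```python
import numpy as np
from typing import List

def fuse_uncertainty(maps: List[np.ndarray], mode: str = "max") -> np.ndarray:
    """Match the outputs of parallel redundant uncertainty detectors per pixel,
    either by the per-pixel maximum or by the per-pixel sum."""
    stack = np.stack(maps)                           # (num_detectors, H, W)
    return stack.max(axis=0) if mode == "max" else stack.sum(axis=0)

bayesian_map = np.random.rand(4, 4)                  # e.g. from MC dropout sampling
vae_map = np.random.rand(4, 4)                       # e.g. from reconstruction loss
fused = fuse_uncertainty([bayesian_map, vae_map], mode="max")
```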
  • aleatoric uncertainty is matched with the segment boundaries in a semantic segmentation by means of a rule-based system.
  • pixel uncertainties are aggregated into segment uncertainties, e.g. by means of a threshold over a sum or a variance over pixel uncertainties.
  • a semantic segmentation model is trained with a new object training dataset as input to the semantic segmentation model and to feed the output of the semantic segmentation model and the outputs of one or multiple parallel redundant uncertainty detectors as input to a supervisor model that is trained to predict a semantic segmentation with two classes, indicating if an entire segment is considered correctly or incorrectly predicted.
  • a system as shown in figure 27 may be provided.
  • the system is configured to determine the loss between a segment map 54A generated by the unknown segmenting neural network from the input image pixel matrix of a training data set and a segment map 54B provided with the training dataset.
  • the segment map 54B provides the labels for the input image matrix as generated by the segmenting neural network that has generated the training dataset.
  • the system shown in figure 27 is configured to determine the loss (determined by loss function 76) between the segment map 54A generated by the unknown segmenting neural network and the segment map 54B belonging to the training dataset.
  • a high loss indicates where the labels provided with the training dataset differ from the labels (i.e. the segment predictions) generated by the unknown segmenting neural network.
  • Elements or pixels that exhibit a high loss can be assigned to an ignore class to thus avoid that these image parts can affect the training of the Bayesian segmenting neural network by way of transfer learning with a training dataset as for instance generated by an autonomous system as disclosed above.
  • elements assigned to the ignore class by means of loss function 76 will be ignored by the loss function 78 that is used for training the Bayesian segmenting neural network 40.
  • the Bayesian segmenting neural network 40 can, for instance, be the segmenting neural network of the autonomous device 10' while the unknown trained segmenting neural network can be the segmenting neural network of an autonomous driving system.
  • Figure 28 illustrates the training of a variational autoencoder 62', by way of transfer learning with a dataset as input to a (known or unknown) trained segmenting neural network.
  • the dataset is used as input to the variational autoencoder.
  • the dataset used as input for the variational autoencoder is modified by computing a loss for the pixels predicted by the trained segmenting neural network and assigning pixels with a high loss to the ignore class when training the variational autoencoder.
  • the variational autoencoder 62' may implement a known model of a trained segmenting neural network.
  • Figure 29 illustrates uncertainty detection by means of variational autoencoder 62' (as a generative neural network) in order to determine pixels with a high loss due to uncertainty.
  • if an input dataset (for instance an input image pixel matrix) that comprises an unknown object is fed to a trained neural network (e.g. for semantic segmentation), the pixels representing the unknown object should exhibit a high uncertainty score.
  • the input dataset 66 is fed as an input to the variational autoencoder 62' and to a loss function 80.
  • the prediction 82 of the variational autoencoder 62' is also fed to the loss function 80 in order to determine the loss between the input image pixel matrix data set 66 that may comprise data representing an unknown object and the prediction 82 (output data set) of the variational autoencoder 62'.
  • a loss for the pixels predicted by the trained neural network can be computed and an uncertainty score map 74' can be generated accordingly.
  • the uncertainty score map 74' comprises a segment with pixels having a high uncertainty score where pixels of the input image pixel matrix 66 represent an unknown object (yellow box on figure 29).
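  • The per-pixel reconstruction loss described here can be computed as sketched below; the tiny autoencoder is only a stand-in for the trained variational autoencoder 62' (which in practice would be loaded from its checkpoint), and the squared-error loss is an assumption:
```python
import torch
import torch.nn as nn

# Tiny stand-in for the (trained) variational autoencoder 62'.
vae = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 3, 3, padding=1), nn.Sigmoid(),
)

image = torch.rand(1, 3, 64, 64)        # input image pixel matrix 66
with torch.no_grad():
    reconstruction = vae(image)         # prediction 82

# Per-pixel squared reconstruction error, averaged over the color channels,
# serves as the uncertainty score map 74': unknown objects reconstruct poorly.
uncertainty_map = ((image - reconstruction) ** 2).mean(dim=1)   # (1, H, W)
```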
  • For training the variational autoencoder (see figure 28), pixels with a high loss can be ignored.
  • the variational autoencoder used for determining uncertainty (i.e. pixels with a high loss) is configured as a Bayesian neural network that applies variation, for instance by way of Monte Carlo dropout.
  • the reliability of the loss determined with the help of the variational autoencoder can be determined by way of variational inference.
  • Figure 30 illustrates the use of one-shot learning in order for an uncertainty detector to suggest the most likely class for a pixel belonging to an unknown object by similarity to the known classes. This approach can also be used for automatically generating labels for new object classes.
  • Figure 31 illustrates that multiple parallel redundant uncertainty detectors 60 can be executed in a parallel redundant architecture, with all uncertainty detectors 60 sharing one common encoder network.
  • the device 100 can be a smartphone that can be mounted behind the windshield of a vehicle or held in hand.
  • the device 100 can be equipped with one or more cameras 102 for generating a video stream 104 that is fed to a neural engine 106 or a similar processing unit.
  • the neural engine 106 is configured to generate an object list 108.
  • an output terminal 110 is provided.
  • the output terminal can for example be a USB terminal (Universal Serial Bus) or an Ethernet terminal.
  • the output terminal 110 can also be a wireless terminal using wireless local area network (WLAN, WiFi) protocol or a Bluetooth interface.
  • input terminals may be provided in addition or as an alternative to the camera or cameras 102.
  • the input terminal may be a universal serial bus terminal (USB terminal), an Ethernet terminal, a wireless local area network terminal or a Bluetooth terminal.
  • Such input terminal enables the device 100 to receive a video stream from one or more external cameras, devices or smartphones connected to the device 100.
  • the device 100 can be configured to generate the object list (by means of the neural engine 106) and provide it to further devices.
  • the object list 108 may comprise positions with two-dimensional or three-dimensional coordinates of objects; preferably, the coordinates localize an object in the camera coordinate system. Further, each position is preferably annotated with a class, e.g. the class of a recognized object.
  • the position may be annotated with a time interval and the position may be further annotated with an uncertainty score.
  • the list of objects and positions 108 comprises for each recognized object an identifier for the object, coordinates that identify the position of the recognized object in the camera coordinate system, a time stamp providing information about the time interval when the recognized object was at the position identified by the coordinates and an uncertainty score that provides information about how reliable the object recognition is with respect to the particular recognized object.
  • the position of the recognized object can for instance be encoded by a polygon around the object in polar coordinates. Since the objects are recognized in frames of a video stream, it is also possible to determine how the position of an object changes from frame to frame. This makes it possible to generate trajectories that combine positions of two-dimensional or three-dimensional coordinates with the optional annotations and a time in the future. Such a trajectory describes where a certain object is expected to be at that time in the future.
  • the annotated list of objects and positions can be forwarded to another, connected device by any one of the interfaces mentioned earlier, i.e. USB, Ethernet, WLAN and/or Bluetooth.
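The annotated object list 108 described above could, for example, be represented by one record per recognized object. The following sketch is a minimal, assumed data structure; the field names are illustrative and not prescribed by the description.

```python
# Sketch of one entry of the annotated object list 108. The description only requires
# an identifier, coordinates, a time stamp/interval, a class annotation and an
# uncertainty score; the trajectory field corresponds to the optional forecast.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class DetectedObject:
    object_id: int
    polygon: List[Tuple[float, float]]   # position encoded as a polygon (e.g. polar coordinates)
    object_class: str                    # e.g. "bicyclist"
    timestamp: float                     # time (interval) of the observation
    uncertainty: float                   # reliability of the classification
    trajectory: Optional[List[Tuple[float, float, float]]] = None  # (x, y, t) expected future positions
```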
  • the device 100 further may be adapted to provide a video output 112 and/or to provide acoustic signals or messages via an audio output 116.
  • the device may further be configured to trigger start or stop of recording.
  • Figure 32 illustrates a configuration of the neural engine that is suitable for object recognition.
  • Neural engine 106 implements a segmenting neural network 120 with an encoder module 122 and a decoder module 124.
  • the encoder 122 comprises a down-sampling module 126 including an input layer for down-sampling an input image pixel matrix (for instance a frame of a video stream).
  • the layers for down-sampling the input image pixel matrix may implement variational inference (as described herein further above) with partial model sampling and/or with consecutive video sampling and/or with a Lipschitz constraint for spectral normalization of the softmax values.
  • the layers for down-sampling may be configured to process simultaneous inputs of a number of different points in time. This can be achieved by a temporal shift of video frames or input image pixel matrices, respectively.
  • the encoder 122 also provides a feature extraction module 128.
  • the encoder layers for feature extraction may implement temporal shift and/or variational inference, optionally with partial model sampling and/or with consecutive video sampling and/or with a Lipschitz constraint for spectral normalization of the softmax values.
  • the results of the feature extraction are provided to a feature fusion module 130 of the decoder 124 of the segmenting neural network.
  • Feature fusion preferably is achieved by means of a Kalman filter, wherein feature matching is achieved via the feature position as determined by the feature extraction module 128 of the encoder module 122.
  • the feature tensors generated by the feature extraction module 128 of the encoder 122 capture spatial details of the input image pixel matrices.
  • the classifier module 132 of the decoder classifies each pixel of the input image pixel matrix by assigning each pixel to one of the object classes the segmenting neural network is trained for.
  • the segmenting neural network as shown in figure 32 can be used for object recognition and is capable of recognizing objects (i.e. determining segments with pixels that belong to a certain object class the segmenting neural network was trained for) and of encoding the segment with a polygon around the recognized object.
  • the polygon is annotated with a class of the recognized object.
  • the segmenting neural network of figure 32 comprises an encoder- decoder architecture and receives an image (i.e. a frame) from an input video stream.
  • the segmenting neural network generates one feature map per object class with a softmax value of pixel-wise segment predictions.
  • the encoder 122 of the semantic segmentation model of figure 33 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix and a feature extraction module 128 for feature extraction. As indicated by the dashed lines, the feature extraction module 128 may be configured to output information regarding the context and/or information regarding spatial detail. Information regarding context and/or spatial detail can be fed to the feature fusion module 130 of the decoder to thus allow position and/or context based feature fusion.
  • the feature fusion module 130 of the decoder is optional and preferably is provided in case of multiple inputs from different layers of the encoder. Preferably, at least one input is directly provided from the last layer of the down-sampling module 126 of the encoder 122. Further, the feature fusion module may be provided with one or more inputs from the inner layers of the feature extraction module.
  • the semantic segmentation module as shown in figure 33 may receive a secondary input from a depth estimation module as illustrated in figure 37.
  • the semantic segmentation module may also implement polygon regression.
  • the annotated list of objects may comprise an annotation with a relative uncertainty score indicating the reliability of the classification and potentially hinting to a yet unknown class of object in case the uncertainty score is high.
  • the object list generated by the segmenting neural network as shown in figure 33 may comprise annotations that identify a proposed similar class of object in case of a high degree of uncertainty.
  • neural engine 106 is configured to extract features from one or more input image pixel matrices, i.e. frames of video streams.
  • the models implemented by the neural networks of the neural engine 106 may be configured to extract one or more of the following features:
    - classes of objects (figures 32 and 33)
    - classes of gestures (figures 34, 35, 36 and 37)
    - classes of actions
    - classes of awareness (figure 38)
    - and/or forecasts of classes of action (figures 39 and 40)
  • the neural engine 106 implements one or more neural networks that preferably have an encoder-decoder architecture.
  • the input data set fed to a respective input layer of a down-sampling module of the encoder of the respective neural network depends on the feature to be extracted.
  • the input data sets are input image pixel matrixes (frames) of one or more video input streams provided by one or more cameras.
  • if the features to be extracted are classes of gestures signaled by a person or a vehicle, preferably encoded by a polygon around the object, a neural network implementing a video action recognition model as illustrated in figure 35 can be used, optionally receiving a secondary input from a depth estimation model as illustrated in figure 37.
  • if the feature to be extracted is a class of awareness of a person or a vehicle, preferably encoded by a polygon around the object, a neural network implementing a semantic segmentation model as illustrated in figure 33 and/or a video action recognition model as illustrated in figure 35 may be used.
  • for a forecast of a class of action, a neural network implementing an action forecast model as illustrated in figure 40 may be used.
  • the model preferably is configured to receive a video input stream.
  • the output of the respective model depends on the feature to be extracted:
    - the semantic segmentation model preferably provides a list with classes of objects and positions; see figure 33,
    - for class of gesture recognition, the video action recognition model preferably provides one anchor per recognized object encoding a polygon around the object and the class of gesture or action performed by that object,
    - for class of action recognition, the video action recognition model as illustrated in figure 35, preferably receiving a second input from a depth estimation model as illustrated in figure 37, provides an annotation of the class of action performed by a recognized object, the recognized object preferably being a person or a vehicle,
    - for recognition of a class of awareness of a person or a vehicle, the neural network preferably provides an annotation of a class of awareness for persons or vehicles recognized by means of the semantic segmentation module of figure 33 and/or the video action recognition model of figure 35, and
    - for a forecast of a class of action to be performed by a person or vehicle, preferably encoded by a polygon around the object, an action forecast module as illustrated in figure 40 provides one anchor per object encoding a polygon around the object, the class of action to be performed and a time interval for which the class of action is anticipated.
  • the models may share identical encoders having identical down-sampling modules and identical feature extraction modules. Further, the models may share identical feature fusion modules of the respective decoder. However, the classification module of the respective decoder varies depending on the feature to be extracted.
  • For object recognition, preferably a semantic segmentation model according to figure 33 is provided.
  • the model preferably implements an encoder-decoder architecture and receives an image from a video input stream.
  • the semantic segmentation model generates one feature map per class with the softmax values of pixel-wise segment predictions.
  • the encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix.
  • the encoder 122 further comprises a feature extraction module 128 for feature extraction.
  • the encoder preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion.
  • the decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
  • the feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
  • the decoder 124 preferably includes a classification module 132 for pixel-wise classification.
  • the semantic segmentation model may receive a secondary input from a depth estimation model as illustrated in figure 37, with their combination outlined in figure 51 and following.
  • the semantic segmentation model preferably implements polygon regression.
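As a rough illustration of the encoder-decoder layout described in the preceding bullets (down-sampling module 126, feature extraction module 128, feature fusion module 130, classification module 132), the following PyTorch sketch wires such modules together and outputs one softmax feature map per class. The layer sizes and the simple concatenation-based fusion are assumptions, not the claimed architecture.

```python
import torch
import torch.nn as nn

class SegmentationSketch(nn.Module):
    """Minimal encoder-decoder skeleton loosely mirroring modules 126/128/130/132."""
    def __init__(self, in_channels: int = 3, num_classes: int = 5):
        super().__init__()
        # down-sampling module (cf. 126)
        self.downsample = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # feature extraction module (cf. 128)
        self.feature_extraction = nn.Sequential(
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
        )
        # feature fusion module (cf. 130): fuses down-sampled input and extracted features
        self.feature_fusion = nn.Conv2d(64 + 128, 128, 1)
        # classification module (cf. 132): one feature map per object class
        self.classifier = nn.Conv2d(128, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        down = self.downsample(x)               # spatial detail
        feats = self.feature_extraction(down)   # context
        fused = self.feature_fusion(torch.cat([down, feats], dim=1))
        logits = self.classifier(fused)
        # up-sample to input resolution and normalize to per-class softmax scores
        logits = nn.functional.interpolate(
            logits, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return logits.softmax(dim=1)            # one softmax feature map per class

# usage sketch: scores = SegmentationSketch()(torch.rand(1, 3, 128, 256))
```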
  • For class of gesture recognition, preferably a video action recognition model according to figure 35 is provided.
  • the video action recognition model preferably implements an encoder-decoder architecture and receives a video input stream.
  • the video action recognition model generates one anchor per recognized object encoding a polygon around the recognized object and the class of gesture or action performed by that recognized object.
  • the encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix.
  • the encoder 122 further comprises a feature extraction module 128 for feature extraction.
  • the encoder 122 preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion.
  • the encoder 122 preferably is configured for polygon-wise regression of the polygon and/or classification of objects and/or classification of gesture or action.
  • the encoder 122 preferably comprises one or more temporal shift modules as illustrated in figure 36. Such temporal shift module 140 may be inserted into the down-sampling module 126, and/or inserted into the feature extraction module 128 and/or inserted into the feature fusion module 130.
  • the decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
  • the feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
  • the decoder 124 preferably includes a classification module 132 for generating one anchor per object encoding a polygon around the object and the class of gesture or action performed.
  • the video action recognition model may receive a secondary input from a depth estimation model as illustrated in figure 37, with their combination outlined in figure 51 and following.
  • the classification module 132 is based on the head of Poly-YOLO, https://arxiv.org/abs/2005.13243.
  • Optional temporal shift modules 140 for detection of actions over time by means of providing simultaneous input of m different points in time (10) are added. Further preferred features are variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint for spectral normalization of the softmax values (14).
  • the classification modules 132 of the video action recognition model as illustrated in figure 35 and the action forecast model as illustrated in figure 40 are differently trained, i.e. trained with different training data sets and thus are different.
  • the depth estimation model as illustrated in figure 37 preferably implements an encoder-decoder architecture.
  • the encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix.
  • the encoder 122 further comprises a feature extraction module 128 for feature extraction.
  • the encoder 122 preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion.
  • the encoder 122 preferably comprises one or more temporal shift modules as illustrated in figure 36.
  • Such temporal shift module 140 may be inserted into the down-sampling module 126, and/or inserted into the feature extraction module 128 and/or inserted into the feature fusion module 130.
  • the decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
  • the feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
  • the decoder 124 preferably includes a classification module 132 for depth estimation.
  • the depth estimation model may receive a secondary input from a semantic segmentation model as illustrated in figure 33, with their combination outlined in figure 51 and following.
  • the video action recognition model may further generate annotations with respect to relative value of uncertainty indicating an unknown class of gesture or action and/or indicating the case where another object is mistaken for a person or vehicle.
  • all dashed connections are optional.
  • Optional connectors 7, 8, 9 were added for parallel instantiation of context and spatial detail.
  • the classification module 132 is based on the head of Poly-YOLO, https://arxiv.org/abs/2005.13243.
  • Optional temporal shift modules 140 for detection of actions over time by means of providing simultaneous input of m different points in time (10) are added. Further preferred features are variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint for spectral normalization of the softmax values (14).
  • an action forecast model according to figure 40 is provided.
  • the action forecast model preferably implements an encoder-decoder architecture and receives a video input stream.
  • the video action forecast model generates one anchor per recognized object encoding a polygon around the recognized object, the class of action to be performed by that object, and a time interval for which the class of action is anticipated.
  • the encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix.
  • the encoder 122 further comprises a feature extraction module 128 for feature extraction.
  • the encoder 122 preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion.
  • the encoder 122 preferably is configured for polygon-wise regression of the polygon and/or classification of objects and/or classification of gesture or action.
  • the encoder 122 preferably comprises one or more temporal shift modules as illustrated in figure 36. Such temporal shift module 140 may be inserted into the down-sampling module 126, and/or inserted into the feature extraction module 128 and/or inserted into the feature fusion module 130.
  • the decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
  • the feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
  • the decoder 124 preferably includes a classification module 132 for generating one anchor per recognized object encoding a polygon around the recognized object, the class of action to be performed by that object, and a time interval for which the class of action is anticipated.
  • the action forecast model may receive a secondary input from a semantic segmentation model as illustrated in figure 33, with their combination outlined in figure 51 and following.
  • the action forecast model may receive a secondary input from a depth estimation model as illustrated in figure 37, with their combination outlined in figure 51 and following.
  • the action forecast model may receive a secondary input from a video action recognition model as illustrated in figure 35, with their combination outlined in figure 51 and following.
  • the video action recognition model may further generate annotations with respect to relative value of uncertainty indicating the case where another object is mistaken for a person or vehicle.
  • all dashed connections are optional.
  • Optional connectors 7, 8, 9 were added for parallel instantiation of context and spatial detail.
  • the classification module 132 is based on the head of Poly-YOLO, https://arxiv.org/abs/2005.13243.
  • Optional temporal shift modules 140 for detection of actions over time by means of providing simultaneous input of m different points in time (10) are added. Further preferred features are variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint for spectral normalization of the softmax values (14).
  • the classification modules 132 of the video action recognition model as illustrated in figure 35 and the action forecast model as illustrated in figure 40 are differently trained, i.e. trained with different training data sets and thus are different. Training of the classification module 132 of the action forecast model as illustrated in figure 40 can be performed by unsupervised learning using labels for training data sets that are generated by the video action recognition model as illustrated in figure 35. For doing so, time shift is applied to learn anticipated actions for action forecasting from a sequence of previously recognized actions as recognized by the video action recognition model as illustrated in figure 35.
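The time-shift based label generation described in the preceding bullet could look roughly like the following sketch: actions recognized by the video action recognition model at a later frame serve as automatically generated forecast labels for earlier frames. The function name, the per-frame string labels and the fixed horizon are illustrative assumptions.

```python
# Sketch: generate (frame index, future action) training pairs for the action
# forecast head from a sequence of actions recognized for one tracked object.
from typing import List, Tuple

def forecast_training_pairs(recognized_actions: List[str],
                            horizon: int) -> List[Tuple[int, str]]:
    """Return (frame_index, future_action) pairs for training the forecast head."""
    pairs = []
    for t in range(len(recognized_actions) - horizon):
        # the action observed `horizon` frames later becomes the forecast label at t
        pairs.append((t, recognized_actions[t + horizon]))
    return pairs

# e.g. forecast_training_pairs(["riding", "riding", "turning", "turning"], horizon=2)
# -> [(0, "turning"), (1, "turning")]
```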
  • temporal shift is implemented for detection of actions over time by means of providing simultaneous input of different points in time.
  • a temporal shift module 140 as illustrated in figure 36 can be provided.
  • the temporal shift module 140 is based on https://arxiv.org/abs/1811.08383.
  • Optional support for m features in the temporal dimension instead of only one is added.
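A minimal sketch of a temporal shift along the lines of the cited paper is shown below; it shifts a fraction of the channels by m steps in the temporal dimension and is only meant to illustrate the idea behind the optional temporal shift module 140, not its exact implementation. The tensor layout and the shift fraction are assumptions.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8, m: int = 1) -> torch.Tensor:
    """x: (batch, time, channels, H, W). Shift a fraction (1/shift_div) of the channels
    by m steps forward and another fraction by m steps backward in time, zero-padding
    at the sequence borders; the remaining channels stay unchanged."""
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, m:, :fold] = x[:, :-m, :fold]                    # shift forward in time by m
    out[:, :-m, fold:2 * fold] = x[:, m:, fold:2 * fold]    # shift backward in time by m
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # untouched channels
    return out
```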
  • Figure 34 illustrates the use case of gesture recognition.
  • the gestures to be recognized are “turning of head” of the recognized object “bicyclist” or “signaling turn” of the recognized object “bicyclist”.
  • Figure 38 illustrates awareness recognition. Depending on the orientation of a bicyclist's head, the recognized object "bicyclist" is provided with an annotation representing a class of awareness, i.e. "unaware" or "aware". As in all other cases illustrated herein, for each class annotation (e.g. "unaware" or "aware") an uncertainty score can be determined and assigned to the class annotation.
  • Figure 39 illustrates action forecast for the recognized object "bicyclist". Depending on the situation (recognized lanes and the recognized non-moving object "car"), a forecast for the position of the object "bicyclist" is generated.
  • the models described above can be used to implement functions, e.g.
    a. to derive controllability of a situation from the awareness of persons or vehicles of the ego vehicle
    b. to derive a risk estimation for each person or vehicle
      i. with severity
  • the models described above can also be used to implement use cases, e.g.
    a. for any action performed by a person or vehicle which the model has been trained to recognize
      i. generate a warning
      ii. and/or annotate the video recording with the license plate of the vehicle
      iii. and/or provide this information to an autonomous driving system
      iv. when a person, bicyclist or motorcyclist is about to cross the street
  • the cascaded monitoring concept 150 may comprise two parallel situation monitors 152 to recognize edge cases with respect to the dataset, deriving segments from pixel-wise uncertainty, encoded by a polygon around the segment.
  • a primary situation monitor 152 implements a model to match the pixel-wise reconstruction loss of an autoencoder model to segments by means of a model with the same topology as the classification module of the decoder of the action recognition model.
  • the model may be configured for matching pixel-wise epistemic uncertainty to segment boundaries with pixel-wise epistemic uncertainty provided by the segmentation model and/ or by the depth estimation model and/ or optionally by the autoencoder model.
  • the model may be configured for matching pixel-wise aleatoric uncertainty to segment boundaries to recognize the case where multiple overlapping segments have the same class and there is not enough epistemic uncertainty.
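One simple way to match pixel-wise uncertainty to segments, as described in the bullets above, is to aggregate the uncertainty score map over each segment mask. The following sketch (NumPy, mean aggregation) is an illustrative assumption, not the claimed matching model.

```python
import numpy as np

def segment_uncertainty(uncertainty_map: np.ndarray,
                        segment_map: np.ndarray) -> dict:
    """uncertainty_map, segment_map: (H, W) arrays of the same shape.
    Returns the mean pixel-wise uncertainty per segment id."""
    scores = {}
    for segment_id in np.unique(segment_map):
        mask = segment_map == segment_id
        scores[int(segment_id)] = float(uncertainty_map[mask].mean())
    return scores
```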
  • Figure 43 illustrates a situation monitor 154 based on the head of Poly-YOLO, https://arxiv.org/abs/2005.13243.
  • the situation monitor 154 implements an autoencoder model 160 to provide the reconstruction loss.
  • the model preferably implements an encoder-decoder architecture and receives an image from a video input stream.
  • the semantic segmentation model generates one feature map per class with the softmax values of pixel-wise segment predictions.
  • the encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix.
  • the encoder 122 further comprises a feature extraction module 128 for feature extraction.
  • the encoder preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion.
  • the decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
  • the feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
  • the decoder 124 preferably includes a classification module 132 for reconstruction of the input image and computation of a pixel-wise reconstruction loss.
  • Figure 44 illustrates an autoencoder model 160. All dashed connections are optional. Optional connectors 7, 8, 9 were added for parallel instantiation of context and spatial detail. Depth estimation module 132 is independently designed with symmetry to the classification module. Optional temporal shift modules 140 for detection of actions over time by means of providing simultaneous input of m different points in time (10) are added. Further preferred features are variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint for spectral normalization of the softmax values (14).
  • the other situation monitor 152 implements a model to match the pixel-wise reconstruction loss of an autoencoder model to segments.
  • the other situation monitor 152 comprises two parallel validity monitors 156 and 158.
  • One validity monitor is a Bayesian validity monitor 158 comprising dropout layers, inserted into convolutional submodules, for variational inference, deriving uncertainty by means of sampling under dropout, and a sampler for sampling the model with partial model sampling and/or with consecutive video sampling. Matching the edge cases found to the segments can be achieved by means of a probabilistic filter (e.g. a Kalman filter).
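A sketch of sampling under dropout for variational inference, as used by the Bayesian validity monitor 158, is given below. It assumes a PyTorch model whose dropout layers are re-activated at inference time and uses the variance over the sampled predictions as the epistemic uncertainty; the number of samples is an assumption.

```python
import torch

def mc_dropout_variance(model: torch.nn.Module, image: torch.Tensor,
                        num_samples: int = 8) -> torch.Tensor:
    """Sample the model under active dropout (Monte Carlo dropout) and return the
    per-element variance of the predictions as an epistemic uncertainty estimate."""
    model.eval()
    # re-activate only the dropout layers so that weights stay fixed but sampling varies
    for module in model.modules():
        if isinstance(module, (torch.nn.Dropout, torch.nn.Dropout2d)):
            module.train()
    with torch.no_grad():
        samples = torch.stack([model(image) for _ in range(num_samples)])
    return samples.var(dim=0)   # variance over the sampled predictions
```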
  • Figure 45 illustrates a convolutional submodule
  • Figure 46 illustrates a convolutional submodule with dropout. This module can be based on https://arxiv.org/abs/1703.04977 and https://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf.
  • Figure 47 illustrates a Bayesian sampling module
  • the other validity monitor 156 can be a Lipschitz validity monitor, inserted into convolutional submodules, normalizing the softmax to correlate with uncertainty, which is optionally provided to the situation monitor by means of the softmax value and/or the softmax variance.
  • Figure 48 illustrates a Lipschitz submodule. This module can be based on https://arxiv.org/abs/2102.11582 and https://arxiv.org/abs/2106.02469.
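As one possible realization of the Lipschitz constraint mentioned above, the sketch below wraps convolutional layers with spectral normalization, which bounds their largest singular value. The cited papers use more specific (bi-)Lipschitz formulations, so this is only an approximation of the idea, not the claimed Lipschitz submodule.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def add_lipschitz_constraint(model: nn.Module) -> nn.Module:
    """Recursively wrap every Conv2d layer with spectral normalization so that its
    Lipschitz constant (largest singular value) is bounded."""
    for name, module in model.named_children():
        if isinstance(module, nn.Conv2d):
            setattr(model, name, spectral_norm(module))
        else:
            add_lipschitz_constraint(module)   # recurse into nested submodules
    return model
```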
  • Uncertainty can be provided to probabilistic filters.
  • Figure 49 illustrates an integration of the situation monitor with a Kalman filter. Preferably, similar classes are predicted in case of uncertainty.
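One conceivable integration of the uncertainty score with a Kalman filter (figure 49) is to inflate the measurement noise for uncertain detections so that they influence the track less. The following sketch shows a standard Kalman measurement update with such an assumed scaling; the scaling rule and the direct-observation model are illustrative assumptions.

```python
import numpy as np

def kalman_update(state, covariance, measurement, uncertainty_score, base_noise=1.0):
    """Kalman measurement update where the measurement noise R is inflated by the
    perception uncertainty score, so uncertain detections pull the estimate less.
    state: (n,), covariance: (n, n), measurement: (n,)."""
    n = len(state)
    R = base_noise * (1.0 + uncertainty_score) * np.eye(n)  # uncertainty-scaled measurement noise
    H = np.eye(n)                                            # direct observation model (assumption)
    S = H @ covariance @ H.T + R
    K = covariance @ H.T @ np.linalg.inv(S)                  # Kalman gain
    new_state = state + K @ (measurement - H @ state)
    new_covariance = (np.eye(n) - K @ H) @ covariance
    return new_state, new_covariance
```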
  • Figure 50 illustrates a use case of similarity prediction
  • all or a subset of the models used in the extraction of features are integrated within a single (multi-head) model, consisting of one or more shared encoders.
  • the multi-head model may comprise a segmentation head and/or a depth estimation head and/or an autoencoder head and/or a video action recognition head and/or an action forecast model.
  • Figure 51 is an overview of a multi-functional multi-head model
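The shared-encoder, multi-head arrangement described above can be sketched as follows; the encoder and head modules are placeholders and the dictionary-based head selection is an assumption, not the claimed model of figure 51.

```python
import torch
import torch.nn as nn

class MultiHeadPerceptionSketch(nn.Module):
    """Sketch of a multi-head model with one shared encoder and task-specific heads."""
    def __init__(self, encoder: nn.Module, heads: dict):
        super().__init__()
        self.encoder = encoder             # shared encoder (cf. figure 52)
        self.heads = nn.ModuleDict(heads)  # e.g. segmentation, depth, autoencoder heads

    def forward(self, x: torch.Tensor) -> dict:
        features = self.encoder(x)
        return {name: head(features) for name, head in self.heads.items()}

# usage sketch with placeholder modules:
# model = MultiHeadPerceptionSketch(
#     encoder=nn.Conv2d(3, 64, 3, padding=1),
#     heads={"segmentation": nn.Conv2d(64, 5, 1), "depth": nn.Conv2d(64, 1, 1)},
# )
```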
  • Figure 52 illustrates an encoder of a multi-functional model. All dashed connections are optional. Temporal shift (10), variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint (14), have been added, each one optionally.
  • Figure 53 illustrates a segmentation head (i.e. classification module) of a multi-functional model. All dashed connections are optional. A temporal shift, a Lipschitz constraint and dropout for variational inference to all layers of the decoder and a Lipschitz constraint to all layers of the situation monitor have been added (each one optionally). The situation monitor processes input from the Bayesian variance of variational inference (optionally), and/or the softmax values normalized by the Bi-Lipschitz constraint (optionally), and/or the softmax variance normalized by the Bi-Lipschitz constraint (optionally).
  • Figure 54 illustrates a video action recognition head (i.e. classification module) of a multi-functional model. All dashed connections are optional. A temporal shift, a Lipschitz constraint and dropout for variational inference to all layers have been added (each one optionally).
  • Figure 55 illustrates a depth estimation head (i.e. classification module) of a multi-functional model. All dashed connections are optional. A temporal shift, a Lipschitz constraint and dropout for variational inference to all layers of the decoder and a Lipschitz constraint to all layers of the situation monitor have been added (each one optionally).
  • the situation monitor processes input from the Bayesian variance of variational inference (optionally), and/or the softmax values normalized by the Bi-Lipschitz constraint (optionally), and/or the softmax variance normalized by the Bi-Lipschitz constraint (optionally).
  • Figure 56 illustrates an autoencoder head (i.e. classification module) of a multi-functional model. All dashed connections are optional.
  • a Lipschitz constraint is added to all layers of the decoder and to all layers of the situation monitor (each one optionally).
  • the situation monitor processes input from the pixel-wise reconstruction loss of the autoencoder head (optionally), and/or the softmax values normalized by the Bi-Lipschitz constraint (optionally), and/or the softmax variance normalized by the Bi-Lipschitz constraint (optionally).
  • Figure 57 illustrates an action forecast head (i.e. classification module) of a multi-functional model. All dashed connections are optional. A temporal shift, a Lipschitz constraint and dropout for variational inference to all layers have been added (each one optionally).
  • Figure 58 illustrates how the system can be implemented following the Sense-Plan-Act concept. Perception, planning and control subsystems shall each have their own hardware. The perception systems, in particular, shall be integrated with their sensor hardware. The map shall be instantiated on the hardware of the planning subsystem.
  • the sensors preferably comprise six vision sensors for the near range, each 120 degrees field of view, each 40 degrees overlapping and each instantiating one separate perception subsystem; see figure 59.
  • the sensors preferably comprise one vision sensor with 60 degrees field of view, instantiating one separate perception subsystem for the medium range.

Abstract

According to the invention, a perception system is provided that comprises a segmenting neural network and an uncertainty detector. The segmenting neural network is configured and trained for segmentation of an input image pixel matrix to thus generate a segment map composed of elements that correspond to the pixels of the input image pixel matrix. By way of class prediction, each element of the segment map is assigned to one of a plurality of object classes the segmenting neural network is trained for. Elements being assigned to the same object class form a segment of the segment map. The uncertainty detector is configured to generate an uncertainty score map that is composed of elements that correspond to the pixels of the input image pixel matrix. Each element of the uncertainty map has an uncertainty score that is determined by the uncertainty detector and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map.

Description

System for detection and management of uncertainty in perception systems, for new object detection and for situation anticipation
The invention refers to a system for detection and management of uncertainty in perception systems. In further embodiments the invention refers to a system for new object detection and/or for situation anticipation based on detected uncertainty.
One particular field of application of machine learning or deep learning is computer vision that can for instance be used in perception systems. In computer vision, deep neural networks learn to recognize classes of objects in images. Computer vision includes simple classification, in which a whole image is classified with the object class of its most dominant object; object detection, in which a neural network predicts bounding boxes around the ob- jects in the image; semantic segmentation, in which every pixel of an image is classified (labeled) with the object class of the object to which it belongs; optical flow, which predicts a field of vectors of movement for the objects shown; as well as derived methods like instance segmentation, panoptic segmentation, video segmentation and more. For all these methods, the neural network needs to be trained by means of a training dataset in order to learn classes of objects or features to predict. For supervised learning, this training dataset comprises pairs of an input image and a corre- sponding label with the desired result (i.e. the object class represented in the input image). A label can be a simple class name for the whole image in the case of classification; or a pixel-wise hand-labelled image in the case of semantic segmentation. Herein, we will most often refer to the case of semantic segmentation. However, the methods and techniques claimed shall be applicable to all the various computer vision techniques and all the various architectures of neural networks associated with these techniques.
The invention relates in particular to semantic segmentation of images by means of a segmenting neural network. Images are typically composed of pixels that represent a picture taken by a camera with a lens that projects an image on an image sensor. The image sensor converts the projected picture into a pixel matrix that represents the picture. A picture or image can for instance be a frame of a video stream.
The pixel matrix representing an image can be fed to an input layer of a segmenting neural network and thus be used as an input image pixel matrix. Each input image pixel matrix represents a sample to be processed by the segmenting neural network.
Semantic segmentation of an image represented by a pixel matrix serves to assign regions of the image - or more precisely: each pixel in a respective segment of an image - to recognized objects. For recognizing objects in an image, typically convolutional neural networks (CNN) are used, for instance fully convolutional neural networks (FCN). Those convolutional neural networks are trained as multi-class classifiers that can detect objects they are trained for in an image. A fully convolutional neural network used for semantic segmentation typically comprises a plurality of convolutional layers for detecting features (i.e. occurrence of objects the CNN is trained for) and pooling layers for down-sampling the output of the convolutional layers in a certain stage. Layers are composed of nodes. Nodes have an input and an output. The input of a node can be connected to some or all outputs of nodes in an anteceding layer, thus receiving output values from the outputs of nodes of the anteceding layer. The values a node receives via its inputs are weighted and the weighted inputs are summed up to thus form a weighted sum. The weighted sum is transformed by an activation function of the node into the output of that node. The structure defined by the layers of a neural network - its topology or architecture - is defined prior to training of a neural network.
During training of a neural network the weights in the nodes of the layers of the neural network are modified until the neural network provides the desired or expected prediction. The neural network thus learns to predict (e.g. recognize) classes of objects or features it is trained for. The parameters created during training of the neural network, in particular the weights for the inputs of the nodes, form a model. Thus, a trained neural network implements a model.
Regarding the convolutional layers, it is known to the skilled person that these use filter kernel arrays that have a much smaller dimension than the pixel matrix representing the image. The filter kernel arrays are composed of array elements having weight values. A filter kernel array is stepwise moved over the image pixel matrix and each value of a pixel of the image pixel matrix is element-wise multiplied with the weight value of the respective element of the filter kernel matrix while the filter kernel matrix "moves over" the image pixel matrix, thus convolving the image pixel matrix. Typically, a plurality of filter kernel arrays are used to extract different low-level features.
The convoluted output from such convolution in a convolutional layer may be fed as input to a next convolutional layer and again be convoluted by means of filter kernel arrays. The convoluted output is called a feature map. For instance, for each color channel of a color image, a feature map can be created.
Eventually the convoluted output of a convolution layer can be rectified by means of a nonlinear function, for instance a ReLU (Rectified Linear Unit) operator. The ReLU operator is a preferred activation function of the nodes of the convolutional layer, which eliminates all negative values from the output of the convolutional layer. The purpose of the non-linear function is that a network without it could only learn linear relations, because the convolution operation itself is linear, s(t) = (x*w)(t) for learned weights w and input x; cf. https://www.deeplearningbook.org/contents/convnets.html, Eq. 9.2. The non-linear function enables the network to learn non-linear relations. The elimination of negative values is only a side effect of the ReLU, which is not necessary for learning, and - on the contrary - even undesired. A "Leaky ReLU", for instance, does not have this side effect. Most architectures choose ReLU due to its simplicity, ReLU(z) = max(0, z); cf. https://www.deeplearningbook.org/contents/mlp.html, Eq. 6.37. The ReLU operators are part of the respective convolutional layer. In order to reduce the dimension of the feature map by way of down-sampling, pooling layers are used. One way of down-sampling is called Max-Pooling. When using Max-Pooling, for instance each 2*2 sub-array of the feature map is replaced by a single value that corresponds to the maximum value of the four elements of the 2*2 sub-array. The down-sampled feature map can again be processed in a convolutional layer. Eventually, the feature map from the final convolutional layer or Pooling-layer is a score map.
For each object class the neural network is trained for, a feature score map is generated, and from the feature score maps a segment map is generated. If the neural network is for instance trained for five object classes (e.g. cars, traffic signs, human beings, street markings, street borders), then five feature score maps are generated.
The scores in one feature score map represent, for one object class the fully convolutional network is trained for, the likelihood that an object of this object class is represented in the input image matrix. Thus, the objects represented in an input image pixel matrix are "recognized" and a feature score map is generated for each object class wherein for each (down-sampled) pixel a score is formed that indicates the likelihood that a pixel represents an object of the object class. The scores correspond to activation levels of elements of the feature score map. The scores of the elements of the feature score maps can be compared with each other in order to assign each element or pixel, respectively, to one of the known object classes. Thus, a segment map can be generated wherein the elements are labeled with labels indicating to which of all the objects the segmenting neural network is trained for an individual pixel may belong.
If, for instance, the neural network is trained for five objects, five feature score maps are generated. By means of a Softmax function, the scores in each feature score map can be normalized on a scale between 0 and 1; cf. https://www.deeplearningbook.org/contents/mlp.html, Eq. 6.29. The normalized scores for corresponding elements of the feature score maps can be compared and the element can be assigned to that object class that corresponds to the score map having the highest score for the respective element. This is done by means of a maximum likelihood estimator using an Argmax function; cf. https://www.deeplearningbook.org/contents/ml.html, Eq. 5.56. Thus, a segment map can be generated from the feature score maps by way of comparing the scores for corresponding elements in the feature score maps. The score for each element of a feature score map for a respective object class represents an activation level of that element. The higher the activation level - and thus, the higher the score - the higher would be the chance that the element of the score map corresponds to a pixel (or a number of pixels) of the input image pixel matrix that belongs to the respective object, e.g. is a pixel of a car in the image, if the score map is for the object class "car".
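As a compact illustration of the softmax/argmax step described above, the following sketch turns a stack of feature score maps into a segment map. It assumes NumPy arrays and raw (unnormalized) scores; the function name is illustrative.

```python
import numpy as np

def segment_map_from_scores(feature_score_maps: np.ndarray) -> np.ndarray:
    """feature_score_maps: (num_classes, H, W) raw scores, one map per object class.
    Returns an (H, W) segment map of class indices via softmax followed by argmax."""
    # subtract the per-pixel maximum for numerical stability before exponentiating
    shifted = feature_score_maps - feature_score_maps.max(axis=0, keepdims=True)
    softmax = np.exp(shifted) / np.exp(shifted).sum(axis=0, keepdims=True)
    # maximum likelihood class per element (argmax over the class dimension)
    return softmax.argmax(axis=0)
```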
The fully convolutional network (FCN) serves as an encoder that can detect objects in an image matrix and produces a low-resolution tensor containing high-level information, i.e. complex features.
The final feature score maps are up-sampled to assign to each pixel of the input pixel matrix a score that indicates the likelihood that a pixel represents an object of a known object class. Such up-sampling can be achieved by means of bilinear interpolation.
Alternatively, up-sampling can be achieved by a decoder.
For semantic segmentation, all architectures are encoder decoder architectures, where down-sampling and up-sampling mean the process of learning simple, more abstract features from pixels in the first convolutional layer, learning complex features from simple features in the next layer, and so on, and learning the same process vice versa in the decoder. The up-sampling by means of bilinear interpolation is done because the size of the feature score maps of the last layer is not necessarily the same as the size of the input image, and - at least in training - the size of the output needs to match the size of the labels which have the same size as the input.
For semantic segmentation, a segment map is generated from the feature score maps. To each element of the segment map a label is assigned that indicates for which of the known object classes the highest activation level is found in the feature score maps.
If the convolutional neural network is not used for semantic segmentation of images but for merely identifying the presence of an object in an image (i.e. simple classification), the output of a final convolutional layer, ReLU-layer or pooling-layer can be fed into a fully connected layer that produces an output vector wherein the values of the vector elements represent a score that indicates the likelihood that a respective object is present in the analyzed image. The output vector of such classifying CNN is thus a feature vector. A neural network for semantic segmentation (further on called segmenting neural network) does not have such fully connected layer because this would destroy the information about the location of objects in the input image matrix.
In any case, the neural network needs to be trained by means of training data sets comprising image data and labels that indicate what is represented by the image data (ground truth). The image data represent the input image matrix and the labels represent the desired output. In a back propagation process, the weights and the filter kernel arrays are iteratively adapted until the difference between the actual output of the CNN and the desired output is minimized. The difference between the actual output of the CNN and the desired output is calculated by means of a loss function. From a training dataset, containing pairs of an input image and a ground truth image that consists of the correct class labels, the neural network computes the class predictions for each pixel in the output image. In training, a loss function compares the input class labels with the predictions made by the neural network and then pushes the parameters - i.e. the weights in the nodes of the layers - of the neural network in a direction that would have resulted in a better prediction. When this is done again and again with lots of pairs of input images and ground truth images, the neural network will learn the abstract concepts of the given classes.
As indicated above, semantic segmentation results in assigning pixels of an image to known objects, i.e. objects the neural network was trained for. Mere semantic segmentation cannot discriminate between several instances of the same object in an image. If, for instance, in an image several cars are visible, all pixels belonging to a car are assigned to the object "car". The individual cars - i.e. instances of the object "car" - are not discriminated. Discriminating instances of objects of an object class requires instance segmentation. Pixels that do not belong to an object that a neural network is trained for will nevertheless be assigned to one of the object classes the neural network was trained for, and the score might even be high. Even with low scores, the object class with the highest relative score would "win", i.e. the pixel will be assigned to the object class with the highest score.
When computer vision and semantic segmentation in particular is used in real life environments, for instance in vehicles to promote autonomous driving, the uncertainty involved in computer vision needs to be taken into account. Alex Kendall et al. "Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding", 10 October 2016, https://arxiv.org/pdf/1511.02680v2.pdf, present a deep learning framework for probabilistic pixel-wise semantic segmentation and a system which is able to predict pixel-wise class labels with a measure of model uncertainty.
It is an object of the invention to provide a system that can assess uncertainty in perception systems.
According to the invention, a perception system is provided that comprises a segmenting neural network and an uncertainty detector. The perception system can be part of a vehicle, for instance a car, to enable autonomous driving. In various embodiments, the perception system can also be part of various autonomous machines, for instance robots, cargo systems or other machines that are designed to operate at least in part autonomously.
The segmenting neural network is configured and trained for segmentation of an input image pixel matrix to thus generate a segment map composed of elements that correspond to the pixels of the input image pixel matrix. By way of class prediction, each element of the segment map is assigned to one of a plurality of object classes the segmenting neural network is trained for. Elements being assigned to the same object class form a segment of the segment map. The uncertainty detector is configured to generate an uncertainty score map (also called "uncertainty map") that is composed of elements that correspond to the pixels of the input image pixel matrix. Each element of the uncertainty map has an uncertainty score that is determined by the uncertainty detector and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map. Preferably, the uncertainty detector is configured to access feature score maps generated by the segmenting neural network prior to generating the segment map from said feature score maps. The uncertainty detector is configured to determine the amount of uncertainty and thus the uncertainty score for each element of the uncertainty score map by determining a variance of the activation levels of elements of the feature score maps. Preferably, the uncertainty detector is configured to determine the amount of uncertainty and thus the uncertainty score for each element of the uncertainty score map by determining an inter-sample variance of the activation levels of an element of a feature score map in different samples of the feature score map. Preferably, the uncertainty detector is configured to generate inter-sample variances of the activation levels for an element of a feature score map in different samples of the feature score map by processing an input image pixel matrix with the segmenting neural network in multiple passes while for each pass the segmenting neural network is randomly modified.
Preferably, the random modification of the segmenting neural network includes at least one of dropping nodes from a convolutional layer of the segmenting neural network, altering activation functions in at least some nodes of at least some layers of the segmenting neural network and/or introducing noise in at least some nodes of at least some layers of the segmenting neural network.
Preferably, the uncertainty detector is configured to determine an inter-sample variance of the activation levels of corresponding elements of feature score maps that are generated from consecutive image pixel matrices corresponding to frames of a video sequence. Since in this case the samples are frames of a video sequence, the inter-sample variances correspond to variances between frames and thus are also called "inter-frame variance" hereinafter. In a potential implementation of this embodiment, one pass of determining inter-sample variance from n consecutive images will process images a..a+n, and then restart with frames a+n+1...a+2n. To increase the likelihood of not missing a yet unknown object o that appears somewhere within this sequence a...a+i...a+n, it is preferred to run several such passes in parallel, each starting with an offset b, e.g. the second parallel pass at a+b...a+b+n, which then restarts at a+b+n+1...a+b+2n. In this second parallel pass, the object o will be closer to the beginning of the pass than in the first parallel pass, and the process will build a better variance with respect to the object o.
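The staggered passes described above can be sketched as follows; windowing with stride n and a single offset pass are simplifying assumptions, and the per-frame score maps are assumed to be NumPy arrays of identical shape.

```python
import numpy as np

def inter_frame_variance_passes(score_maps, n: int, offset: int):
    """score_maps: list of (num_classes, H, W) arrays, one per frame.
    Runs two staggered passes that compute the variance of the activation levels
    over n consecutive frames; the second pass starts `offset` frames later, so an
    object appearing late in one window sits earlier in the other."""
    def windows(start):
        for i in range(start, len(score_maps) - n + 1, n):
            # element-wise variance over the n frames of this window
            yield np.var(np.stack(score_maps[i:i + n]), axis=0)
    return list(windows(0)), list(windows(offset))
```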
To improve the efficiency of the determination of inter-frame variance, it is possible to use only every second, third or nth frame of a respective sequence, depending on the frame rate. Typical frame rates would be between 25 and 60 frames per second (fps). Preferably, the uncertainty detector is configured to determine the amount of uncertainty and thus the uncertainty score for each element of the uncertainty score map by determining inter-class variances between activation levels of corresponding elements of feature score maps for different object classes as provided by the segmenting neural network. Preferably, the uncertainty detector comprises a generative neural network, in particular a variational autoencoder that is trained for the same classes or objects the segmenting neural network is trained for.
Preferably, the uncertainty detector is configured for detecting a region in the uncertainty score map that is composed of elements having a high uncertainty score and for labeling that region as a candidate for representing an object of a yet unknown object class.
High uncertainty scores can be uncertainty scores that are higher than an average uncertainty score of all elements of the uncertainty score map. High uncertainty scores can be uncertainty scores that are higher than a median of the uncertainty scores of all elements of the uncertainty score map. High uncertainty scores can be uncertainty scores that exceed an average uncertainty score or a median of the uncertainty scores of all elements of the uncertainty score map by a predetermined amount.
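A minimal sketch of the thresholding described above is given below; it assumes uncertainty scores on a roughly normalized scale so that a fixed margin over the mean or median is meaningful, and the margin value is an illustrative assumption.

```python
import numpy as np

def unknown_object_candidate_mask(uncertainty_map: np.ndarray,
                                  margin: float = 0.1,
                                  use_median: bool = False) -> np.ndarray:
    """Mark elements whose uncertainty exceeds the mean (or median) of the map by a
    predetermined margin as candidates for a yet unknown object class."""
    reference = np.median(uncertainty_map) if use_median else uncertainty_map.mean()
    return uncertainty_map > (reference + margin)
```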
It is further preferred if the uncertainty detector is configured for verifying that the region that is labeled as a candidate for representing an object of a yet unknown object class indeed represents an object of a yet unknown object class by determining the plausibility that a segment of the input image pixel matrix corresponding to the labeled region represents an unknown object.
Potential ways for determining such plausibility as disclosed in the section "Preferred ways of checking a plausibility that a segment represents an unknown object" are:
Preferably, the perception system is configured to reconfigure the segmenting neural network by adding a model for a new object class for instance by way of introducing or configuring further layers with weights that represent the new object class, wherein the new object class includes the so far unknown object represented in the segment found to represent a so far unknown object.
Creating a new segmenting neural network that is configured for predicting new object classes (i.e. it incorporates a model for a new object class, said model being created by training of the neural network) or reconfiguring the existing segmenting neural network so as to enable the existing segmenting neural network to predict objects of an additional new object class can be part of a domain adaptation that extends an operation domain of the perception system.
Reconfiguring the existing segmenting neural network preferably includes the training of the segmenting neural network with input image pixel matrices representing the so far unknown object in combination with a label for the then new object class (ground truth for the so far unknown and then new object class). Labels can be generated automatically for a segment representing an unknown object.
Preferably, the existing segmenting neural network or preferably a second, similar companion neural network is trained with the newly determined yet unknown object class, either in the cloud, by uploading the newly detected object class and, after training, downloading the trained neural network or trained similar companion neural network, or for instance right on the perception system of an autonomous vehicle.
The newly detected object class can for instance be directly shared with other autonomous vehicles, so each vehicle can train its segmenting neural network or similar companion neural network on a larger number of yet unknown object classes without having to depend on an update from the cloud.
Such training over multiple autonomous vehicles can be parallelized by means of implementing a distributed data parallel training (as explained in: PyTorch Distributed: Experiences on Accelerating Data Parallel Training, https://arxiv.org/pdf/2006.15704.pdf). Any such training will preferably be conducted by means of few-shot learning (as explained in: An Overview of Deep Learning Architectures in Few-Shot Learning Domain, https://arxiv.org/pdf/2008.06365.pdf). The assignment of individual yet unknown objects to new object classes can be automated by their similarity, preferably determined by means of unsupervised segmentation (as explained in: Unsupervised Learning of Image Segmentation Based on Differentiable Feature Clustering, https://arxiv.org/pdf/2007.09990.pdf). A newly created model for a new object class can be associated automatically to already existing models for known object classes by similarity, by way of determining the similarity by means of one-shot learning (as explained in: Fully Convolutional One-Shot Object Segmentation for Industrial Robotics, https://arxiv.org/pdf/1903.00683.pdf). In one-shot learning, the elements of the abstract feature vector of the neural network which conducts the one-shot learning have a semantic structure, e.g. the feature vector might consist of textual descriptions of the objects, in which e.g. the description of a rickshaw would be similar to the description of a car, and thus, the newly determined object class for a rickshaw can be treated as car-like by the autonomous vehicle without the need of manual association.
According to the invention, a method for semantic segmentation of input image pixel matrices by way of class prediction performed by a segmenting neural network and for determining an amount of uncertainty involved in the class predictions for each pixel of an input image pixel matrix is provided. The method comprises segmenting an input image pixel matrix by means of a segmenting neural network that is trained for a plurality of object classes and that generates for each object class a feature score map and therefrom a segment map for the input image pixel matrix by assigning elements of the segment map to one of a plurality of object classes the segmenting neural network is trained for by way of class prediction, elements being assigned to the same object class forming a segment of the segment map, and generating an uncertainty score map (short: uncertainty map) composed of elements that correspond to the pixels of the input image pixel matrix, each element of the uncertainty map having an uncertainty score that is determined by the uncertainty detector and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map.
Preferably, the method further comprises:
- detecting a region in the uncertainty score map that is composed of elements having a high uncertainty score, and
- labeling the region that is composed of elements having a high uncertainty score as a candidate for representing an object of a yet unknown object class, a high uncertainty score being an uncertainty score that is higher than an average uncertainty score of all elements of the uncertainty score map.
Preferably, the method further comprises creating a new object class if a region that is composed of elements having a high uncertainty score is detected, said new object class representing objects as shown in a region of the input image pixel matrix corresponding to the region in the uncertainty score map that is composed of elements having a high uncertainty score. The method for generating new object classes thus may include:
- detecting a region in the uncertainty score map that is composed of elements having a high uncertainty score,
- generating a tentative new object class and automatically generating a label, and
- recognizing an existing object class or another new object class being similar to the tentative new object class, for instance by means of few-shot learning.
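A minimal sketch of such a region detection is given below; it assumes the uncertainty score map is available as a 2-D NumPy array, uses the above rule that a "high" score is one above the map average, and adds an illustrative minimum region size that is not prescribed by the text.

```python
# Sketch: find contiguous regions whose uncertainty exceeds the map average
# and mark them as candidates for a yet unknown object class.
import numpy as np
from scipy import ndimage

def candidate_regions(uncertainty_map: np.ndarray, min_pixels: int = 50):
    high = uncertainty_map > uncertainty_map.mean()   # "high" = above map average
    labeled, n = ndimage.label(high)                  # connected components
    candidates = []
    for region_id in range(1, n + 1):
        mask = labeled == region_id
        if mask.sum() >= min_pixels:                  # illustrative size filter
            candidates.append(mask)                   # candidate new-object region
    return candidates
```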
The method may comprise generating a tentative new object class in case an unknown object is detected. If a further unknown object is detected, the method comprises assigning the further unknown object to the existing tentative new object class or to a further new object class, depending on the similarity of the unknown objects. In case a further unknown object can be assigned to a previous unknown object (based on the similarity of the objects), a new object class (that is not tentative anymore) can be generated by one-shot learning or few-shot learning using samples that represent new objects that are similar to each other.
Alternatively, the method for generating new object classes may include:
- generating feature score maps from input image pixel matrices (samples) captured at various instances in time,
- detecting a region in the uncertainty score map that is composed of elements having a high uncertainty score, said region representing a first unknown object,
- generating a tentative new object class for said unknown object and automatically generating a label,
- detecting a further region in a further uncertainty score map that is composed of elements having a high uncertainty score, said region representing a further unknown object,
- determining a similarity between the first and the further unknown object, and
- if the similarity exceeds a predetermined threshold, generating a non-tentative new object class.
Preferably, generating a non-tentative new object class includes one-shot learning or few-shot learning using samples (i.e. input image pixel matrices) that represent the first unknown object and the further unknown object.
No explicit labels are needed; only automatically generated labels are used for generating new object classes (for instance by means of a new object detector) and for transfer learning of new object classes (for instance as generated by the new object detector). A relevance of a new object class can be determined based on a user reaction in the context of encountering an unknown, potentially new object. This is explained in more detail further below. User reaction in the context of encountering an unknown, potentially new object can be determined based on optical flow, an inertia sensor (gyroscope) or from signals on the CAN bus of a vehicle. New object classes thus generated fully automatically can be used for training and thus updating the semantic segmentation model implemented by the segmenting neural network of the perception system.
A region that is composed of elements having a high uncertainty score can correspond to a part of a segment of the segment map. For instance, if a part of an input image pixel matrix represents a yet unknown object, the pixels of this part will be assigned to a known object class by the segmenting neural network. However, the pixels representing the unknown object (which does not correspond to any of the object classes the segmenting neural network was trained for) will typically exhibit a higher uncertainty score, and thus the yet unknown object can be found by means of the uncertainty score map. Preferably, the uncertainty score is determined by determining an inter-class variance of the activation levels of elements of the feature score maps for different object classes.
Preferably, the uncertainty score is determined by determining an inter-sample variance of the activation levels of elements of different samples of a feature score map for one object class.
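The two variance measures can be sketched as follows; this is an illustrative NumPy formulation that assumes the feature score maps are available as arrays with a class dimension (and, for the inter-sample case, an additional sample dimension), and the aggregation over classes by the mean is merely one possible choice.

```python
# Sketch: two per-pixel uncertainty scores derived from feature score maps.
import numpy as np

def inter_class_variance(score_maps: np.ndarray) -> np.ndarray:
    # score_maps: (num_classes, H, W) activation levels for one input image.
    return score_maps.var(axis=0)          # variance across classes, per pixel

def inter_sample_variance(samples: np.ndarray) -> np.ndarray:
    # samples: (num_samples, num_classes, H, W) score maps from repeated,
    # randomly modified passes (or from consecutive video frames).
    per_class = samples.var(axis=0)        # variance across samples, per class
    return per_class.mean(axis=0)          # aggregate to one score per pixel
```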
A further aspect of the invention is the use of the method and its embodiments as illustrated herein in an autonomous machine, in particular in an autonomous vehicle.
Accordingly, a system for detection and management of uncertainty in perception systems is provided, that comprises a data processing unit implementing a segmentation machine comprising a trained segmenting neural network for semantic segmentation of input image matrices. The neural network comprises convolutional layers and is configured to perform a semantic segmentation of an input image pixel matrix. The semantic segmentation is based on multiple object classes the neural network is trained for.
The system further comprises an uncertainty detector that is configured to determine an uncertainty score for each pixel, a group of pixels or a segment. The uncertainty score reflects how (un-)certain it is that a pixel of a segment represents the object the segment is assigned to by the segmenting neural network. A low level of certainty (i.e. a high level of uncertainty) can mean that pixels of a segment - and thus the segment - are wrongly assigned to an object. The reason may be that the segment represents an object (i.e. of a yet unknown object class) the neural network of the segmentation machine was not trained for, i.e. an unknown object or object class, respectively.
The uncertainty detector can be configured to generate an uncertainty score map wherein each pixel of the segment map (and thus, typically, each pixel of the input image pixel matrix) is assigned an uncertainty score value. Preferably, the dimension of the uncertainty score map matches the dimension of the segment map generated by the segmenting neural network. Then, each element of the uncertainty score map directly corresponds to an element of the segment map and - preferably - also to one pixel of the input image pixel matrix.
The uncertainty detector is connected to the segmenting neural network for semantic segmentation of input image matrices. Thus, the system comprises a segmenting neural network that generates a segment map for each input image pixel matrix and an uncertainty detector that generates an uncertainty score for each pixel of an input image pixel matrix and/or for each element of the segment map.
In alternative embodiments, the uncertainty detector can be configured to determine uncertainty from the feature score maps provided by the segmenting neural network for different object classes by way of analyzing the inter-class variance of the activation levels (feature scores) of the pixels of a segment. In particular, the uncertainty detector can be configured to determine a softmax confidence. Alternatively or additionally, the uncertainty detector can be configured to cause variations in varying nodes of the layers of the segmenting neural network. The variations can be achieved by randomly dropping out nodes (Monte Carlo dropout) or by randomly modifying the activation functions of some nodes. An input image pixel matrix is processed by the segmenting neural network multiple times and each time the segmenting neural network is randomly modified, thus providing varying feature score maps. Thus, a number of varying samples of the feature score map are generated. The inter-sample variations of the feature score map samples depend on the influence of the variation of the nodes on the scores in the score maps. Processing an input image pixel matrix multiple times with the segmenting neural network while randomly modifying the segmenting neural network renders the segmenting neural network into a Bayesian neural network implementing variational inference. Preferably, instead of processing the same input image pixel matrix multiple times, multiple frames of a video sequence can be used as input image pixel matrices. Each frame of a video sequence is an input image pixel matrix.
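As an illustrative sketch only, the following PyTorch fragment shows one common way of obtaining such randomly varied passes, namely by keeping the dropout layers active at prediction time (Monte Carlo dropout); the assumption that the segmenting network contains torch.nn.Dropout modules and returns its score maps as a single tensor is made for the sake of the example.

```python
# Sketch: Monte Carlo dropout sampling at prediction time.
import torch

def mc_dropout_samples(model: torch.nn.Module, image: torch.Tensor, n: int = 10):
    model.eval()
    # Re-enable only the dropout layers so each pass randomly drops nodes.
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        # image: (1, C, H, W); each pass yields a (1, num_classes, H, W) score map.
        samples = torch.stack([model(image) for _ in range(n)])
    return samples   # (n, 1, num_classes, H, W); inter-sample variance per pixel follows
```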
If the frame rate of the video sequence is sufficiently high, the inter-frame variations (i.e. the variations from frame to frame) are sufficiently small because the recorded image does not change too much between the frames. For the processing of each individual frame (i.e. each individual image pixel matrix of the video sequence) the segmenting neural network is randomly modified. Alternatively or additionally, the uncertainty detector can comprise a generative neural network, in particular a variational autoencoder, that is trained for the same classes or objects the segmenting neural network is trained for.
The system is further configured to:
- use the segmenting neural network to generate samples of semantically segmented image pixel matrices,
- generate inter-frame variances from a sensor data stream (i.e. a video stream or a video sequence) consisting of frames, i.e. mapping corresponding pixels of consecutive frames and taking each or every nth consecutive frame as a sample instead of sampling each frame multiple times,
- analyze inter-class and/or inter-frame variances in the per-pixel activation levels with respect to individual object classes the segmenting neural network was trained for, i.e. analyze the per-pixel scores for the individual object classes,
- determine uncertainty scores by assessing a level or amount of uncertainty based on the analysis of the inter-class and/or inter-frame variances in the per-pixel activation levels in the semantically segmented image pixel matrix, and
- identify segments showing unknown objects from the determined uncertainty.
Mapping of corresponding pixels of consecutive frames of a video sequence can include a determination of a displacement of corresponding pixels between two frames, e.g. based on position sensor data that can be gathered for instance by means of an inertia sensor. Thus, a movement of a video camera providing a video sequence can be determined, enabling a determination of displacement vectors for the pixels between two frames that shall be used for determining inter-frame variance.
It is preferred if the system is configured to use more than one source of uncertainty at the same time, e.g. the inter-class uncertainty and the inter-frame uncertainty, and to merge these uncertainties, e.g. by means of a weighted sum where the weights depend on the number of samples already taken. For instance, using only the first 2 or 3 frames for determining inter-frame variance might not yield a reliable result yet, so for steps 1...m in the sequence, only the inter-class variance might be used, and for frames m+1...n, the weight on the inter-frame variance can be increased and the weight on the inter-class variance for the weighted sum of both can be decreased. Instead of a weighted sum, a Bayesian filter can be used, e.g. a particle filter, to merge these uncertainties, and instead of a weight, a confidence score in the uncertainty scores for each of both types of uncertainties (inter-class uncertainty and inter-frame uncertainty) can be provided, and the Bayesian filter would yield a confidence score in the resulting uncertainty.
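A simple weighted-sum merge with a weight schedule that depends on the number of frames already sampled could, for illustration, look as follows; the ramp between frames m and n is an assumed schedule, not a prescribed one.

```python
# Sketch: merge inter-class and inter-frame uncertainty maps with a weight
# that grows with the number of frames already sampled (illustrative schedule).
import numpy as np

def merge_uncertainty(u_class: np.ndarray, u_frame: np.ndarray,
                      frames_seen: int, m: int = 3, n: int = 10) -> np.ndarray:
    if frames_seen <= m:            # too few frames: trust inter-class only
        w_frame = 0.0
    else:                           # ramp the inter-frame weight up to 1.0
        w_frame = min(1.0, (frames_seen - m) / float(n - m))
    return (1.0 - w_frame) * u_class + w_frame * u_frame
```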
Preferably, the system further is configured to discriminate different unknown objects through analyzing uncertainty.
In a preferred embodiment, the system further is configured to - create new object classes based on analysis of uncertainty of pixels in segments that are determined as to show unknown objects.
The segmenting neural network preferably is a fully convolutional network (FCN).
The uncertainty can be aleatoric or epistemic (systemic). Aleatoric uncertainty arises from statistical variations in the environment. Aleatoric uncertainty often occurs at the border of segments in a segment map. Epistemic uncertainty (also known as systemic uncertainty) results from a mismatch in a model (for instance a model as implemented by a neural network) for instance because the neural network is not trained for a specific object class that occurs in the environment; cf. Kendall and Gal, https://arxiv.org/abs/1703.04977.
Uncertainty can be determined and quantified by analyzing the amount of variance of activation levels (i.e. scores in the feature score maps) over multiple samples of the feature score maps that are generated by the segmenting neural network from one input image pixel matrix or a number of similar input image pixel matrices. For each segmentation (i.e. each pass) the segmenting neural network is modified to create variance in the segmenting neural network, so the segmenting neural network becomes a Bayesian neural network that implements variational inference. In a Bayesian neural network the input image matrix is processed in a plurality of passes wherein for each pass some nodes of the CNN are dropped out. This technique is known as variational inference. Due to the dropping of nodes in some passes, the activation levels vary from pass to pass. The higher this variance of the scores (i.e. activation levels) in the feature score map is, the higher is the uncertainty. Alternatively, Gaussian noise can be applied to the signals at the input nodes, the weights, or the activation functions for variational inference, see the thesis of Yarin Gal, http://mlq.eng.cam.ac.uk/varin/thesis/thesis.pdf.
In order to improve the efficiency of variational inference, it is possible to apply a drop-out only to some layers or even only to one layer.
Variance of activation levels can also arise from using a sequence of pictures of the same scene taken from different locations or at different points of time. Such sequence can be frames of a video sequence recorded by a video camera of a moving vehicle. This aspect is referred to as "inter-frame variance" as mentioned earlier in this text.
If the neural network is not a Bayesian neural network, uncertainty can be determined by determining the amount of variance or a pattern of variance of the activation levels. Uncertainty can also be determined via a reconstruction error of a generative neural network.
Unknown objects (i.e. objects the neural network is not trained for or the neural network does not recognize) lead to a high level of uncertainty.
Accordingly, if unknown objects are represented by pixels of an input image pixel matrix, these pixels and thus the unknown objects represented by these pixels can be "found" by determining the uncertainty. This can be used to assign segments of the input image pixel matrix to unknown objects and even further to discriminate between different unknown objects in such segment of the input image pixel matrix.
In the following, the level or "amount of uncertainty" of a pixel means a degree of uncertainty that is determined by analyzing the variance of the activation levels (i.e. scores) of the elements of the score maps for the object classes. The elements (i.e. each single element) of the feature score maps (arrays) for the object classes each relate to one or more pixels of the input image matrix. Accordingly, the amount of uncertainty - and thus the uncertainty score - that can be determined from the activation levels (i.e. scores) of the elements of the score maps is also the amount of uncertainty of the image pixel(s) that correspond to the respective elements of the score maps.
Preferred ways of determining uncertainty are:
The uncertainty scores can be determined based on the amount of the variance, a pattern of the variance, the kind of variance (i.e. epistemic or aleatoric), the temporal variation of the variance of the activation levels (i.e. scores of the feature score maps), or a combination of these parameters.
The uncertainty determined from the activation levels can be mapped onto the input image pixel matrix according to the size of the inputs or the size of the segment map. This means that the uncertainty score map holding the uncertainty scores initially has the size of the feature score maps and must be scaled up to the size of the segment map, just as the feature score maps must be scaled up to the size of the input image pixel matrix, by means of bilinear interpolation.
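For illustration, such a bilinear up-scaling of the uncertainty score map to the resolution of the segment map could be written as follows in PyTorch; the tensor shapes are assumptions made for the sake of the example.

```python
# Sketch: scale a low-resolution uncertainty map up to the segment map /
# input image resolution by bilinear interpolation.
import torch
import torch.nn.functional as F

def upscale_uncertainty(u_map: torch.Tensor, height: int, width: int) -> torch.Tensor:
    # u_map: (H', W') uncertainty scores at feature-score-map resolution.
    u = u_map[None, None]                       # add batch and channel dimensions
    u = F.interpolate(u, size=(height, width),
                      mode="bilinear", align_corners=False)
    return u[0, 0]                              # (height, width)
```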
The existence of a segment that represents an unknown object can be determined based on the distribution of the amount of uncertainty (i.e. the distribution of the uncertainty scores in the uncertainty score map) assigned to pixels of the input image matrix wherein the amount of uncertainty (i.e. the uncertainty score of each element/pixel) is determined based on the variance of the activation levels of elements of the feature score maps that relate to the pixels of the input image matrix. An image segment representing an unknown object can be determined by determining a cluster of pixels with a high uncertainty score (i.e. for which a high amount of uncertainty is found) or by determining contiguous pixels having a similar amount of uncertainty (and thus a similar uncertainty score). The amount of uncertainty and thus the uncertainty score of a pixel is determined by determining the variance of the activation levels of those elements of the feature score maps that correspond to the pixel in the input image pixel matrix.
The size of a segment representing an unknown object can be determined from the size of a cluster of pixels with a high uncertainty score or by the length of the outline of a field of contiguous pixels having a similar uncertainty score.
The position of a segment representing an unknown object can be determined from the position of a cluster of pixels having a high uncertainty score or by determining the position of the outline of contiguous pixels exhibiting a similar uncertainty score.
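The size and position of such a cluster can, for example, be derived from a connected-component mask as sketched below; the use of SciPy's labeling helpers is an illustrative choice, not a requirement.

```python
# Sketch: size and position of a high-uncertainty cluster (candidate segment).
import numpy as np
from scipy import ndimage

def cluster_size_and_position(mask: np.ndarray):
    # mask: boolean array marking one cluster of high-uncertainty pixels.
    size = int(mask.sum())                           # area in pixels
    cy, cx = ndimage.center_of_mass(mask)            # centroid (row, column)
    ys, xs = np.nonzero(mask)
    bbox = (xs.min(), ys.min(), xs.max(), ys.max())  # enclosing bounding box
    return size, (cx, cy), bbox
```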
The relative movement of a segment representing an unknown object can be determined from the temporal change of the position of a cluster of pixels having a high uncertainty score or of the position of an outline of contiguous pixels exhibiting a similar uncertainty score. The relative distance between a camera generating the input pixel matrix and an unknown object represented in a segment can be determined from the temporal change of the variance of the level of activations. Thus, a 3D position of the unknown object from a sequence of 2D input image pixel matrices can be computed. Furthermore, the relative distance between a camera generating the input pixel matrix and an unknown object represented in a segment can be determined by an additional depth estimation model.
Preferred ways of checking the plausibility that a segment represents an unknown object
Since the segmenting neural network will assign every pixel to a known object class, unknown objects are not recognized by the segmenting neural network. However, pixels that represent a yet unknown object can be determined by means of analyzing uncertainty.
Ways to make it plausible that a segment of the input image matrix represents an unknown object - and thus ways to find a plausible confirmation that segments of pixels with a high amount of uncertainty represent an unknown object - are:
- The plausibility of a segment representing an unknown object can be derived from the determined outline of the segment and its correlation with the segment prediction, i.e. the semantic segmentation, produced by the neural network. If an entire segment is comprised of pixels with a high uncertainty score (i.e. the majority of pixels within a segment outline), it is likely that the segment represents an unknown object.
- The plausibility of a segment representing an unknown object can be determined from the detected outline of the segment and its correlation with an alternative segmentation of the input image matrix, for instance by means of an auto-encoder.
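The first of these plausibility checks can be sketched as a simple per-segment ratio; the threshold for a "high" uncertainty score is an assumed parameter of the example.

```python
# Sketch: plausibility that a predicted segment represents an unknown object,
# measured as the fraction of its pixels that carry a high uncertainty score.
import numpy as np

def segment_unknown_plausibility(segment_mask: np.ndarray,
                                 uncertainty_map: np.ndarray,
                                 threshold: float) -> float:
    # segment_mask: boolean mask of one segment from the segment map.
    high = uncertainty_map > threshold
    fraction = (high & segment_mask).sum() / max(int(segment_mask.sum()), 1)
    return float(fraction)   # close to 1.0: segment plausibly shows an unknown object
```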
The plausibility of a segment representing an unknown object can be determined from the temporal variation of the variance of the activation levels.
The plausibility of a segment representing an unknown object can be determined by means of a comparison of that segment with another segment representing an unknown object found by a coexisting, redundant subsystem for semantic segmentation. The plausibility of a segment representing an unknown object can be determined based on a comparison with another segment representing an unknown object found by a different system for semantic segmentation that uses an input image matrix representing the same scenery taken from a different perspective. Such comparison includes a transformation of the two segments representing an unknown object in a common, e.g. global coordinate system.
According to another aspect, a system is provided that comprises a primary segmentation system and a separate autonomous device.
The primary segmentation system comprises a primary perception system comprising a segmenting neural network that is configured and trained for segmentation of an input image pixel matrix to thus generate a segment map. Each element of the segment map is assigned to one of a plurality of object classes the segmenting neural network is trained for by way of class prediction. Elements that are assigned to the same object class form a segment of the segment map. The primary segmentation system may be part of an autonomous driving system (ADS) and typically does not comprise an uncertainty detector.
The separate, autonomous device comprises a sensor for generating an input image pixel matrix comprised of matrix elements, a segmenting neural network and an uncertainty detector. The uncertainty detector is configured to generate an uncertainty score map composed of matrix elements that correspond to the pixels of the input image pixel matrix. Each matrix element of the uncertainty map has an uncertainty score that is determined by the uncertainty detector and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map. The segmenting neural network of the autonomous device preferably is trained for the same object classes as the segmenting neural network of the primary system. The autonomous device preferably further comprises signaling means that are configured to generate a user perceivable signal in case the uncertainty map generated by the uncertainty detector of the autonomous device comprises regions comprised of matrix elements that exhibit an uncertainty score above a threshold and thus represent edge cases for the object classification and image segmentation. Edge cases may represent yet unknown objects and thus are candidates for new object classes. The autonomous device can determine whether or not the segmentation achieved by the primary perception system is reliable. In particular, the autonomous device can determine whether the segmentation score map generated by the segmenting neural network of the primary segmentation system contains regions that represent edge cases, for instance objects the primary segmentation system was not trained for.
The autonomous device preferably comprises a new object detector that is operatively connected to the uncertainty detector of the autonomous device. The new object detector is configured to find a region in the uncertainty score map that is composed of elements having a high uncertainty score and to generate a new object class for such a found region. The autonomous device can be configured to exchange labeled data sets comprising segments assigned to a newly generated object class with other autonomous devices to thus increase the number of known object classes available for semantic segmentation by the autonomous devices. The label for a newly generated object class may be generated automatically. The autonomous device may be configured to record a reaction in response to a warning signal emitted by the autonomous device. The type of reaction can be used as input data for discriminating relevant unknown objects from less relevant unknown objects. A significant reaction indicates a high relevance of an unknown object. An absence of a reaction indicates a low relevance of an unknown object. The autonomous device can be configured for learning to discriminate unknown objects by automatically generating new object classes. Furthermore, the autonomous device can be configured for learning a relevance level for each (new) object class and thus can adapt the warning signal level to the object class. If the autonomous device determines the presence of an object (by way of image segmentation) in a region of interest of the input image pixel matrix, the warning signal can be generated in dependence of the relevance of the recognized object. New object classes may be labeled automatically such that the label represents the relevance level. For instance, the relevance level may be used as label for each new object class.
The autonomous device can be configured to exchange data sets comprising data representing the relevance level of a known object class or an observed user reaction (observed behavior) with other autonomous devices. The observed user reaction can be used for automatically generating a label for a new object class. For exchanging data sets with other autonomous devices, the autonomous device may comprise a data interface, in particular a wireless data interface that allows data exchange over e.g. the internet.
The autonomous device preferably is a pocketable, mobile device that a user can easily carry and that can easily be mounted, for instance, to a windshield of a vehicle. In use, the autonomous device preferably is mounted in a position where the autonomous device's viewing angle at least in part corresponds to the viewing angle of the sensor 12 or sensors of the autonomous driving system.
The autonomous device may comprise a segmenting neural network that implements a semantic segmentation model, which is preferably continuously trained with the output of the new object detector as input to thus enable the neural network to predict new objects that have been encountered before.
The invention shall now be further illustrated by way of examples with reference to the figures. In the figures
Fig. 1: is a schematic overview of a perception system providing output for other systems, e.g. a sensor fusion system;
Fig. 2: is a schematic representation of a neural network suitable for semantic segmentation;
Fig. 3: is a schematic illustration of how an object class is assigned to a pixel of the input image pixel matrix depending on the scores in the score maps of the respective object classes
Fig. 4: is a schematic overview of a data processing unit for use in a perception system according to figure 1 including an uncertainty detector according to the invention;
Fig. 5 illustrates training of a neural network based on input data sets comprising images and labels (i.e. desired output, ground truth);
Fig. 6: illustrates prediction of a neural network based on an input data set;
Fig. 7: shows an image as represented by an input data set for the semantic segmentation neural network;
Fig. 8: shows the prediction, i.e. the semantic segmentation provided by the neural network for the image of figure 7;
Fig. 9: shows regions and segments wherein the pixels have a high amount of uncertainty for the image of figure 7;
Fig. 10: is a schematic illustration of a system comprising a perception subsystem, a planning subsystem and a control subsystem;
Fig. 11 : illustrates an unknown object plausibility check based on two independent perception subsystems of one vehicle;
Fig. 12: illustrates an unknown object plausibility check based on two independent perception subsystems of two vehicles;
Fig. 13: shows the correlation of the shape of uncertainty with an unknown object;
Fig. 14: illustrates the determination of a segment with an unknown object;
Fig. 15: illustrates sourcing of a domain adaptation dataset;
Fig. 16: illustrates uploading and integration of a domain adaptation dataset;
Fig. 17: illustrates the determination of classes of unknown objects;
Fig. 18: illustrates training of a neural network with a domain adaptation dataset to generate a new model that can process new object classes;
Fig. 19: illustrates training and download of a model for a neural network;
Fig. 20: illustrates an autonomous device that serves as a safety companion;
Fig. 21: illustrates generation of a training data set for a new object;
Fig. 22: illustrates a training data set for a new object;
Fig. 23: illustrates sharing of a training data set for a new object;
Fig. 24: illustrates training a segmenting neural network with a training data set for a new object;
Fig. 25: illustrates the identification of false positive new objects;
Fig. 26: illustrates a perception system with a secondary system and a primary system for object detection;
Fig. 27: illustrates transfer learning from a segmenting neural network with an unknown structure;
Fig. 28: illustrates the training of a variational autoencoder;
Fig. 29: illustrates uncertainty detection by means of a variational autoencoder;
Fig. 30: illustrates one-shot learning of an uncertainty detector;
Fig. 31: illustrates the use of multiple parallel redundant uncertainty detectors;
Fig. 32: illustrates an alternative device for object recognition;
Fig. 33: illustrates a segmentation model that can be part of an object recognition system;
Fig. 34: illustrates the use case of action recognition;
Fig. 35: illustrates a video action recognition model;
Fig. 36: illustrates a temporal shift module;
Fig. 37: illustrates a depth estimation model;
Fig. 38: illustrates the use case of risk estimation;
Fig. 39: illustrates the use case of action anticipation;
Fig. 40: illustrates an action forecast model;
Fig. 41: illustrates the use case of edge case recognition;
Fig. 42: illustrates a cascaded monitoring concept;
Fig. 43: illustrates a situation monitor;
Fig. 44: illustrates an autoencoder model;
Fig. 45: illustrates a convolutional submodule;
Fig. 46: illustrates a convolutional submodule with dropout;
Fig. 47: illustrates a Bayesian sampling module;
Fig. 48: illustrates a Lipschitz submodule;
Fig. 49: illustrates integration of the situation monitor with a Kalman filter;
Fig. 50: illustrates the use case of similarity prediction;
Fig. 51: is an overview of a multi-functional model;
Fig. 52: illustrates an encoder of a multi-functional model;
Fig. 53: illustrates segmentation by means of a multi-functional model;
Fig. 54: illustrates video action recognition by means of a multi-functional model;
Fig. 55: illustrates depth estimation by means of a multi-functional model;
Fig. 56: illustrates an autoencoder realized with a multi-functional model;
Fig. 57: illustrates action forecast by means of a multi-functional model;
Fig. 58: illustrates how the system can be implemented following the Sense-Plan-Act concept; and
Fig. 59: illustrates a preferred sensor configuration.
A perception system 10 as shown in figure 1 comprises a camera 12 that records images by means of an image sensor (not shown). The recorded images can be still images or sequences of images that form frames of a video stream. The input data stream can also be from a LiDAR (stream of 3D point clouds) or from a radar. The camera 12 generates an image pixel matrix that is fed to a data processing unit 14. The data processing unit 14 implements a segmenting neural network 40 (see figures 2 and 3) for semantic segmentation of images.
The perception system can be integrated in various devices or machines, in particular in vehicles, for instance autonomous vehicles, e.g. cars. One aspect is an integration of a perception system with an autonomous vehicle. In such autonomous vehicle, the system that implements the segmenting neural network is a perception system 10. The output of the perception system 10 is provided to a sensor fusion system 24.
The neural network as implemented by the data processing unit 14 is defined by a structure of layers comprising nodes and connections between the nodes. In particular, the neural network comprises an encoder part formed by convolution layers and pooling layers. The convolutional layers generate output arrays that are called feature maps. The elements of these feature maps (i.e. arrays) represent activation levels that correspond to certain features in the input image pixel matrix. Features generated by one layer are fed to a next convolutional layer generating a further feature map corresponding to more complex features. Eventually, the activation levels of a feature map correspond to objects belonging to a class of objects the neural network was trained for. The effect of the convolution in the convolutional layers is achieved by convoluting the input array with filter kernel arrays having elements that represent weights that are applied in the process of convolution. These weights are generated during training of the neural network for one or more specific object classes. Training of the neural network is done by means of training data sets comprising image data (input image pixel matrix 66, see figure 5) and labels 68 that indicate what is represented by the image data (ground truth). The image data represent the input image matrix 66 and the labels 68 represent the desired output. In a back propagation process, the weights and the filter kernel arrays are iteratively adapted until the difference between the actual output of the CNN and the desired output is minimized. The difference between the actual output (segment map 72) of the CNN and the desired output (labeled input image matrix 66) is calculated by means of a loss function 70.
From a training dataset containing pairs of an input image and a ground truth image that consists of the correct class labels, the neural network 40 computes the class predictions for each pixel in the output image, i.e. the segment map 72. In training, the loss function 70 compares the input class labels 68 with the predictions (i.e. the labels 68' in the segment map 72) made by the neural network 40 and then pushes the parameters - i.e. the weights in the nodes of the layers - of the neural network in a direction that would have resulted in a better prediction. When this is done again and again with lots of pairs of input images and ground truth images, the neural network will learn the abstract concepts of the given classes; cf. figure 5.
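Purely for illustration, one such training iteration could be sketched as follows in PyTorch, with a per-pixel cross-entropy loss standing in for the loss function 70; the tensor shapes and the choice of optimizer are assumptions of the example.

```python
# Sketch of a single supervised training step for the segmenting network:
# the loss compares per-pixel class predictions with the ground-truth labels
# and backpropagation pushes the weights towards a better prediction.
import torch

def training_step(model, optimizer, image, label):
    # image: (1, 3, H, W); label: (1, H, W) with one class index per pixel.
    optimizer.zero_grad()
    logits = model(image)                                   # (1, num_classes, H, W)
    loss = torch.nn.functional.cross_entropy(logits, label)
    loss.backward()                                         # backpropagation
    optimizer.step()                                        # adapt the weights
    return loss.item()
```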
A trained neural network is thus defined by its topology of layers (i.e. the structure of the neural network), by the activation functions of the neural network's nodes and by the weights of the filter kernel arrays and potentially the weights in summing nodes of layers such as fully connected layers (fully connected layers are used in classifiers).
The topology and the activation functions of a neural network - and thus the structure of a neural network - are defined by a structure data set.
The weights that represent the specific model a neural network is trained for are stored in a model data set. Of course, the model data set must fit the structure of the neural network as defined by the structure data set. At least the model data, in particular the weights determined during training, are stored in a file that is called "checkpoint". The model data set and the structure data set are stored in a memory 16 that is part of or is accessible by the data processing unit 14.
The data processing unit 14 is further connected to a data communication interface 18 for exchanging data that control the behavior of the neural network, e.g. model data as stored in a model data set or training data as stored in a training data set used for training the neural network.
For visualizing a segmentation provided by the neural network, a display unit 20 may be connected to the data processing unit 14. However, in an autonomous vehicle, the segmentation will not be shown on a display unit but post-processed into an object list and then provided as input to a sensor fusion system, e.g. a Kalman filter.
Information of the absence or presence and - if present - the position of an image representation of an object in the input image pixel matrix can be derived from the segmented input image pixel matrix. This information can be encoded in data and can be used for control or planning purposes of further system components. Accordingly, an output data interface 22 is provided that is configured to pass on data indicating the absence or presence and the position of a recognized object in the input image pixel matrix. To the output data interface 22, a sensor fusion system 24 can be connected. The sensor fusion system 24 typically is connected to further perception systems not shown in the figure.
The sensor fusion system 24 receives input from various perception systems 10, each within or associated with one particular sensor such as a camera, a radar or a LiDAR. The sensor fusion system 24 is implemented by a Bayesian filter, e.g. an extended Kalman filter, which processes the inputs of the perception systems 10 as measurements. For the extended Kalman filter, these inputs are not the segment maps, but post-processed object lists, derived from the segment maps. The extended Kalman filter will associate each measurement with a measurement uncertainty value, which is usually configured statically, e.g. a sensor uncertainty value, configured according to a model of the particular sensor. In the extended Kalman filter, multiple sources of such uncertainty for a measurement are added and normalized. Here, the uncertainty scores generated by the uncertainty detector 60 (see figure 4) can be easily integrated by adding a model uncertainty to the measurements. This is how an uncertainty detector can be integrated with an autonomous vehicle.
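How the model uncertainty is added to a measurement is implementation specific; as a mere sketch, the measurement noise covariance of a Kalman-type filter could be inflated by a term derived from the uncertainty scores of the object's segment, as follows (all names and the scaling factor are assumptions of the example, not part of the description above).

```python
# Sketch: add a model-uncertainty term, derived from the per-pixel uncertainty
# scores of an object's segment, to the measurement noise of a Kalman filter.
import numpy as np

def inflate_measurement_noise(r_sensor: np.ndarray,
                              segment_uncertainty: np.ndarray,
                              scale: float = 1.0) -> np.ndarray:
    # r_sensor: static sensor noise covariance for one measurement (k x k).
    # segment_uncertainty: uncertainty scores of the pixels of the object segment.
    model_uncertainty = float(segment_uncertainty.mean()) * scale
    return r_sensor + model_uncertainty * np.eye(r_sensor.shape[0])
```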
Image segmentation

The trained segmenting neural network 40 has an encoder-decoder structure as schematically shown in figure 2. The encoder part 42 is a fully convolutional network (FCN), for example a ResNet 101. Alternative implementations could be VGG-16 or VGG-19.
The decoder part 44 can be implemented as a DeepLab aspp (atrous spatial pyramid pooling).
The segmenting neural network 40 is a fully convolutional network composed of convolutional layers 46, pooling layers 48, feature maps 50, score maps 52 (i.e. up-sampled feature maps) and one segment map 54.
In the example in figure 2, the segmenting neural network 40 is configured and trained for three object classes. Accordingly, on each level, three convolutional layers 46 are provided that each generate a feature map for one object class. Each convolutional layer implements ReLU activation functions in the nodes. Pooling layers 48 reduce the size of the feature map to thus generate an input for a next convolutional layer 46. Feature maps 50 of the trained segmenting neural network each reflect the likelihood that an object of one object class is represented by according elements of the input image pixel matrix. Typically, the feature maps have a smaller dimension than the input image pixel matrix. In the decoder part 44 of the segmenting neural network the feature maps are up-sampled and up-sampled feature maps 52 (herein also called score maps) are generated wherein the elements have score values that reflect the likelihood that a certain pixel represents an object of the object class the segmenting neural network was trained for. For segmenting the input image pixel matrix each element (i.e. each pixel) is assigned to one of the object classes the segmenting neural network was trained for. This can be done by using the argmax function (a maximum likelihood estimator) that compares the scores of corresponding elements in the feature score maps for each object class. A pixel is assigned to that object class showing the highest score for that pixel or element, respectively. Thus, one segment map 54 is generated from three feature score maps 52.

According to a preferred embodiment, the ReLU activation function applied is a customized ReLU function that allows introducing random dropping of nodes so the segmenting neural network 40 can act as a Bayesian neural network. Figure 3 illustrates that the feature maps 50 and 52 for each object class A, B or C are acting in parallel (and not in series as figure 2 might suggest). If, for instance, the neural network is trained for three objects, three score maps are generated. By means of a softmax function, the scores in each feature score map are normalized on a scale between 0 and 1. The normalized scores for corresponding elements of the feature maps can be compared and the element can be assigned to that object class that corresponds to the feature score map having the highest score for the respective element. This is done by means of a maximum likelihood estimator using an argmax function.
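For illustration, the per-pixel class assignment from the score maps can be sketched in a few lines of PyTorch; the shape of the score map tensor is an assumption of the example.

```python
# Sketch: per-pixel class assignment from the per-class feature score maps.
import torch
import torch.nn.functional as F

def segment_map_from_scores(score_maps: torch.Tensor) -> torch.Tensor:
    # score_maps: (num_classes, H, W) up-sampled feature score maps.
    probs = F.softmax(score_maps, dim=0)    # normalize scores to [0, 1] per pixel
    return probs.argmax(dim=0)              # maximum likelihood class per pixel
```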
In figure 3, segment 56 is a segment where the activation levels of the array elements are higher than the activation levels of corresponding array elements of the score maps for object classes A and B. Segment 58 is a segment where the activation levels of the array elements are higher than the activation levels of corresponding array elements of the score maps for object classes B and C. Please note that each pixel in the segment map 54 is assigned to one object class of all the object classes the segmenting neural network 40 is trained for. Accordingly, there are no non-assigned elements or pixels. Figure 3 might be misleading in that respect because figure 3 does not show that all pixels are assigned to a known object class and thus are part of a segment.
Figure 6 illustrates the generation of a segment map 72 from an input image 66 by means of a prediction performed by the segmenting neural network 40. The segmenting neural network 40 is part of the data processing unit 14.
Figures 7 and 8 illustrate that all pixels in an input image are assigned to known objects. In case the input image 66' - and thus the input image pixel matrix 66' - comprises unknown objects like boxes 90 on the street, these unknown objects 90 are assigned to known objects such as street markings 92.
Uncertainty detection

For planning and control purposes, it is helpful to have an indication regarding the reliability of the data provided by the output data interface 22.
Therefore, it is an object to determine a level (or amount) of uncertainty involved in the data, i.e. in the semantic image segmentation as provided by the neural network. It is a further object to recognize image segments or parts of image segments that cannot reliably be assigned to an object. According to one concept of the invention, image regions that cannot reliably be assigned to an object can be recognized by way of determined uncertainty. Uncertainty means model uncertainty from Bayesian statistics; cf. Kendall and Gal, https://arxiv.org/abs/1703.04977, and one of their sources, https://www.repository.cam.ac.uk/bitstream/handle/1810/248538/Ghahramani%202015%20Nature.pdf?sequence=1.
In a first step, uncertainty scores on the level of pixels (per pixel) of the input image pixel matrix are created or generated. This can be achieved by provoking or determining forms of variance in the activation levels (scores) of the individual pixels with respect to the object classes.
The activation levels (i.e. the score) of the elements in the feature score maps for the individual object classes the neural network is trained for may vary, if frames of a video sequence are regarded or if the semantic segmentation is repetitively performed in multiple passes where varying nodes are dropped. The variance can be temporal - i.e. between different passes or between different frames of a video sequence - or spatial, i.e. within an image pixel matrix, for instance at the edges of segments.
The variance can be achieved by setting the activation to zero with a random probability. A dropout operation is inserted into the layers to which dropout shall be applied, so the inner architecture of the layer looks like convolution/dropout/non-linearity.
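As an illustrative PyTorch sketch, such a layer with the inner architecture convolution/dropout/non-linearity could be composed as follows; kernel size and dropout probability are assumed example values.

```python
# Sketch: a convolutional block with an inserted dropout operation, so the
# inner architecture reads convolution / dropout / non-linearity.
import torch.nn as nn

def conv_dropout_relu(in_ch: int, out_ch: int, p: float = 0.2) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.Dropout2d(p),      # randomly sets activations to zero with probability p
        nn.ReLU(inplace=True),
    )
```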
Specifically, amounts of uncertainty can be determined and quantified by analyzing the amount of variance of activation levels (i.e. scores in the feature score maps) over multiple samples. In a Bayesian neural network (i.e. for instance a convolutional neural network that is randomly modified, e.g. by randomly dropping nodes) the input image matrix is processed in a plurality of passes wherein for each pass some nodes of the convolutional neural network are dropped out. This technique is known as variational inference. Due to the dropping of nodes in some passes, the activation levels vary from pass to pass, resulting in an inter-sample per-pixel variance. The higher this variance is, the higher is the amount of uncertainty and thus the uncertainty score. Inter-sample variance of activation levels can also arise from using a sequence of pictures of the same scene taken at different points of time. Such a sequence can be frames of a video sequence recorded by a video camera of a moving vehicle. If the neural network is not a Bayesian neural network, per-pixel uncertainty can be determined by determining the amount of inter-class variance or a pattern of variance of the activation levels (spatial variance).
Uncertainty can also be determined via a reconstruction error of a generative neural network, for instance by means of a variational autoencoder.
Unknown objects (i.e. objects the neural network is not trained for or the neural network does not recognize) lead to a high level of uncertainty and thus to high uncertainty scores of the pixel of a segment representing an unknown object.
The uncertainty map represents the uncertainty score for each pixel. The uncertainty map is created by analyzing the activation levels of the matrix elements (corresponding to the image pixels) in the feature score maps for the different object classes the segmenting neural network is trained for. The uncertainty score map can be created from analyzing the variance of the activation levels over a plurality of passes when applying e.g. Monte Carlo dropout to the segmenting neural network to thus cause variance between the samples generated with each pass (inter-sample variance). For tracking the variance over the passes and generating an uncertainty map, an uncertainty detector may be provided.
Accordingly, if unknown objects are represented by pixels of an input image pixel matrix, these pixels and thus the unknown objects represented by these pixels can be "found" by determining the uncertainty scores. This can be used to assign segments of the input image pixel matrix to unknown objects and even further to discriminate between different unknown objects in such segment of the input image pixel matrix.
The basic steps the system according to the invention performs are:
- using a segmenting neural network to generate samples of semantically segmented image pixel matrices,
- analyzing inter-class and/or inter-frame variances in the per-pixel activation levels with respect to individual object classes the neural network was trained for, i.e. analyzing the per-pixel scores for the individual object classes,
- determining uncertainty by assessing a level or amount of uncertainty based on the analysis of the inter-class and/or inter-frame variances in the per-pixel activation levels in the semantically segmented image pixel matrix, and
- identifying segments showing unknown objects from the determined uncertainty.
Preferably, the system further discriminates different unknown objects through analyzing uncertainty and optionally creates new object classes based on analysis of the uncertainty of pixels in segments that are determined to show unknown objects.
For performing uncertainty detection the data processing unit 14 comprises an uncertainty detector 60, see figure 4.
The uncertainty detector 60 can be configured to determine uncertainty from the feature score map 52 provided by the segmenting neural network 40 by way of analyzing the inter-class variance of the values of the activation levels before the maximum likelihood estimator, i.e. before the argmax function is applied and the segment map 54 is generated. Before the maximum likelihood estimator, i.e. before the argmax function is applied and the segment map 54 is generated, each pixel has activation levels for each known class. By way of the argmax function, the pixel typically is assigned to the class with the highest activation level. Prior to applying the argmax function, inter-class variances for each pixel can be determined. Alternatively or additionally, the uncertainty detector 60 can comprise a generative neural network 62 implementing a generative model that is based on the same training dataset as the segmenting neural network 40 to reproduce the input image. The generative neural network 62 preferably is a variational autoencoder. The per-pixel reconstruction loss between the input image and the image reconstructed by the generative neural network 62 corresponds to the uncertainty. A higher reconstruction loss reflects a higher amount of uncertainty.
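A sketch of the reconstruction-loss based uncertainty score is given below; it assumes, for the sake of the example, that the generative neural network returns a reconstruction with the same shape as the input image and uses a per-pixel squared error, which is only one possible reconstruction loss.

```python
# Sketch: per-pixel uncertainty from the reconstruction error of a generative
# model (e.g. a variational autoencoder trained on the same dataset).
import torch

def reconstruction_uncertainty(autoencoder, image: torch.Tensor) -> torch.Tensor:
    # image: (1, 3, H, W); assumed: the autoencoder returns a tensor of the same shape.
    with torch.no_grad():
        reconstruction = autoencoder(image)
    # Mean squared error over the colour channels -> one score per pixel.
    return ((image - reconstruction) ** 2).mean(dim=1)[0]   # (H, W)
```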
Preferably, the uncertainty detector 60 is configured to implement a Bayesian neural network as means for generating inter-sample variance and thus an uncertainty score. In this embodiment, the uncertainty detector 60 is preferably configured to apply a customized ReLU function that allows introducing random dropping of nodes so the segmenting neural network 40 can act as a Bayesian neural network. Uncertainty can thus be determined by means of repeating the prediction several times (sampling) under variation of the segmenting neural network 40 (e.g. dropout of nodes) or insertion of noise into the weights or activation functions. The activation levels in the score maps (after the softmax function but before the maximum likelihood estimator (argmax)) over all the samples will then have a high inter-sample variance for the parts of the input which do not correspond to the training data and a low variance for the parts of the input which correspond well to the training data. A tensor library 64 comprises the customized ReLU used for rendering the segmenting neural network 40 into a Bayesian neural network.
When in use, the data processing unit 14 receives input image pixel matrices 66 as an input. The data processing unit 14 generates a segment map 72 and an uncertainty score map 74 as outputs.
Figure 9 illustrates an uncertainty score map 74 generated from the input image shown in figure 7. Figure 8 illustrates the segment map generated from the input picture shown in figure 7. As can be taken from figure 9, the unknown objects 90 (i.e. boxes on the street) are represented by segments wherein the pixels exhibit a high amount of uncertainty - and thus bear a high uncertainty score, reflected by a lighter color in figure 9. In the segment map 72 as shown in figure 8, the pixels that represent the unknown objects (boxes on the street) are assigned to various known objects. While this is correct in terms of the function of the segmenting neural network, from a user's perspective the segmenting result is wrong as far as the unknown objects are concerned. Such "wrong" assignment can be detected by means of the uncertainty score of the pixels in the segments (prior to assigning the pixels to the known object classes, e.g. prior to applying the function for classifying each pixel).

Detailed description of the determination of the amount of variance
The amount of variance can be determined by provoking and/or analyzing forms of variance in the activation levels (scores) of the individual elements of the feature score maps for the object classes the segmenting neural network is trained for.
The amount of variance can be determined between different samples of a feature score map for one object class, thus determining an inter-sample variance. The amount of variance can be determined between feature score maps for different object classes, thus determining an inter-class variance.
These approaches can be combined.
Detailed description of the determination of the amount of uncertainty (uncertainty scores) - based on the variance as disclosed above
For corresponding elements of the score maps (i.e. matrix of activation levels for each object class)
The uncertainty detector can be configured to determine prediction uncertainty in various ways. The per-pixel uncertainty scores that indicate prediction uncertainty can for instance be determined:
- by means of the values of the activation levels before the maximum likelihood estimator, where the uncertainty detector compares the relative activation levels of respective pixels in the feature score maps, e.g. by means of a class-agnostic threshold on the activation levels, by means of class-specific thresholds on the activation levels, or by respective thresholds on the difference between the activation levels of classes, e.g. the top-2 activation levels (sketched below), where a higher difference would indicate lower uncertainty and vice versa;
- by means of training a generative model on the same training dataset, e.g. a variational autoencoder, to reproduce the input image and take advantage of the fact that the variational autoencoder will fail to reproduce unknown object classes. By computing the reconstruction loss in prediction mode as one would do in training mode, a pixel-wise (per-pixel) uncertainty score is determined, where a higher reconstruction loss would indicate higher uncertainty;
- by means of implementing a Bayesian neural network by variational inference, e.g. realized by the application of Monte Carlo dropout in training mode and in prediction mode of the segmenting neural network in combination with sampling in prediction mode of the segmenting neural network, where the uncertainty detector takes advantage of the fact that unknown object classes (i.e. unknown objects in the input image pixel matrix) will result in higher pixel-wise inter-sample variance between the samples, and higher variance would indicate higher uncertainty and vice versa, or e.g. realized by the application of Gaussian noise to the values of the weights or the activation functions with the same effect as Monte Carlo dropout.
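The top-2 margin mentioned in the first item can be sketched as follows; the normalization into an uncertainty score is an illustrative choice, not a prescribed one.

```python
# Sketch: per-pixel uncertainty from the difference between the two highest
# class activation levels (a small margin indicates high uncertainty).
import torch

def top2_margin_uncertainty(score_maps: torch.Tensor) -> torch.Tensor:
    # score_maps: (num_classes, H, W) activation levels before the argmax.
    top2 = score_maps.topk(k=2, dim=0).values      # two highest scores per pixel
    margin = top2[0] - top2[1]
    return 1.0 - margin / (margin.max() + 1e-8)    # illustrative normalization
```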
Detailed description of how to assign an amount of uncertainty (uncertainty level) to each pixel of the input image pixel matrix
The uncertainty detector can take e.g. the values of the pixel-wise activation levels in the feature score maps, the reconstruction loss values of the pixel-wise reconstruction loss of a variational autoencoder, or the values of the pixel-wise variance in variational inference, or a combination of these, for determining uncertainty scores for the elements of the uncertainty score map. Lower activation levels indicate higher uncertainty scores. A lower difference between the activation levels of the top-2 classes or a higher variance over all object classes indicates higher uncertainty scores. A higher variance between the samples in variational inference indicates higher uncertainty scores. In the case of variational inference, there are variances per pixel per object class, which the uncertainty detector aggregates into one variance per pixel, e.g. by taking the sum or by taking the mean.
Detailed description of determining segments (of the input image pixel matrix) that represent unknown objects based on the uncertainty score of each pixel
Determining segments that represent an object of an unknown object class is done to determine whether the perception system 10 operates within or without its operation domain. The domain is defined by the object classes, the segmenting neural network 40 is trained for.
Determining segments that represent an object of an unknown object class thus is part of an out-of-domain detection. In figures 10 to 13, the out-of-domain detection is simply marked as domain detection. The detection of segments representing objects of a yet unknown, new object class can be used for creating models representing the new object class and thus for domain adaptation.
Just like the predictions of the segmenting neural network - i.e. the feature score maps reflecting the activation levels of the elements for each object class and the segment map derived therefrom - the uncertainty, derived from the variance of the activation levels, is a pixel image, so for each pixel in the segment map exactly one corresponding uncertainty score is found in the uncertainty score map. If the uncertainty detector comprises a Bayesian neural network as means for detecting inter-sample uncertainty and thus uncertainty scores for the pixels or elements, respectively, aleatoric uncertainty will show at the edges of segments (see figure 9, the light colored "frame" around a segment). The origin of aleatoric uncertainty is due to the fact that in the labelling process, the segments in the training dataset have been labelled manually or semi-automatically by humans, who will label the edges of objects sometimes more narrow and sometimes more wide. The neural network will learn this variety of labelling styles, and sometimes predict the edges of segments more narrow and sometimes more wide. This also happens when the samples for variational inference are taken in prediction, leading to uncertainty at the edges of segments. The uncertainty detector preferably is configured to match the regions in the uncertainty map in which aleatoric uncertainty occurs to the edges of the predicted segments in the segment map by means of standard computer vision techniques for edge detection, corner detection, or region detection (aka area detection). Where the aleatoric uncertainty matches the edges of predicted segments, a correct prediction of the segment is indicated. If aleatoric uncertainty is missing at the edge of a predicted segment in the segment map or if aleatoric uncertainty exists within a segment, an incorrect prediction is indicated.
If the uncertainty detector comprises a Bayesian neural network as means for detecting inter-sample uncertainty and thus for generating uncertainty scores for the pixels or elements, respectively, epistemic uncertainty will show inside of segments (see figure 9, the light colored segments 90). The other approaches for detecting uncertainty likewise show uncertainty inside of segments. Epistemic uncertainty directly indicates an incorrect segment prediction, i.e. an incorrect segmentation in the segment map. The uncertainty detector is configured to determine regions of uncertainty e.g. by means of standard computer vision techniques for edge detection, corner detection, or region detection (aka area detection). Besides the fact that uncertainty is indicated by pixels in the uncertainty score map and that these pixels gather at the edges of segments of the segment map or within uncertain segments of the segment map, the uncertainty values of these per-pixel uncertainties also vary within segments. This kind of uncertainty may extend into other segments of the segment map and the corresponding regions of the uncertainty score map where the segment predictions have been correct (false positive), or may not entirely fill a segment where the segment predictions have been incorrect (false negative). To capture these false positives and false negatives of the uncertainty scores themselves, the uncertainty detector can comprise a small classifying neural network, i.e. a classifier, that is trained with the uncertainty score map in combination with a labelled training dataset. For this training, object classes are used that do not belong to the original training domain (as defined by the object classes the segmenting neural network is originally trained for) of the original segmenting neural network. By means of this classifier, the uncertainty detector optimizes the matching of the determined regions of uncertainty of the uncertainty score map, where the elements of the uncertainty score map have high uncertainty scores, to the real shapes of the incorrectly predicted segments in the segment map. Thus, the uncertainty score map is segmented in accordance with the segment map. The uncertainty detector plausibilizes the regions determined by this classifier with the regions determined by means of the standard computer vision techniques.
Fig. 14 illustrates the prediction by a neural network and the measurement of prediction uncertainty. From the input image on the left side, the segmenting neural network computes the class predictions for each pixel on the right side and the uncertainty detector computes uncertainty value predictions for each pixel. Fig. 13 shows the correlation of the shape of uncertainty with an unknown object.
The uncertainty detector can be implemented in different ways. For a Bayesian neural network, prediction uncertainty can be determined by means of repeating the prediction several times (sampling) under variation of the segmenting neural network (e.g. dropout of nodes) or insertion of noise into the weights or activation functions. The activation levels in the score maps (after the softmax function but before the maximum likelihood estimator (argmax)) over all the samples will then have a high inter-sample variance for the parts of the input which do not correspond to the training data and a low variance for the parts of the input which correspond well to the training data. This embodiment requires that the uncertainty detector interacts with the segmenting neural network and randomly modifies the segmenting neural network and in particular at least some nodes of the convolutional layers of the segmenting neural network. The uncertainty detector can for instance be configured to modify at least some of the ReLU activation functions of the convolutional layers to thus cause dropping of nodes or to modify values of weights in some nodes.
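A minimal Monte-Carlo-dropout sketch of this sampling approach in PyTorch is shown below, assuming the segmenting network contains dropout layers; all names are assumptions and the number of samples is illustrative.

import torch

def mc_dropout_predict(model, image, num_samples=20):
    # image: tensor of shape (1, 3, H, W)
    model.eval()
    # re-enable only the dropout layers so that e.g. normalization layers stay in eval mode
    for m in model.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()
    with torch.no_grad():
        samples = torch.stack([torch.softmax(model(image), dim=1)
                               for _ in range(num_samples)])   # (S, 1, C, H, W)
    # high inter-sample variance indicates parts of the input not covered by the training data
    return samples.mean(dim=0), samples.var(dim=0)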
For neural networks which are not Bayesian, the uncertainty values can be determined from only one sample by way of determining the inter-class variance between the softmax scores over the score maps (every pixel has a softmax score in every score map). This inter-class variance can be determined e.g. as the difference between the top-2 softmax scores, i.e. the difference between the largest softmax score and the second largest softmax score, or as the variance between the softmax scores over all object classes. Instead of the variance, a threshold could be applied. This embodiment does not require that the uncertainty detector modifies the segmenting neural network.
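The single-sample variant can be sketched as follows: both the top-2 softmax margin and the variance of the softmax scores over all classes are computed per pixel from one forward pass; the tensor layout (classes, height, width) is an assumption.

import torch

def single_sample_uncertainty(logits):
    # logits: tensor of shape (C, H, W) from one prediction of the segmenting network
    probs = torch.softmax(logits, dim=0)
    top2 = probs.topk(2, dim=0).values        # the two largest softmax scores per pixel
    margin = top2[0] - top2[1]                # a small margin indicates high uncertainty
    inter_class_var = probs.var(dim=0)        # variance of the softmax scores over all classes
    return margin, inter_class_var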
As a third alternative embodiment, the uncertainty detector can comprise a generative neural network, e.g. a variational autoencoder, to recreate the input image and measure the uncertainty by exactly the same means as described above or by determining a reconstruction loss. The generative neural network, e.g. a variational autoencoder, implements the same model (i.e. is trained for the same objects or classes) as the segmenting neural network.
Preferably, the uncertainty detector is configured to implement a Bayesian neural network as means for generating inter-sample uncertainty. However, the process of sampling consumes a lot of time as a statistically relevant number of samples is needed for a reliable result. Instead of sampling each instance of the data sequence received from the sensor many times as would be done in the simplest case, the uncertainty detector preferably is configured to compute the inter-sample variance over subsequent instances (for instance subsequent frames of a video sequence), while sampling each instance only one or a few times. If the instances are input image pixel matrices corresponding to frames of a video sequence recorded by a vehicle, this is possible because the pixels of these instances correspond to each other within a variable amount of shift, rotation, and scale due to the movement of the vehicle. The uncertainty detector is configured to match single pixels or regions of pixels of one instance to corresponding pixels or regions of pixels of subsequent instances to determine the inter-sample variance between the feature score values of these pixels.
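One way to realize the matching of corresponding pixels between subsequent frames is to warp the softmax map of the previous frame onto the current frame with dense optical flow and to compute the variance between the warped and the current scores. The sketch below uses OpenCV's Farnebäck flow; the flow parameters and the backward-warping convention are illustrative assumptions.

import cv2
import numpy as np

def warp_prev_softmax(prev_softmax, prev_gray, cur_gray):
    # prev_softmax: (H, W, C) scores of the previous frame; prev_gray, cur_gray: (H, W) uint8 frames
    flow = cv2.calcOpticalFlowFarneback(cur_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)   # flow from current to previous frame
    h, w = cur_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return np.stack([cv2.remap(prev_softmax[..., c].astype(np.float32),
                               map_x, map_y, cv2.INTER_LINEAR)
                     for c in range(prev_softmax.shape[-1])], axis=-1)

def temporal_variance(cur_softmax, warped_prev):
    stack = np.stack([cur_softmax, warped_prev])   # two "samples" per pixel from consecutive frames
    return stack.var(axis=0).mean(axis=-1)         # one uncertainty score per pixel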
Regions in an uncertainty score map generated by the uncertainty detector that exhibit elements with a high uncertainty score are candidates for segments that represent an unknown object, i.e. not the object the segment is assigned to in the segment map and thus not an object belonging to an object class the segmenting neural network is trained for.
Detailed description of confirming the plausibility that a segment determined as representing an unknown object indeed represents an unknown object
The uncertainty detector preferably is configured to further plausibilize a segment determined as representing an unknown object by means of communication with other vehicles in the vicinity of a scene. The uncertainty detector will map the scene to a global coordinate system, which is e.g. the coordinate system of the HD map used by the vehicle for localization. Thus, the locations of objects in the global coordinate system are detected that correspond to segments in the segment map corresponding to known or to unknown objects. If segment maps generated from input image pixel matrices recorded from different locations are compared, it is possible to compare segments that are candidates for representing an unknown object. Input image pixel matrices from different locations can originate for instance from two different cameras of one vehicle or from the cameras of two different vehicles, see figures 11, 12 and 13. The uncertainty detector of a first vehicle will send the coordinates of the unknown object to other vehicles in the vicinity of the scene, so another uncertainty detector within a receiving vehicle can match the segment received to the corresponding segment of the first vehicle. A match will increase the probability that the object is indeed of an unknown object class. A missing match will decrease this probability. For the case where the vehicle uses the uncertainty detector output for initiating a minimal risk maneuver, the uncertainty detector uses this means of plausibilization only if the potential unknown object is sufficiently far away so that the entire procedure of vehicle-to-vehicle plausibilization takes no more time than the fault-tolerant-time-interval until it would be too late to initiate the minimal risk maneuver. For the case of recording of the potential unknown object in a dataset for domain adaptation, this means of plausibilization can always be applied. To determine the fault-tolerant-time-interval, the 3D position of the potential unknown object is required, which the uncertainty detector determines e.g. by means of monocular depth estimation, or by inference from motion.
As a second method of vehicle-to-vehicle plausibilization, the uncertainty detector within one vehicle will send information identifying the segment of the unknown object to other vehicles which do not have to be in the vicinity of the scene. The uncertainty detector in the receiving vehicle will patch the received segment into an unrelated scene observed by the other vehicle and compute the uncertainty. If the segment has high uncertainty in the unrelated scene, the uncertainty detector increases a probability value indicating the probability that the object is indeed of an unknown object class. If the identified segment has low uncertainty in the unrelated scene, the uncertainty detector will decrease the probability value reflecting the probability that the object is indeed of an unknown object class and instead assign the object class determined with low uncertainty to the object. In the case where this computation would take too much time, the uncertainty detector will perform this computation only locally within one vehicle, and the unrelated scene will be from an unrelated dataset, allowing for the further plausibilization with respect to the labels provided with this dataset.
Detailed description of automatic generation of new object classes based on the analysis of segments representing yet unknown objects
The uncertainty detector preferably will create a dataset of new object classes from the unknown objects identified by the uncertainty detector. The dataset will be comprised of the sensor input data - i.e. an input image pixel matrix, corresponding to the unknown object, together with a matrix of plausibilized uncertain pixels (in case of video input) or points (in case of LiDAR input) as labels, see figure 14. However, from all the instances of yet unknown objects together with their labels created by the uncertainty detector, which identify their position but not their real object class yet, a set of real object classes needs to be derived. Preferably, the uncertainty detector is configured to group instances of yet unknown objects into candidate object classes by means of unsupervised segmentation. Preferably, the uncertainty detector is also configured to determine possible candidate object classes by means of one-shot learning where a visual input is mapped to a feature vector where each feature has an intrinsic meaning and an intrinsic relationship to other features, with features being e.g. descriptions in natural language; see figure 18.
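Grouping the recorded instances of yet unknown objects into candidate object classes can for instance be sketched as a clustering of feature embeddings of the object crops; the embedding source and the number of clusters are illustrative assumptions, not the method prescribed above.

import numpy as np
from sklearn.cluster import KMeans

def group_into_candidate_classes(embeddings, n_candidates=5):
    # embeddings: array of shape (num_instances, dim), one feature vector per unknown-object crop
    kmeans = KMeans(n_clusters=n_candidates, n_init=10, random_state=0)
    labels = kmeans.fit_predict(embeddings)       # each cluster id becomes a candidate object class
    return labels, kmeans.cluster_centers_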
Detailed description of a vehicle control system comprising a video camera, a semantic segmentation system connected to the video camera, a vehicle-to-vehicle (V2V) communication means allowing vehicle-to-vehicle communication for exchanging object class definitions/representations
The identification of yet unknown objects (out-of-domain detection) will be performed by the uncertainty detector on the device, e.g. a sensor where our technology is integrated with the neural network performing a computer vision task. There, the uncertainty detector will record the sensor input data corresponding to the unknown object, together with the matrix of plausibilized uncertain pixels or points as labels. For vehicle-to-vehicle plausibilization, the device implementing the uncertainty detector needs vehicle-to-vehicle connectivity, provided by other in-vehicle systems. The creation of the dataset of new object classes can be performed either by the uncertainty detector on the device or in the cloud. Therefore, the device needs vehicle-to-infrastructure communication, see figure 13.
Detailed description of domain adaptation
Domain adaptation serves for extending the operation domain of the perception system. Domain adaptation preferably includes out-of-domain detection, for instance by way of identification of objects belonging to a yet unknown object class. Domain adaptation preferably includes an adaptation or creation of a segmenting neural network so objects of one or more new object classes can be predicted by the segmenting neural network.
Figures 15 to 19 illustrate domain adaptation and in particular the process of enabling a segmenting neural network to predict objects of a new object class. Figures 18 and 19 illustrate a continuous domain adaptation. Domain adaptation can be performed either by the uncertainty detector on the device or in the cloud, see figures 16 and 17. This can happen by means of various, well-known techniques, in the simplest case by means of re-training the trained segmenting neural network with the newly determined object classes from the dataset of new object classes. In this case, re-training of the segmenting neural network means updating the file containing the values of the weights in a persistent memory. During re-training, this file is updated, and the updated file can be loaded by the device, e.g. in the next on-off-cycle of the device. If re-training happens in the cloud, the updated file will be downloaded from the cloud to the persistent memory of the device by the software-update function of the device.
The operational design domain is the domain in which the system and in particular the perception subsystem can operate reliably. The operational design domain is inter alia defined by the object classes the perception subsystem can discriminate.
Domain adaptation means that for instance the perception subsystem is newly configured or updated to be capable of discriminating further object classes that occur in the environment the perception subsystem is used in. Data configuring the neural network - i.e. data defining the weights and the activation functions of the nodes in the layers and of the filter kernel arrays - defines a model that is represented by the neural network. "Downloading a model" thus means downloading configuration data for the neural network so as to define a new model, e.g. a model that is capable of discriminating more object classes.
Updating a model with configuration data from another model
When the program instantiates a neural network, this neural network is usually uninitialized. There are then two modes of operation. When the neural network shall be trained, the program will instantiate a neural network model from a software library, and its weights will here be initialized with random values, and the training procedure will gradually adapt the values of these weights. When the training is over, the values of the weights of the neural network will be saved in a file that corresponds to the architecture (structure, topology) of the neural network and to a generic storage file format, depending on the software library used for implementing the neural network, such as TensorFlow (by Google), PyTorch (by Facebook), Apache MXNet, or the ONNX format (Open Neural Network Exchange). The established term for this file is “checkpoint”. The checkpoint file represents the model the neural network is trained for. When the neural network shall then be used for prediction, the program will again instantiate a neural network model from that software library, but for prediction, its weights will not be initialized with random values, but with the stored values for the weights from the checkpoint file. This neural network can then be used for prediction. The checkpoint file always comprises data which will be loaded by the program on every start of an on-off-cycle of the system.
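As a concrete illustration of the checkpoint concept, the usual PyTorch handling is sketched below; the exact file layout differs between the libraries mentioned above, and the function names are assumptions.

import torch

def save_checkpoint(model, path):
    torch.save(model.state_dict(), path)             # stores the weights, tied to the architecture

def load_for_prediction(model, path):
    # the program instantiates the same architecture and initializes it from the checkpoint
    model.load_state_dict(torch.load(path, map_location="cpu"))
    model.eval()                                     # prediction mode, no weight updates
    return model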
The checkpoint file comprises data configuring the neural network - i.e. data defining the weights and the activation functions of the nodes in the layers and of the filter kernel arrays - and thus defines the model that is represented by the neural network. "Downloading a model" thus means downloading the checkpoint file containing configuration data for the neural network so as to define a new model, e.g. a model that is capable of discriminating more object classes.
When the neural network is adapted, a program instantiates a neural network model, which - this time - is not initialized with random data for training, but with the values of weights from the checkpoint file. Then the training procedure is started with the new domain adaptation dataset. Accordingly, this training does not start from scratch but from a pre-trained state. There are now two modes of adaptation. If the new object classes from the domain adaptation dataset are not similar to the object classes that the segmenting neural network already knows, space has to be provided for new object classes to be learned, which means that the architecture of the neural network must be changed to provide as many additional feature score maps (and hence additional convolutional kernels with their weights) for the last layer as are expected to be necessary to learn the new object classes. In this case, only the weights that were present in the architecture before can be initialized from the checkpoint, and the new convolutional kernels will be initialized with random data. If, on the other hand, the new object classes are so similar to the object classes already known that it can be expected that they will generalize to the object classes already known, there is no need to change the architecture of the neural network. For training, in both cases, the values of the weights of most layers in the neural network would be frozen (i.e. not amended) and only the last few layers would be trained for the adaptation. To update the system with the new segmenting neural network, a new checkpoint file is provided. In most cases, there will be no need for updating the software library because most libraries can instantiate a neural network model in a variable way, e.g. regarding the number of object classes. A software library stores these parameters within the checkpoint file, so there is no change of the program required.
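The two adaptation modes can be sketched as follows for a segmentation head realized as a 1x1 convolution producing one score map per object class; the attribute name model.head and the choice of trainable layers are illustrative assumptions and not the architecture described here.

import torch
import torch.nn as nn

def extend_head_for_new_classes(model, old_classes, new_classes):
    # replace the last 1x1 convolution so that it outputs additional feature score maps
    old_head = model.head                                      # assumed final classification layer
    new_head = nn.Conv2d(old_head.in_channels, old_classes + new_classes, kernel_size=1)
    with torch.no_grad():
        new_head.weight[:old_classes] = old_head.weight        # keep the kernels learned so far
        new_head.bias[:old_classes] = old_head.bias            # new kernels remain randomly initialized
    model.head = new_head
    return model

def freeze_all_but_last_layers(model, trainable_prefixes=("head", "decoder")):
    # freeze most layers and train only the last few layers during adaptation
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)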
Additional training can be achieved by means of few-shot learning (i.e. with a limited number of training data sets for a new object class).
There are cases where the segmenting neural network 40 is not directly accessible, e.g. in an autonomous driving system comprising a (primary) segmentation system.
In such case, a separate autonomous device 10', for instance a mobile device such as a smartphone, is provided in addition to the primary segmentation system.
The autonomous device 10' comprises a sensor, for instance a camera 12', for generating an input image pixel matrix, a segmenting neural network 40' (for instance a segmenting Bayesian neural network) and an uncertainty detector 60' that is configured to determine regions with matrix elements (corresponding to pixels of the input image pixel matrix) that exhibit an uncertainty score above a threshold; see figure 20. The uncertainty score map is composed of elements that correspond to the pixels of the input image pixel matrix (66), each element of the uncertainty map having an uncertainty score that is determined by the uncertainty detector 60' and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map 72' generated by the autonomous device 10'. The uncertainty score map may comprise regions with matrix elements that exhibit an uncertainty score above a threshold and thus represent edge cases for the object classification and image segmentation. The autonomous device 10' can be used stand-alone or as a second segmentation system. Preferably, the autonomous device's 10' segmenting neural network 40' is trained with the same classes as the segmenting neural network 40 of the primary image segmentation system.
A (primary) image segmentation system of an autonomous driving system typically does not provide means for uncertainty detection while the autonomous device 10' comprises an uncertainty detector 60'. Accordingly, the autonomous device 10' can act as a safety companion that can generate a user perceivable warning signal when a region in the uncertainty score map that is composed of elements having a high uncertainty score is found in a region of interest of the input image pixel matrix that represents the street in front of a vehicle. A region in the uncertainty score map that is composed of elements having a high uncertainty score typically represents an unknown object and thus an edge case with respect to object recognition.
The autonomous device may comprise a new object detector 80 that is operatively connected to the uncertainty detector 60'. The new object detector 80 is configured to find a region in the uncertainty score map that is composed of elements having a high uncertainty score and generate a new object class for such found region.
The autonomous device 10' can be configured to exchange labeled data sets comprising segments assigned to a newly generated object class with other autonomous devices 10' to thus increase the number of known object classes available for semantic segmentation by the autonomous devices.
The autonomous device may even record a user's (e.g. a driver's) reaction in response to a warning signal of the autonomous device. The type of reaction (or the absence of any reaction) can be used as input data for discriminating relevant unknown objects from less relevant unknown objects. The type of reaction (or the absence of any reaction) can also be used for automatically generating a label for a new object class. An emergency stop or an evasive maneuver as a driver's reaction indicate a high relevance of the unknown object. An absence of a user's reaction indicates a low relevance. An automatically generated label may reflect the degree of relevance of a new object class.
The autonomous device 10' can learn to discriminate unknown objects by automatically generating new object classes. Furthermore, the autonomous device 10' can learn a relevance level for each (new) object class and thus can adapt the warning signal level to the object class. If the autonomous device 10' determines the presence of an object (by way of image segmentation) in a region of interest of the input image pixel matrix, the warning signal can be generated in dependence of the relevance of the recognized object.
The autonomous device 10' can be configured to exchange data sets comprising data representing the relevance level of a known object class or an observed user reaction (observed behavior) with other autonomous devices.
For exchanging data sets with other autonomous devices, the autonomous device may comprise a data interface 82, in particular a wireless data interface 82 that allows data exchange over e.g. the internet.
The autonomous device 10' preferably is a pocketable, mobile device a user easily can carry and that can easily be mounted for instance to a windshield of a vehicle. Preferably, the autonomous device 10' is mounted in a position where the autonomous device's viewing angle at least in part corresponds to the viewing angle of the sensor 12 or sensors of the autonomous driving system.
The autonomous device may comprise a segmenting neural network 40' that implements a semantic segmentation model, which is preferably continuously trained with the output of the new object detector 80 as input to thus enable the neural network to predict new objects it has encountered before.
In a preferred embodiment, the output of the new object detector 80 is saved as a label of a dataset that further comprises the corresponding input image pixel matrix and thus serves as a training data set, see figures 21, 22 and 23.
The training data set can be transmitted to other autonomous devices over the internet, using the mobile data interface 82 of the autonomous device 10', i.e. the mobile phone.
The autonomous device 10' preferably is adapted for training and thus updating the semantic segmentation model implemented by the segmenting neural network 40' using training data sets comprising new objects as received from other autonomous devices. Thus, the autonomous device 10' can be enabled to predict edge cases encountered before by other, similar autonomous devices, see figure 24.
Preferably, the segmenting neural network 40' of the autonomous device 10' implements a model which is trained with the output (i.e. the training dataset generated by the new object detector) of the new object detector 80 as input, and with the recorded user reaction as secondary input, in order to learn to predict the user's reaction when encountering the particular new object. The data representing the action performed by the user (i.e. the driver), provided by source 84, can be derived either from a signal from the vehicle communications bus in case the autonomous device is connected to the vehicle communications bus, or from an algorithm or a model predicting the action from the motion of the vehicle as it is sensed by the sensors of the autonomous device 10', for instance the camera 12' of the autonomous device 10', e.g. by means of Optical Flow. A user's reaction when encountering the particular new object can be used for generating a new object class and a label for the new object class.
The training data set together with data representing the action performed by the user can be transmitted to other autonomous devices over the internet, using the mobile data interface 82 of the autonomous device 10'.
In a particular use case, the autonomous device 10' can be mounted behind the windshield of a vehicle with the autonomous device's camera facing in the direction of driving while executing a program that comprises an autonomous driving stack, connected to the output of the camera of the autonomous device as input, and providing a trajectory determined by the autonomous driving stack to the autonomous driving system of the vehicle over a connection between the autonomous device and the vehicle communications bus. In a further use case, the autonomous device 10' can be mounted behind the windshield of a vehicle with the camera facing in the direction of driving, executing a program that implements a prediction system of an autonomous driving stack, connected to the output of the camera of the autonomous device as input, and providing an object list determined by a perception system to the autonomous driving system of the vehicle over a connection between the autonomous device and the vehicle communications bus.
On the pixel level, the output of the new object detector 80 is a label in which the unknown object is labelled as a new object and everything else is labelled as background and assigned to an ignore class. The sensor input together with the label is a training dataset. Elements assigned to the ignore class will not cause a loss when the loss function is applied during training of the semantic segmentation model with the training data set comprising the new object. Thus, the training data set generated by the autonomous device 10' can be easily shared with other devices, without having to shift a massive amount of data since the training data set only comprises data that is relevant with respect to the new object class.
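The effect of the ignore class on the loss function can be sketched with PyTorch's cross-entropy loss; the ignore index value 255 is a common convention used here as an assumption, not a value prescribed above.

import torch
import torch.nn as nn

IGNORE_INDEX = 255
criterion = nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)

def training_step(model, image, label):
    # image: (B, 3, H, W); label: (B, H, W) with the new-object pixels set to their class id
    # and all background pixels set to IGNORE_INDEX
    logits = model(image)                 # (B, C, H, W) score maps
    return criterion(logits, label)       # pixels of the ignore class cause no loss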
The autonomous device 10' can easily receive training data sets for new object classes from other autonomous devices as an input, see figure 25.
Further, false positive new objects (i.e. objects that are not new but already known, yet may have caused a high uncertainty score on the pixel level due to other circumstances) can be identified based on context by means of determining the uncertainty of the detected new object when inserted into the context of another autonomous driving system or another autonomous device that encountered a similar situation.
False positive new objects can be avoided due to context by means of determining the uncertainty of the new object when inserted into a known context by the autonomous driving system or another autonomous device.
Another aspect is a perception system 90 where a secondary system 92 and a primary system 94 of a vehicle are combined to form a system that can determine regions with elements exhibiting a high uncertainty level even in matrices provided by the primary system. This is important because the primary system can be a proprietary system and thus a black box that is inaccessible from the outside. Accordingly, the primary system may not be modified for generating variance e.g. by means of Monte Carlo dropout to thus determine uncertainty scores. The additional secondary system may, however, be configured similar to the autonomous system 10' disclosed herein before and thus is capable of identifying regions with elements exhibiting a high uncertainty score.
The primary system 94 comprises a segmenting neural network 40 and an object list generator 96 that generates a list of objects corresponding to the segments generated by the segmenting neural network 40. The objects are aligned, associated, fused and managed in a sensor fusion system 98. The sensor fusion system 98 is implemented by a Bayesian filter, which is a probabilistic robotics algorithm. This algorithm indicates its level of uncertainty to its successor, e.g. by means of the Kalman gain. Based on this uncertainty, the successor will choose either the sensor input or the prediction made by the Bayesian filter. The primary system 94, however, is not Bayesian, so the Bayesian filter in the sensor fusion system 98 assumes a static uncertainty, calibrated only once for each sensor. With the secondary system 92, the perception system 90 is enabled to provide uncertainty, see figure 26.
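To illustrate how the Kalman gain expresses the filter's uncertainty, a minimal scalar update step is sketched below: a gain close to 1 lets the successor trust the sensor input, a gain close to 0 lets it trust the prediction. The scalar form is a simplification chosen for illustration only.

def kalman_update(x_pred, p_pred, z, r):
    # x_pred, p_pred: predicted state and its variance; z: measurement; r: measurement variance
    k = p_pred / (p_pred + r)            # Kalman gain
    x = x_pred + k * (z - x_pred)        # fused estimate
    p = (1.0 - k) * p_pred               # reduced variance after the fusion step
    return x, p, k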
The secondary system 92 preferably comprises multiple parallel redundant uncertainty detectors 60, e.g. a Bayesian neural network, a variational autoencoder, an object detector 80 trained with edge cases already encountered, and a new object detector 80 trained with new objects encountered but to be suppressed.
Preferably, the output of multiple parallel redundant uncertainty detectors in a parallel redundant architecture is matched on a per pixel basis, e.g. by the per pixel sum or by the per pixel maximum over the uncertainties or by means of a Bayesian filter, e.g. a particle filter.
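The per-pixel matching of several uncertainty maps can be sketched as follows; the choice between sum and maximum is exactly the one mentioned above, while the function name is an assumption.

import numpy as np

def fuse_uncertainty_maps(maps, mode="max"):
    # maps: list of (H, W) uncertainty score maps from the parallel redundant detectors
    stack = np.stack(maps)
    return stack.max(axis=0) if mode == "max" else stack.sum(axis=0)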
Preferably, a semantic segmentation model is trained with a new object training dataset as input to the semantic segmentation model, and the output of the semantic segmentation model and the outputs of one or multiple parallel redundant uncertainty detectors are fed as input to a supervisor model that is trained to predict a semantic segmentation with two classes, indicating if an entire segment is considered correctly or incorrectly predicted.
Preferably, aleatoric uncertainty with the segment boundaries in a semantic segmentation is matched by means of a rule-based system.
Preferably, pixel uncertainties are aggregated into segment uncertainties, e.g. by means of a threshold over a sum or a variance over pixel uncertainties. In another preferred embodiment, a semantic segmentation model is trained with a new object training dataset as input to the semantic segmentation model, and the output of the semantic segmentation model and the outputs of one or multiple parallel redundant uncertainty detectors are fed as input to a supervisor model that is trained to predict a semantic segmentation with two classes, indicating if an entire segment is considered correctly or incorrectly predicted.
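The aggregation of pixel uncertainties into segment uncertainties mentioned in the first sentence can be sketched as follows; the threshold value and the use of the mean as aggregate are illustrative assumptions.

import numpy as np

def segment_uncertainties(segment_map, uncertainty_map, threshold=0.3):
    # segment_map: (H, W) predicted class ids; uncertainty_map: (H, W) per-pixel scores
    result = {}
    for class_id in np.unique(segment_map):
        mask = segment_map == class_id
        mean_u = float(uncertainty_map[mask].mean())
        result[int(class_id)] = {"mean_uncertainty": mean_u,
                                 "uncertain": mean_u > threshold}   # threshold over the aggregate
    return result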
In order to train a Bayesian segmenting neural network 40 with training datasets from a segmenting neural network with an unknown structure by means of transfer learning, a system as shown in figure 27 may be provided. The system is configured to determine the loss between a segment map 54A generated by the unknown segmenting neural network from the input image pixel matrix of a training data set and a segment map 54B provided with the training dataset. The segment map 54B provides the labels for the input image matrix as generated by the segmenting neural network that has generated the training dataset. The system shown in figure 27 is configured to determine the loss (determined by loss function 76) between the segment map 54A generated by the unknown segmenting neural network and the segment map 54B belonging to the training dataset. A high loss indicates where the labels provided with the training dataset differ from the labels (i.e. segments) generated by the unknown segmenting neural network. Elements or pixels that exhibit a high loss can be assigned to an ignore class to thus avoid that these image parts affect the training of the Bayesian segmenting neural network by way of transfer learning with a training dataset as for instance generated by an autonomous system as disclosed above. In other words, elements assigned to the ignore class by means of loss function 76 will be ignored by the loss function 78 that is used for training the Bayesian segmenting neural network 40.
The Bayesian segmenting neural network 40 can, for instance, be the segmenting neural network of the autonomous device 10' while the unknown trained segmenting neural network can be the segmenting neural network of an autonomous driving system.
Figure 28 illustrates the training of a variational autoencoder 62', by way of transfer learning with a dataset as input to a (known or unknown) trained segmenting neural network. The dataset is used as input to the variational autoencoder. However, the dataset used as input for the variational autoencoder is modified by computing a loss for the pixels predicted by the trained segmenting neural network and assigning pixels with a high loss to the ignore class when training the variational autoencoder.
In other embodiments, the variational autoencoder 62' may implement a known model of a trained segmenting neural network. Figure 29 illustrates uncertainty detection by means of a variational autoencoder 62' (as a generative neural network) in order to determine pixels with a high loss due to uncertainty. In case an input dataset (for instance an input image pixel matrix) that comprises an unknown object is used as input to a trained neural network, e.g. for semantic segmentation, the pixels representing the unknown object should exhibit a high uncertainty score. The input dataset 66 is fed as an input to the variational autoencoder 62' and to a loss function 80. The prediction 82 of the variational autoencoder 62' is also fed to the loss function 80 in order to determine the loss between the input image pixel matrix data set 66 that may comprise data representing an unknown object and the prediction 82 (output data set) of the variational autoencoder 62'. Thus, a loss for the pixels predicted by the trained neural network can be computed and an uncertainty score map 74' can be generated accordingly. In figure 29, the uncertainty score map 74' comprises a segment with pixels having a high uncertainty score where pixels of the input image pixel matrix 66 represent an unknown object (yellow box in figure 29).
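Deriving the uncertainty score map from the reconstruction loss of the variational autoencoder can be sketched as follows; the assumption that the autoencoder returns a reconstruction of the same shape as the input, and the use of a per-pixel squared error, are illustrative choices.

import torch

def reconstruction_uncertainty(vae, image):
    # image: tensor of shape (1, 3, H, W); the variational autoencoder is assumed to
    # return a reconstruction of the same shape
    with torch.no_grad():
        reconstruction = vae(image)
    per_pixel_loss = (image - reconstruction).pow(2).mean(dim=1)   # squared error over the channels
    return per_pixel_loss.squeeze(0)                               # (H, W) uncertainty score map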
For training the variational autoencoder (see figure 28), pixels with a high loss can be ignored. In a preferred embodiment, the variational autoencoder used for determining uncertainty (i.e. pixels with a high loss) is configured as a Bayesian neural network that applies variation for instance by way of Monte Carlo dropout. Thus, the reliability of the loss determined with the help of the variational autoencoder can be determined by way of variational inference.
Figure 30 illustrates the use of one-shot learning in order for an uncertainty detector to suggest the most likely class for a pixel belonging to an unknown object by similarity to the known classes. This approach can also be used for automatically generating labels for new object classes.
Figure 31 illustrates that multiple parallel redundant uncertainty detectors 60 can be executed in a parallel redundant architecture, with all uncertainty detectors 60 sharing one common encoder network.
In figure 32 an alternative device 100 for object recognition is illustrated. The device 100 can be a smartphone that can be mounted behind the windshield of a vehicle or held in hand. The device 100 can be equipped with one or more cameras 102 for generating a video stream 104 that is fed to a neural engine 106 or a similar processing unit. The neural engine 106 is configured to generate an object list 108. For feeding the object list 108 to further devices, an output terminal 110 is provided. The output terminal can for example be a USB terminal (Universal Serial Bus) or an Ethernet terminal. The output terminal 110 can also be a wireless terminal using the wireless local area network (WLAN, WiFi) protocol or a Bluetooth interface.
Also for receiving a video stream, input terminals may be provided in addition or as an alternative to the camera or cameras 102. The input terminal may be a universal serial bus terminal (USB terminal), an Ethernet terminal, a wireless local area network terminal or a Bluetooth terminal. Such an input terminal enables the device 100 to receive a video stream from one or more external cameras, devices or smartphones connected to the device 100. The device 100 can be configured to generate (by means of the neural engine 106) and provide the object list to further devices. The object list 108 may comprise positions with two-dimensional or three-dimensional coordinates of objects. Preferably, the coordinates localize an object in the camera coordinate system. Further, each position is preferably annotated with a class, e.g. the class of a recognized object. The position may be annotated with a time interval and the position may be further annotated with an uncertainty score. Accordingly, the list of objects and positions 108 comprises for each recognized object an identifier for the object, coordinates that identify the position of the recognized object in the camera coordinate system, a time stamp providing information about the time interval when the recognized object was at the position identified by the coordinates and an uncertainty score that provides information about how reliable the object recognition is with respect to the particular recognized object.
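An illustrative data structure for one entry of the annotated object list 108 could look as follows; the field names are assumptions chosen to mirror the description above, not a normative format.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ObjectListEntry:
    object_id: int
    position: Tuple[float, ...]     # two- or three-dimensional coordinates in the camera coordinate system
    object_class: str               # class of the recognized object
    timestamp: float                # time interval / point in time of the observation
    uncertainty_score: float        # reliability of the recognition for this object
    polygon: Optional[list] = None  # optional polygon around the object, e.g. in polar coordinates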
The position of the recognized object can for instance be encoded by a polygon around the object in polar coordinates. Since the objects are recognized in frames of a video stream, it is also possible to determine how the position of an object changes from frame to frame. This makes it possible to generate trajectories that combine positions of two-dimensional or three-dimensional coordinates with the optional annotations and a time in the future. Such a trajectory describes where a certain object is expected to be at that time in the future. The annotated list of objects and positions can be forwarded to another, connected device by any one of the interfaces mentioned earlier, i.e. USB, Ethernet, WLAN and/or Bluetooth.
The device 100 further may be adapted to provide a video output 112 and/or to provide acoustic signals or messages via an audio output 116. The device may further be configured to trigger start or stop of recording.
Figure 32 illustrates a configuration of the neural engine that is suitable for object recognition. Neural engine 106 implements a segmenting neural network 120 with an encoder module 122 and a decoder module 124. As described here and further above, the encoder 122 comprises a down-sampling module 126 including an input layer for down-sampling an input image pixel matrix (for instance a frame of a video stream). The layers for down-sampling the input image pixel matrix may implement variational inference (as described herein further above) with partial model sampling and/or with consecutive video sampling and/or with a Lipschitz constraint for spectral normalization of the softmax values. Further, the layers for down-sampling may be configured to process simultaneous inputs for a number of different points in time. This can be achieved by a temporal shift of video frames or input image pixel matrices, respectively. In addition to down-sampling, the encoder 122 also provides a feature extraction module 128. Again, the encoder layers for feature extraction may implement temporal shift and/or variational inference, optionally with partial model sampling, and/or with consecutive video sampling and/or with a Lipschitz constraint for spectral normalization of the softmax values. The results of the feature extraction are provided to a feature fusion module 130 of the decoder 124 of the segmenting neural network. Feature fusion preferably is achieved by means of a Kalman feature wherein feature matching is achieved via the feature position as determined by the feature extraction module 128 of the encoder module 122. The feature tensors generated by the feature extraction module 128 of the encoder 122 capture spatial details of the input image pixel matrices. After feature fusion by the feature fusion module 130, the classifier module 132 of the decoder classifies each pixel of the input image pixel matrix by assigning each pixel to one of the object classes the segmenting neural network is trained for.
The segmenting neural network as shown in figure 32 can be used for object recognition and is capable of recognizing objects (i.e. determining segments with pixels that belong to a certain object class the segmenting neural network was trained for) and of encoding the segment with a polygon around the recognized object. Preferably, the polygon is annotated with a class of the recognized object.
For object recognition, the segmenting neural network of figure 32 comprises an encoder-decoder architecture and receives an image (i.e. a frame) from an input video stream. The segmenting neural network generates one feature map per object class with a softmax value of pixel-wise segment predictions. The encoder 122 of the semantic segmentation model of figure 33 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix and a feature extraction module 128 for feature extraction. As indicated by the dashed lines, the feature extraction module 128 may be configured to output information regarding the context and/or information regarding spatial detail. Information regarding context and/or spatial detail can be fed to the feature fusion module 130 of the decoder to thus allow position and/or context based feature fusion.
The feature fusion module 130 of the decoder is optional and preferably is provided in case of multiple inputs from different layers of the encoder. Preferably, at least one input is directly provided from the last layer of the down-sampling module 126 of the encoder 122. Further, the feature fusion module may be provided with one or more inputs from the inner layers of the feature extraction module.
Classification is performed pixel-wise. Optionally, the semantic segmentation module as shown in figure 33 may receive a secondary input from a depth estimation module as illustrated in figure 37.
The semantic segmentation module (segmenting neural network) may also implement polygon regression. Further, the annotated list of objects may comprise an annotation with a relative uncertainty score indicating the reliability of the classification and potentially hinting at a yet unknown class of object in case the uncertainty score is high. Further, the object list generated by the segmenting neural network as shown in figure 33 may comprise annotations that identify a proposed similar class of object in case of a high degree of uncertainty. In a preferred embodiment of device 100, neural engine 106 is configured to extract features from one or more input image pixel matrices, i.e. frames of video streams. The models implemented by the neural networks of the neural engine 106 may be configured to extract one or more of the following features:
- classes of objects (figures 32 and 33)
- classes of gestures (figures 34, 35, 36 and 37)
- classes of actions
- classes of awareness (figure 38) and/or
- forecasts of classes of action (figures 39 and 40)
For feature extraction the neural engine 106 implements one or more neural networks that preferably have an encoder-decoder architecture.
The input data set fed to a respective input layer of a down-sampling module of the encoder of the respective neural network depends on the feature to be extracted.
For recognizing classes of objects, a segmenting neural network implementing a semantic segmentation module as illustrated in figure 33 can be used. The input data sets are input image pixel matrices (frames) of one or more video input streams provided by one or more cameras.
If the features to be extracted are classes of gestures signaled by a person or a vehicle, preferably encoded by a polygon around the object, a neural network implementing a video action recognition model as illustrated in figure 35, optionally receiving a secondary input from a depth estimation model as illustrated in figure 37, can be used.
If the feature to be extracted is a class of awareness of a person or a vehicle, preferably encoded by a polygon around the object, a neural network implementing a semantic segmentation model as illustrated in figure 33 and/or a video action recognition model as illustrated in figure 35 may be used.
For a forecast of a class of action to be performed by a person or vehicle in the future, preferably encoded by a polygon around the object, a neural network implementing an action forecast model as illustrated in figure 40 may be used. The model preferably is configured to receive a video input stream. Regarding the output of the respective neural network it is preferred that in case of
- object recognition, the semantic segmentation model preferably provides a list with classes of objects and positions; see figure 33,
- class of gesture recognition, the video action recognition model preferably provides one anchor per recognized object encoding a polygon around the object and the class of gesture or action performed by that object,
- class of action recognition, the video action recognition model as illustrated in figure 35, preferably receiving a second input from a depth estimation model as illustrated in figure 37, provides an annotation of the class of action performed by a recognized object, the recognized object preferably being a person or a vehicle,
- recognition of a class of awareness of a person or a vehicle, the neural network preferably provides an annotation of a class of awareness for persons or vehicles recognized by means of the semantic segmentation module of figure 33 and/or the video action recognition model of figure 35, and
- a forecast of a class of action to be performed by a person or vehicle, preferably encoded by a polygon around the object, by means of an action forecast module as illustrated in figure 40, one anchor per object encoding a polygon around the object, an indicator of the class of action to be performed by that object and a time interval for which the class of action is anticipated is generated for a respective object as recognized by the semantic segmentation module of figure 33.
When comparing the models illustrated in figure 33, figure 35, figure 37 and figure 40, it is apparent that the models may share identical encoders having identical down-sampling modules and identical feature extraction modules. Further, the models may share identical feature fusion modules of the respective decoder. However, the classification module of the respective decoder varies depending on the feature to be extracted.
This allows implementing the neural engine with a multi-head architecture as illustrated in figure 31, where one encoder provides outputs to different decoders.
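The multi-head idea can be sketched as one shared encoder feeding several task-specific decoder heads; the class name and the dictionary of heads are illustrative assumptions, not the actual implementation of the neural engine.

import torch.nn as nn

class MultiHeadPerception(nn.Module):
    def __init__(self, encoder, heads):
        super().__init__()
        self.encoder = encoder              # shared down-sampling and feature extraction
        self.heads = nn.ModuleDict(heads)   # e.g. {"segmentation": ..., "action": ..., "depth": ...}

    def forward(self, x):
        features = self.encoder(x)
        return {name: head(features) for name, head in self.heads.items()}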
Implementation details of the models as illustrated in figure 33, figure 35, figure 37 and figure 40 can be summarized as follows:
For object recognition, preferably a semantic segmentation model according to figure 33 is provided.
The model preferably implements an encoder-decoder architecture and receives an image from a video input stream. The semantic segmentation model generates one feature map per class with the softmax values of pixel-wise segment predictions.
The encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix.
The encoder 122 further comprises a feature extraction module 128 for feature extraction. The encoder preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion. The decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
The feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
The decoder 124 preferably includes a classification module 132 for pixel-wise classification.
The semantic segmentation model may receive a secondary input from a depth estimation model as illustrated in figure 37, with their combination outlined in figure 51 and following.
The semantic segmentation model preferably implements polygon regression.
In the semantic segmentation model as illustrated in figure 33 all dashed connections are optional. Temporal shift for improvement of the semantic predictions by means of providing simultaneous input of m different points in time (10) is added. Further preferred features are variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint for spectral normalization of the softmax values (14).
For class of gesture recognition, preferably a video action recognition model according to figure 35 is provided. The video action recognition model preferably implements an encoder-decoder architecture and receives a video input stream.
The video action recognition model generates one anchor per recognized object encoding a polygon around the recognized object and the class of gesture or action performed by that recognized object. The encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix.
The encoder 122 further comprises a feature extraction module 128 for feature extraction. The encoder 122 preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion.
The encoder 122 preferably is configured for polygon-wise regression of the polygon and/or classification of objects and/or classification of gesture or action.
The encoder 122 preferably comprises one or more temporal shift modules as illustrated in figure 36. Such temporal shift module 140 may be inserted into the down-sampling module 126, and/or inserted into the feature extraction module 128 and/or inserted into the feature fusion module 130. The decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
The feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
The decoder 124 preferably includes a classification module 132 for generating one anchor per object encoding a polygon around the object and the class of gesture or action performed.
The video action recognition model may receive a secondary input from a depth estimation model as illustrated in figure 37, with their combination outlined in figure 51 and following.
In the video action recognition model as illustrated in figure 35 all dashed connections are optional. Optional connectors 7, 8, 9 were added for parallel instantiation of context and spatial detail. The classification module 132 is based on the head of Poly-YOLO, https://arxiv.org/abs/2005.13243. Optional temporal shift modules 140 for detection of actions over time by means of providing simultaneous input of m different points in time (10) are added. Further preferred features are variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint for spectral normalization of the softmax values (14). The classification modules 132 of the video action recognition model as illustrated in figure 35 and the action forecast model as illustrated in figure 40 are differently trained, i.e. trained with different training data sets and thus are different.
The depth estimation model as illustrated in figure 37 preferably implements an encoder- decoder architecture.
The encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix.
The encoder 122 further comprises a feature extraction module 128 for feature extraction.
The encoder 122 preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion.
The encoder 122 preferably comprises one or more temporal shift modules as illustrated in figure 36. Such temporal shift module 140 may be inserted into the down-sampling module 126, and/or inserted into the feature extraction module 128 and/or inserted into the feature fusion module 130.
The decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
The feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
The decoder 124 preferably includes a classification module 132 for depth estimation.
The depth estimation model may receive a secondary input from a semantic segmentation model as illustrated in figure 33, with their combination outlined in figure 51 and following. The video action recognition model may further generate annotations with respect to a relative value of uncertainty indicating an unknown class of gesture or action and/or indicating the case where another object is mistaken for a person or vehicle. In the depth estimation model as illustrated in figure 37, all dashed connections are optional. Optional connectors 7, 8, 9 were added for parallel instantiation of context and spatial detail. The classification module 132 is based on the head of Poly-YOLO, https://arxiv.org/abs/2005.13243. Optional temporal shift modules 140 for detection of actions over time by means of providing simultaneous input of m different points in time (10) are added. Further preferred features are variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint for spectral normalization of the softmax values (14).
For forecasting a class of action, preferably an action forecast model according to figure 40 is provided.
The action forecast model preferably implements an encoder-decoder architecture and receives a video input stream.
The video action forecast model generates one anchor per recognized object encoding a polygon around the recognized object, the class of action to be performed by that object, and a time interval for which the class of action is anticipated.
The encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix.
The encoder 122 further comprises a feature extraction module 128 for feature extraction.
The encoder 122 preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion.
The encoder 122 preferably is configured for polygon-wise regression of the polygon and/or classification of objects and/or classification of gesture or action.
The encoder 122 preferably comprises one or more temporal shift modules as illustrated in figure 36. Such temporal shift module 140 may be inserted into the down-sampling module 126, and/or inserted into the feature extraction module 128 and/or inserted into the feature fusion module 130. The decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
The feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
The decoder 124 preferably includes a classification module 132 for generating one anchor per recognized object encoding a polygon around the recognized object, the class of action to be performed by that object, and a time interval for which the class of action is anticipated.
The action forecast model may receive a secondary input from a semantic segmentation model as illustrated in figure 33, with their combination outlined in figure 51 and following. The action forecast model may receive a secondary input from a depth estimation model as illustrated in figure 37, with their combination outlined in figure 51 and following. The action forecast model may receive a secondary input from a video action recognition model as illustrated in figure 35, with their combination outlined in figure 51 and following.
The video action recognition model may further generate annotations with respect to relative value of uncertainty indicating the case where another object is mistaken for a person or vehicle. In the action forecast model as illustrated in figure 40, all dashed connections are optional. Optional connectors 7, 8, 9 were added for parallel instantiation of context and spatial detail. The classification module 132 is based on the head of Poly-YOLO, https://arxiv.org/abs/2005.13243. Optional temporal shift modules 140 for detection of actions over time by means of providing simultaneous input of m different points in time (10) are added. Further preferred features are variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint for spectral normalization of the softmax values (14).
The classification modules 132 of the video action recognition model as illustrated in figure 35 and the action forecast model as illustrated in figure 40 are differently trained, i.e. trained with different training data sets and thus are different. Training of the classification module 132 of the action forecast model as illustrated in figure 40 can be performed by unsupervised learning using labels for training data sets that are generated by the video action recognition model as illustrated in figure 35. For doing so, time shift is applied to learn anticipated actions for action forecasting from a sequence of previously recognized actions as recognized by the video action recognition model as illustrated in figure 35.
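As an illustration of this time-shift labeling scheme, the sketch below turns per-frame, per-track action recognitions into forecast training labels by looking a fixed number of frames ahead; the data structures, the function name and the shift length are assumptions made for illustration only.

```python
from typing import Dict, List, Optional

def make_forecast_labels(
    recognized_actions: List[Dict[str, str]],  # per-frame {track_id: action_class}
    shift_frames: int,
) -> List[Dict[str, Optional[str]]]:
    """Turn per-frame action recognitions into forecast training labels.

    The label assigned to frame t is the action that the recognition model
    observed `shift_frames` later for the same object track, so the forecast
    head learns to anticipate it. Frames near the end of the sequence get no
    label (None).
    """
    labels: List[Dict[str, Optional[str]]] = []
    for t, frame in enumerate(recognized_actions):
        future = t + shift_frames
        if future < len(recognized_actions):
            future_frame = recognized_actions[future]
            labels.append({tid: future_frame.get(tid) for tid in frame})
        else:
            labels.append({tid: None for tid in frame})
    return labels

# Example: a bicyclist "riding" at t=0 is labeled with the action it performs 30 frames later.
seq = [{"bike_1": "riding"}] * 30 + [{"bike_1": "turning_left"}] * 30
labels = make_forecast_labels(seq, shift_frames=30)
print(labels[0])   # {'bike_1': 'turning_left'}
```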
As indicated in figures 33, 35, 37 and 40, preferably temporal shift is implemented for detection of actions over time by means of providing simultaneous input of different points in time. For generating such temporal shift, a temporal shift module 140 as illustrated in figure 36 can be provided. The temporal shift module 140 is based on https://arxiv.org/abs/1811.08383. Optional support for m features in the temporal dimension instead of only one is added.
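A minimal sketch of the basic temporal shift operation (in the spirit of TSM, arXiv:1811.08383) is given below; the shift fraction and the zero padding at the clip boundaries are illustrative choices, and the extension to m features in the temporal dimension mentioned above is not shown.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """Minimal temporal shift over a clip.

    x: tensor of shape (batch, time, channels, height, width).
    A fraction of the channels is shifted one step backward in time, another
    fraction one step forward; the rest is left untouched. Zero padding is
    used at the temporal boundaries.
    """
    n, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift towards the past
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift towards the future
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # unshifted channels
    return out

# Example: 2 clips, 4 frames, 16 channels, 8x8 feature maps.
feat = torch.randn(2, 4, 16, 8, 8)
shifted = temporal_shift(feat)
print(shifted.shape)   # torch.Size([2, 4, 16, 8, 8])
```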
Figure 34 illustrates the use case of gesture recognition. The gestures to be recognized are "turning of head" of the recognized object "bicyclist" or "signaling turn" of the recognized object "bicyclist". Figure 38 illustrates awareness recognition. Depending on the orientation of a bicyclist's head, the recognized object "bicyclist" is provided with an annotation representing a class of awareness, i.e. "unaware" or "aware". As in all other cases illustrated herein, for each class annotation (e.g. "unaware" or "aware") an uncertainty score can be determined and assigned to the class annotation. Figure 39 illustrates action forecast for the recognized object "bicyclist". Depending on the situation (recognized lanes and a recognized non-moving object "car"), a forecast for the position of the object "bicyclist" is generated.
The models described above can be used to implement functions, e.g.:
a. to derive controllability of a situation from the awareness of persons or vehicles of the ego vehicle
b. to derive a risk estimation for each person or vehicle
   i. with severity
      1. based on their class
      2. and/or based on the gesture or action they perform
      3. and/or based on the action they are anticipated to perform
      4. and/or based on their protectedness
      5. and/or based on their distance to the ego vehicle
      6. and/or based on their motion towards the ego vehicle
      7. and/or based on their motion away from the ego vehicle
      8. and/or based on their acceleration
      9. and/or based on their deceleration
   ii. with controllability
      1. based on their awareness of the ego vehicle
      2. and/or based on the gesture or action they perform
      3. and/or based on their time available to react to behavior of the ego vehicle
         a. based on their distance to the ego vehicle
         b. and/or based on their motion towards the ego vehicle
         c. and/or based on their motion away from the ego vehicle
         d. and/or based on their acceleration
         e. and/or based on their deceleration
   iii. integrated by means of a probabilistic filter (e.g. Kalman filter)
c. to derive the velocity of each object
   i. from the change of depth (first derivative)
d. to derive the acceleration and/or deceleration of each object
   i. from the change of the change of depth (second derivative); a minimal sketch of items c. and d. is shown after this list
e. to derive probable trajectories of a person or vehicle, based on their probable actions likely to happen in the future
f. to trigger the start or stop of recording of images or video
   i. on the appearance or disappearance of a class of object
   ii. or on the appearance or disappearance of a class of gesture
   iii. or on the ego vehicle entering or leaving the line of sight of a person or vehicle
   iv. or on the appearance or disappearance of a class of action
   v. or on the likelihood of an action predicted to happen within a time interval
   vi. or on the appearance or disappearance of an edge case, i.e. an unknown class of object or action, based on the uncertainty
   vii. or if the prediction and the actual action differ
   viii. to learn a trigger
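The following sketch illustrates items c. and d. of the list above with simple finite differences over per-object depth estimates; function names, the frame interval and the example values are purely illustrative.

```python
def velocity_from_depth(depths, dt):
    """First derivative of per-object depth over time, i.e. velocity along the viewing ray (m/s)."""
    return [(d1 - d0) / dt for d0, d1 in zip(depths, depths[1:])]

def acceleration_from_depth(depths, dt):
    """Second derivative of per-object depth over time (m/s^2)."""
    v = velocity_from_depth(depths, dt)
    return [(v1 - v0) / dt for v0, v1 in zip(v, v[1:])]

# Example: an object whose estimated depth shrinks over four frames sampled at 10 Hz.
depths = [20.0, 18.5, 16.5, 14.0]
dt = 0.1
print(velocity_from_depth(depths, dt))      # negative values: the object is approaching
print(acceleration_from_depth(depths, dt))  # negative values: it is approaching faster and faster
```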
The models described above can also be used to implement use cases, e.g.:
a. for any action performed by a person or vehicle which the model has been trained to recognize
   i. generate a warning
   ii. and/or annotate the video recording with the license plate of the vehicle
   iii. and/or provide this information to an autonomous driving system
   iv. when a person, bicyclist or motorcyclist is about to cross the street
      1. generate a warning
      2. and/or provide this information to an autonomous driving system
   v. when a bicyclist, motorcyclist or vehicle signals a turn or lane change
      1. generate a warning
      2. and/or provide this information to an autonomous driving system
   vi. when a person, bicyclist or motorcyclist is unaware of the ego vehicle
      1. generate a warning
      2. and/or provide this information to an autonomous driving system
   vii. when a person, bicyclist or motorcyclist becomes aware of the ego vehicle
      1. generate a warning
      2. and/or provide this information to an autonomous driving system
   viii. when children play on the pavement or road side
      1. generate a warning
      2. and/or provide this information to an autonomous driving system
   ix. when emergency or police vehicles with flashing lights are present
      1. e.g. blue (e.g. Germany)
      2. and/or e.g. red or orange (e.g. U.S.)
      3. generate a warning
      4. and/or provide this information to an autonomous driving system
b. when a person or vehicle performs an unusual action which the model has not been trained to recognize
   i. generate a warning
   ii. and/or annotate the video recording with the license plate of the vehicle
   iii. and/or provide this information to an autonomous driving system
c. for any situation in which a person or vehicle appears and which the model has been trained to recognize
   i. generate a warning
   ii. and/or annotate the video recording with the license plate of the vehicle
   iii. and/or provide this information to an autonomous driving system
   iv. when a construction site is present
      1. generate a warning
      2. and/or provide this information to an autonomous driving system
   v. recognize street signs and display the latest recognized street sign on the screen
      1. recognize the speed limit from a street sign
      2. generate a warning when the actual speed deviates from the speed limit
         a. by a configurable amount or fraction
      3. recognize the case where a street sign has an additional label
         a. and/or display the fact on the screen
      4. and/or provide this information to an autonomous driving system
   vi. when the clearance to an obstacle becomes too low, and/or if this happens too fast
      1. generate a warning
      2. and/or provide this information to an autonomous driving system
d. when a person or vehicle appears in an unusual situation which the model has not been trained to recognize
   i. generate a warning
   ii. and/or annotate the video recording with the license plate of the vehicle
   iii. and/or provide this information to an autonomous driving system
e. learn a particular gesture or action
   i. to trigger video recording
   ii. to dismiss warnings
   iii. to launch an app
   iv. to call a contact or to dismiss a call
   v. to start or stop playing media
f. by a particular gesture or action
   i. trigger video recording
      1. e.g. a person performing a particular exercise when doing sports
         a. to trigger the start of recording when the particular exercise starts
         b. and/or to trigger the stop of recording when the exercise ends
   ii. dismiss warnings
   iii. trigger the launch of an app on the phone
   iv. trigger calling a contact or to dismiss a call
   v. trigger the start or stop of playing media
g. for any risk regarding any person or vehicle
   i. crossing a configurable threshold
   ii. generate a warning
   iii. and/or display the fact on the screen
   iv. and/or provide this information to an autonomous driving system

Figure 41 illustrates the use case of uncertainty recognition.
For uncertainty recognition, a cascaded monitoring concept as illustrated in figure 42 may be used. The cascaded monitoring concept 150 may comprise two parallel situation monitors 152 to recognize edge cases with respect to the dataset, deriving segments from pixel-wise uncertainty, encoded by a polygon around the segment.
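As an illustration of deriving polygon-encoded segments from a pixel-wise uncertainty map, the sketch below thresholds the map and extracts one polygon per connected high-uncertainty region; the use of OpenCV (version 4 or later) for contour extraction and the threshold value are assumptions for illustration, not part of the described system.

```python
import numpy as np
import cv2  # OpenCV is only one possible tool for contour extraction

def uncertainty_polygons(uncertainty_map: np.ndarray, threshold: float):
    """Derive edge-case segments from a pixel-wise uncertainty map: threshold the
    map and encode each connected high-uncertainty region by a polygon."""
    mask = (uncertainty_map > threshold).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # approximate each contour by a coarser polygon
    return [cv2.approxPolyDP(c, 2.0, True).reshape(-1, 2) for c in contours]

# Example: a blob of high uncertainty in an otherwise confident map.
u = np.zeros((64, 64), dtype=np.float32)
u[20:40, 30:50] = 0.9
polys = uncertainty_polygons(u, threshold=0.5)
print(len(polys), polys[0].shape)   # one polygon with (N, 2) vertices
```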
A primary situation monitor 152 matches the pixel-wise reconstruction loss of an autoencoder model to segments by means of a model with the same topology as the classification module of the decoder of the action recognition model. The model may be configured for matching pixel-wise epistemic uncertainty to segment boundaries, with pixel-wise epistemic uncertainty provided by the segmentation model and/or by the depth estimation model and/or optionally by the autoencoder model.
Alternatively, the model may be configured for matching pixel-wise aleatoric uncertainty to segment boundaries to recognize the case where multiple overlapping segments have the same class and there is not enough epistemic uncertainty.
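The aggregation underlying this matching step can be illustrated as follows: pixel-wise uncertainty is pooled over each segment so that a segment-level score can be compared against a threshold. This is only a sketch of the idea; the described monitor uses a learned model with the topology of the classification module rather than a fixed pooling rule, and the threshold is illustrative.

```python
import numpy as np

def segment_uncertainty(uncertainty_map: np.ndarray,
                        segment_map: np.ndarray) -> dict:
    """Aggregate a pixel-wise uncertainty score map to segment level.

    uncertainty_map: (H, W) float array, e.g. epistemic uncertainty per pixel.
    segment_map:     (H, W) int array, one segment id per pixel.
    Returns the mean uncertainty per segment id; segments whose mean exceeds a
    threshold can then be flagged as edge cases and encoded by a polygon.
    """
    scores = {}
    for seg_id in np.unique(segment_map):
        mask = segment_map == seg_id
        scores[int(seg_id)] = float(uncertainty_map[mask].mean())
    return scores

# Example: a 4x4 image with two segments; segment 1 carries high uncertainty.
u = np.array([[0.1, 0.1, 0.8, 0.9],
              [0.1, 0.2, 0.9, 0.9],
              [0.1, 0.1, 0.8, 0.7],
              [0.1, 0.1, 0.9, 0.8]])
s = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
print(segment_uncertainty(u, s))  # segment 0 around 0.11, segment 1 around 0.84
```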
Figure 43 illustrates a situation monitor 154 based on the head of Poly-YOLO (https://arxiv.org/abs/2005.13243).
The situation monitor 154 implements an autoencoder model 160 to provide the reconstruction loss.
The model preferably implements an encoder-decoder architecture and receives an image from a video input stream.
The semantic segmentation model generates one feature map per class with the softmax values of pixel-wise segment predictions. The encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix. The encoder 122 further comprises a feature extraction module 128 for feature extraction.
The encoder preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion. The decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
The feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
The decoder 124 preferably includes a classification module 132 for reconstruction of the input image and computation of a pixel-wise reconstruction loss.
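A minimal sketch of such a pixel-wise reconstruction loss is shown below; the squared error averaged over the channel dimension is one possible choice and is not prescribed by the description.

```python
import torch

def pixelwise_reconstruction_loss(reconstruction: torch.Tensor,
                                  image: torch.Tensor) -> torch.Tensor:
    """Per-pixel reconstruction loss of the autoencoder head.

    Both tensors have shape (batch, channels, H, W); the squared error is
    averaged over the channel dimension so the result is a (batch, H, W)
    map that the situation monitor can match to segments.
    """
    return ((reconstruction - image) ** 2).mean(dim=1)

# Example: an input the autoencoder reconstructs poorly (e.g. an unknown object)
# produces a region of high loss in the map.
img = torch.rand(1, 3, 64, 64)
rec = img.clone()
rec[:, :, 20:40, 20:40] = 0.0          # pretend reconstruction fails in this patch
loss_map = pixelwise_reconstruction_loss(rec, img)
print(loss_map.shape)                   # torch.Size([1, 64, 64])
```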
Figure 44 illustrates an autoencoder model 160. All dashed connections are optional. Optional connectors 7, 8, 9 were added for parallel instantiation of context and spatial detail. Depth estimation module 132 is independently designed with symmetry to the classification module. Optional temporal shift modules 140 for detection of actions over time by means of providing simultaneous input of m different points in time (10) are added. Further preferred features are variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint for spectral normalization of the softmax values (14).
The other situation monitor 152 implements a model to match the pixel-wise reconstruction loss of an autoencoder model to segments.
The other situation monitor 152 comprises two parallel validity monitors 156 and 158. One validity monitor is a Bayesian validity monitor 158 comprising dropout layers, inserted into convolutional submodules, for variational inference, deriving uncertainty by means of sampling under dropout, and a sampler for sampling the model with partial model sampling and/or with consecutive video sampling. Matching the edge cases found to the segments can be done by means of a probabilistic filter (e.g. a Kalman filter).
Figure 45 illustrates a convolutional submodule.
Figure 46 illustrates a convolutional submodule with dropout. This module can be based on https://arxiv.org/abs/1703.04977 and https://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf.
Figure 47 illustrates a Bayesian sampling module.
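The sketch below illustrates a convolutional submodule with dropout and sampling under dropout in the sense of MC dropout (cf. the references above). It shows plain full-model sampling only, i.e. neither partial model sampling nor consecutive video sampling, and the layer sizes, dropout rate and sample count are illustrative.

```python
import torch
import torch.nn as nn

class ConvDropoutBlock(nn.Module):
    """Convolutional submodule with dropout that stays active at inference time."""
    def __init__(self, in_ch: int, out_ch: int, p: float = 0.2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.drop = nn.Dropout2d(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.drop(torch.relu(self.conv(x)))

def sample_uncertainty(model: nn.Module, x: torch.Tensor, n_samples: int = 8):
    """Sample the model several times under active dropout and return the
    mean prediction and the per-element variance as an uncertainty estimate."""
    model.train()  # keep dropout active ("sampling under dropout")
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.var(dim=0)

# Example with a toy one-block model.
model = ConvDropoutBlock(3, 4)
mean, var = sample_uncertainty(model, torch.rand(1, 3, 32, 32))
print(var.shape)   # torch.Size([1, 4, 32, 32])
```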
The other validity monitor 156 can be a Lipschitz validity monitor with Lipschitz constraints inserted into convolutional submodules, normalizing the softmax values so that they correlate with uncertainty. This uncertainty is optionally provided to the situation monitor by means of the softmax values and/or by means of the softmax variance.
Figure 48 illustrates a Lipschitz submodule. This module can be based on https://arxiv.org/abs/2102.11582 and https://arxiv.org/abs/2106.02469.
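One way to impose such a Lipschitz constraint on a convolutional submodule is spectral normalization of its weights, as sketched below; this is an assumption about the implementation, guided by the cited papers, not the exact construction of the Lipschitz submodule in figure 48.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class LipschitzConvBlock(nn.Module):
    """Convolutional submodule whose weight is spectrally normalized, giving an
    (approximate) Lipschitz constraint so that the magnitude of the softmax
    outputs stays correlated with distance to the training data
    (cf. arXiv:2102.11582, arXiv:2106.02469)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.conv(x))

# Example: the normalized block can replace an ordinary convolutional submodule.
block = LipschitzConvBlock(3, 8)
y = block(torch.rand(1, 3, 32, 32))
print(y.shape)   # torch.Size([1, 8, 32, 32])
```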
Uncertainty can be provided to probabilistic filters.
Figure 49 illustrates an integration of the situation monitor with a Kalman filter. Preferably, similar classes are predicted in case of uncertainty.
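A minimal scalar sketch of one possible integration is given below: the uncertainty reported by the situation monitor inflates the measurement noise of a Kalman update, so that uncertain detections move the tracked state less. The scaling rule, the variable names and the example values are illustrative assumptions.

```python
def kalman_update(x: float, p: float, z: float, r_base: float,
                  uncertainty: float) -> tuple:
    """One scalar Kalman update whose measurement noise is inflated by the
    perception uncertainty, so low-confidence detections influence the state less.

    x, p        : prior state estimate and its variance
    z           : measurement (e.g. object position or depth)
    r_base      : base measurement noise variance
    uncertainty : uncertainty score from the situation monitor (>= 0)
    """
    r = r_base * (1.0 + uncertainty)      # inflate noise with uncertainty
    k = p / (p + r)                       # Kalman gain
    x_new = x + k * (z - x)
    p_new = (1.0 - k) * p
    return x_new, p_new

# A confident measurement pulls the estimate strongly, an uncertain one barely moves it.
print(kalman_update(10.0, 1.0, 12.0, r_base=0.5, uncertainty=0.0))  # about (11.33, 0.33)
print(kalman_update(10.0, 1.0, 12.0, r_base=0.5, uncertainty=9.0))  # about (10.33, 0.83)
```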
Figure 50 illustrates a use case of similarity prediction.
Preferably, all or a subset of the models used in the extraction of features are integrated within a single (multi-head) model, consisting of one or more shared encoders. The multi-head model may comprise a segmentation head and/or a depth estimation head and/or an autoencoder head and/or a video action recognition head and/or an action forecast head.
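The structure of such a multi-head model can be sketched as one shared encoder feeding several task heads, as below; the layer sizes and head architectures are placeholders and do not reflect the Poly-YOLO-based heads shown in the figures.

```python
import torch
import torch.nn as nn

class MultiHeadPerceptionModel(nn.Module):
    """Minimal sketch of a multi-head model with one shared encoder and
    task-specific heads (segmentation, depth, autoencoder, action recognition)."""
    def __init__(self, num_classes: int = 10, num_actions: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.segmentation_head = nn.Conv2d(64, num_classes, 1)
        self.depth_head = nn.Conv2d(64, 1, 1)
        self.autoencoder_head = nn.Conv2d(64, 3, 1)
        self.action_head = nn.Conv2d(64, num_actions, 1)

    def forward(self, x: torch.Tensor) -> dict:
        feats = self.encoder(x)          # shared features, computed once
        return {
            "segmentation": self.segmentation_head(feats),
            "depth": self.depth_head(feats),
            "reconstruction": self.autoencoder_head(feats),
            "actions": self.action_head(feats),
        }

out = MultiHeadPerceptionModel()(torch.rand(1, 3, 128, 128))
print({k: tuple(v.shape) for k, v in out.items()})
```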
Figure 51 gives an overview of a multi-functional multi-head model.
Figure 52 illustrates an encoder of a multi-functional model. All dashed connections are optional. Temporal shift (10), variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint (14), have been added, each one optionally.
Figure 53 illustrates a segmentation head (i.e. classification module) of a multi-functional model. All dashed connections are optional. A temporal shift, a Lipschitz constraint and dropout for variational inference to all layers of the decoder and a Lipschitz constraint to all layers of the situation monitor have been added (each one optionally). The situation monitor processes input from the Bayesian variance of variational inference (optionally), and/or the softmax values normalized by the Bi-Lipschitz constraint (optionally), and/or the softmax variance normalized by the Bi-Lipschitz constraint (optionally).

Figure 54 illustrates a video action recognition head (i.e. classification module) of a multi-functional model. All dashed connections are optional. A temporal shift, a Lipschitz constraint and dropout for variational inference to all layers have been added (each one optionally).
Figure 55 illustrates a depth estimation head (i.e. classification module) of a multi-functional model. All dashed connections are optional. A temporal shift, a Lipschitz constraint and dropout for variational inference to all layers of the decoder and a Lipschitz constraint to all layers of the situation monitor have been added (each one optionally). The situation monitor processes input from the Bayesian variance of variational inference (optionally), and/or the softmax values normalized by the Bi-Lipschitz constraint (optionally), and/or the softmax variance normalized by the Bi-Lipschitz constraint (optionally).
Figure 56 illustrates an autoencoder head (i.e. classification module) of a multi-functional model. All dashed connections are optional. A Lipschitz constraint is added to all layers of the decoder and to all layers of the situation monitor (each one optionally). The situation monitor processes input from the pixel-wise reconstruction loss of the autoencoder head (optionally), and/or the softmax values normalized by the Bi-Lipschitz constraint (optionally), and/or the softmax variance normalized by the Bi-Lipschitz constraint (optionally).
Figure 57 illustrates an action forecast head (i.e. classification module) of a multi-functional model. All dashed connections are optional. A temporal shift, a Lipschitz constraint and dropout for variational inference to all layers have been added (each one optionally).

Figure 58 illustrates how the system can be implemented following the Sense-Plan-Act concept. Perception, planning and control subsystems shall each have their own hardware. The perception systems, in particular, shall be integrated with their sensor hardware. The map shall be instantiated on the hardware of the planning subsystem.
The sensors preferably comprise six vision sensors for the near range, each 120 degrees field of view, each 40 degrees overlapping and each instantiating one separate perception subsystem; see figure 59. The sensors preferably comprise one vision sensor with 60 degrees field of view, instantiating one separate perception subsystem for the medium range.
List of reference numerals
5 optional connector
6 optional connector
7 optional connector
8 optional connector
9 optional connector
(10) Temporal Shift
(11) Consecutive video sampling
(12) Partial model sampling
(13) Variational inference
(14) Lipschitz constraint
10 vision system
10' autonomous device
12 camera
14 data processing unit
16 memory
18 communication interface
20 display unit
22 output data interface
24 sensor fusion system
40, 40’ segmenting neural network
42 encoder part
44 decoder part
46 convolutional layers
48 pooling layers
50 feature map
52 feature score map
54 segment map
54 A, 54B segment map
56 segment
58 segment
60, 60' uncertainty detector
62 generative neural network
62’ variational autoencoder
64 tensor library
66 image pixel matrix
68 label
70 loss function
72, 72' segment map
74, 74' uncertainty score map
76 loss function
78 loss function
80 loss function
82 prediction of a generative neural network
90 unknown object (box)
92 known object (street marking)
100 device
102 camera
104 video stream
106 neural engine
108 object list
110 output terminal
112 video output
116 audio output
120 segmenting neural network
122 encoder
124 decoder
126 block
128 block
130 feature fusion layer
140 temporal shift module
150 cascaded monitoring concept
152 situation monitor
154 situation monitor
156 validity monitor
158 validity monitor
Bibliographical List
Smith, Lewis et al.: Can convolutional ResNets approximately preserve input distances? A frequency analysis perspective; 17 June 2021 (https://arxiv.org/abs/2106.02469)
Mukhoti, Jishnu et al.: Deterministic Neural Networks with Inductive Biases Capture Epistemic and Aleatoric Uncertainty; 8 June 2021 (https://arxiv.org/abs/2102.11582)
An, Shan et al.: Real-Time Monocular Human Depth Estimation and Segmentation on Embedded Systems; 24 August 2021
Lin, Ji et al.: TSM: Temporal Shift Module for Efficient Video Understanding; 22 August 2019 (https://arxiv.org/abs/1811.08383)
Hurtik, Petr et al.: Poly-YOLO: Higher Speed, More Precise Detection and Instance Segmentation for YOLOv3; 29 May 2020 (https://arxiv.org/abs/2005.13243)
Poudel, Rudra PK et al.: Fast-SCNN: Fast Semantic Segmentation Network; 12 February 2019 (https://arxiv.org/abs/1902.04502)
Gal, Yarin: Uncertainty in Deep Learning; September 2016 (http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf)
Kendall, Alex et al.: What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?; 5 October 2017 (https://arxiv.org/pdf/1511.02680v2.pdf)
Kim, Wonjik et al.: Unsupervised Learning of Image Segmentation Based on Differentiable Feature Clustering; 20 July 2020 (https://arxiv.org/pdf/2007.09990.pdf)
Ghahramani, Zoubin: Probabilistic machine learning and artificial intelligence; 28 May 2015 (https://www.repository.cam.ac.uk/bitstream/handle/1810/248538/Ghahramani%202015%20Nature.pdf?sequence=1)
Goodfellow, Ian et al.: Deep Learning, Chapter 6; 18 November 2016 (https://www.deeplearningbook.org/contents/mlp.html)
Goodfellow, Ian et al.: Deep Learning, Chapter 5; 18 November 2016 (https://www.deeplearningbook.org/contents/ml.html)
Goodfellow, Ian et al.: Deep Learning, Chapter 9; 18 November 2016 (https://www.deeplearningbook.org/contents/convnets.html)
Jadon, Shruti: An Overview of Deep Learning Architectures in Few-Shot Learning Domain; 19 August 2020
(https://arxiv.org/pdf/2008.06365.pdf)
Li, Shen et al.: PyTorch Distributed: Experiences on Accelerating Data Parallel Training; 28 June 2020
(https://arxiv.org/pdf/2006.15704.pdf)
Schnieders, Benjamin et al.: Fully Convolutional One-Shot Object Segmentation for Industrial Robotics; 2 March 2019
(https://arxiv.org/pdf/1903.00683.pdf)

Claims
1. Method for generating new object classes, said method including the steps of:
detecting a region in the uncertainty score map that is composed of elements having a high uncertainty score, said region representing a first unknown object,
generating a tentative new object class for said unknown object and automatically generating a label, and
detecting a further region in a further uncertainty score map that is composed of elements having a high uncertainty score, said region representing a further unknown object,
determining a similarity between the first and the further unknown object, and,
if the similarity exceeds a predetermined threshold, generating a non-tentative new object class.
2. Method according to claim 1, wherein generating a non-tentative new object class includes one-shot learning or few-shot learning using samples that represent the first unknown object and the further unknown object.
3. System comprising a primary segmentation system and a separate autonomous device, wherein the primary segmentation system comprises a primary perception system comprising a segmenting neural network (40) that is configured and trained for segmentation of an input image pixel matrix (66) to thus generate a segment map (54; 72) composed of elements that correspond to the pixels of the input image pixel matrix (66), each element of the segment map (54, 72) being assigned to one of a plurality of object classes the segmenting neural network (40) is trained for by way of class prediction, elements being assigned to the same object class forming a segment of the segment map (54, 72), and wherein the autonomous device (10') comprises a sensor for generating an input image pixel matrix comprised of matrix elements, a segmenting neural network and an uncertainty detector (60') that is configured to generate an uncertainty score map (74) composed of matrix elements that correspond to the pixels of the input image pixel matrix (66), each matrix element of the uncertainty map having an uncertainty score that is determined by the uncertainty detector (60') and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map (72), and wherein the autonomous device (10') further comprises signaling means that are configured to generate a user perceivable signal in case the uncertainty map generated by the uncertainty detector (60') of the autonomous device (10') comprises regions comprised of matrix elements that exhibit an uncertainty score above a threshold and thus represent edge cases for the object classification and image segmentation.
4. Autonomous device (10') comprising a sensor for generating an input image pixel matrix comprised of matrix elements, a segmenting neural network and an uncertainty detector (60') that is configured to generate an uncertainty score map (74) composed of matrix elements that correspond to the pixels of the input image pixel matrix (66), each matrix element of the uncertainty map having an uncertainty score that is determined by the uncertainty detector (60') and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map (72), the autonomous device (10') further comprising signaling means that are configured to generate a user perceivable signal in case the uncertainty map generated by the uncertainty detector (60') of the autonomous device (10') comprises regions comprised of matrix elements that exhibit an uncertainty score above a threshold and thus represent edge cases for the object classification and image segmentation.
5. Autonomous device according to claim 4, further comprising a new object detector that is operatively connected to the uncertainty detector of the autonomous device and that is configured to find a region in the uncertainty score map that is composed of elements having a high uncertainty score and generate a new object class for such found region.
6. Autonomous device according to claim 4 or 5, wherein the autonomous device is configured to exchange labeled data sets comprising segments assigned to a newly generated object class with other autonomous devices to thus increase the number of known object classes available for semantic segmentation by the autonomous devices.
7. Autonomous device according to at least one of claims 4 to 6, wherein the autonomous device is configured to record a reaction in response to a warning signal emitted by the autonomous device.
8. Autonomous device according to at least one of claims 4 to 7, wherein the autonomous device is configured for learning to discriminate unknown objects by automatically generating new object classes.
9. Autonomous device according to claim 8, wherein the autonomous device is configured for automatically generating a label for a new object class based on a user's reaction in the context of encountering a yet unknown object.
10. Autonomous device according to at least one of claims 4 to 9, wherein the autonomous device is a pocketable, mobile device a user easily can carry and that can easily be mounted for instance to a windshield of a vehicle.
11. Autonomous device according to at least one of claims 4 to 10, further comprising a segmenting neural network that implements a semantic segmentation model, which is preferably continuously trained with the output of the new object detector as input to thus enable the neural network to predict new objects encountered before.
12. System comprising an input for a sequence of input image pixel matrixes (66) derived from an input video stream and a perception system comprising a segmenting neural network (40) that is configured and trained for segmentation of an input image pixel matrix (66) to thus generate a segment map (54; 72) composed of elements that correspond to the pixels of the input image pixel matrix (66), each element of the segment map (54, 72) being assigned to one of a plurality of object classes the segmenting neural network (40) is trained for by way of class prediction, elements being assigned to the same object class forming a segment of the segment map (54, 72), said system being configured for generating from the segments of the segment map a list of objects, encoded by a polygon around each object and annotated with the class of the object, said polygon characterizing the position of the corresponding segment with respect to the input image pixel matrix.
13. System according to claim 12, wherein the system comprises an interface for forwarding the list of objects to another device.
14. System according to claim 12 or 13, wherein the perception system implements a video action recognition model, combining a multi-path segmentation encoder (122) with a Poly-Yolo head (132), said video action recognition model comprising a temporal shift module (140).
15. System according to claim 14, wherein the perception system implements a video action recognition model comprising a feature fusion module (130) with inputs from layers of a down-sampling module (126) and/or a feature extraction module (128) of the encoder (122) of the video action recognition model.
16. System according to at least one of claims 12 to 15, configured for forecasting actions performed by persons or vehicles, receiving an input video stream from the cameras and sending a list of objects, encoded by a polygon around the object and annotated with the class of anticipated action, over USB, Wi-Fi, Bluetooth, Ethernet, on a smartphone or a device.
17. System according to claim 16, implementing an action forecast model that combines a multi-path segmentation encoder with a Poly-Yolo classification module of the decoder, further implementing temporal shift and/or variational inference with partial model sampling and/or with consecutive video sampling and/or a Lipschitz constraint.
EP22730378.1A 2021-05-17 2022-05-17 System for detection and management of uncertainty in perception systems, for new object detection and for situation anticipation Pending EP4341913A2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
EP21174146 2021-05-17
EP21204038 2021-10-21
PCT/EP2022/063359 WO2022243337A2 (en) 2021-05-17 2022-05-17 System for detection and management of uncertainty in perception systems, for new object detection and for situation anticipation

Publications (1)

Publication Number Publication Date
EP4341913A2 true EP4341913A2 (en) 2024-03-27

Family

ID=82058472

Family Applications (1)

Application Number Title Priority Date Filing Date
EP22730378.1A Pending EP4341913A2 (en) 2021-05-17 2022-05-17 System for detection and management of uncertainty in perception systems, for new object detection and for situation anticipation

Country Status (2)

Country Link
EP (1) EP4341913A2 (en)
WO (1) WO2022243337A2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935160A (en) * 2023-07-19 2023-10-24 上海交通大学 Training method, sample classification method, electronic equipment and medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9201421B1 (en) * 2013-11-27 2015-12-01 Google Inc. Assisted perception for autonomous vehicles
US10803328B1 (en) * 2017-11-15 2020-10-13 Uatc, Llc Semantic and instance segmentation
WO2020240477A1 (en) * 2019-05-31 2020-12-03 Thales Canada Inc. Method and processing device for training a neural network
US11314258B2 (en) * 2019-12-27 2022-04-26 Intel Corporation Safety system for a vehicle

Also Published As

Publication number Publication date
WO2022243337A3 (en) 2023-01-12
WO2022243337A2 (en) 2022-11-24
WO2022243337A9 (en) 2023-03-09

Similar Documents

Publication Publication Date Title
US11634162B2 (en) Full uncertainty for motion planning in autonomous vehicles
JP7430277B2 (en) Obstacle detection method and apparatus, computer device, and computer program
CN110363058B (en) Three-dimensional object localization for obstacle avoidance using one-shot convolutional neural networks
KR101995107B1 (en) Method and system for artificial intelligence based video surveillance using deep learning
US20230386167A1 (en) System for detection and management of uncertainty in perception systems
EP3511863B1 (en) Distributable representation learning for associating observations from multiple vehicles
CN108388834A (en) The object detection mapped using Recognition with Recurrent Neural Network and cascade nature
JP2022516288A (en) Hierarchical machine learning network architecture
Itkina et al. Dynamic environment prediction in urban scenes using recurrent representation learning
US11172168B2 (en) Movement or topology prediction for a camera network
Wei et al. Survey of connected automated vehicle perception mode: from autonomy to interaction
Tsintotas et al. Tracking‐DOSeqSLAM: A dynamic sequence‐based visual place recognition paradigm
Kolekar et al. Behavior prediction of traffic actors for intelligent vehicle using artificial intelligence techniques: A review
US11875680B2 (en) Systems and methods for augmenting perception data with supplemental information
Sharma et al. Vehicle identification using modified region based convolution network for intelligent transportation system
US20230230484A1 (en) Methods for spatio-temporal scene-graph embedding for autonomous vehicle applications
US20220309794A1 (en) Methods and electronic devices for detecting objects in surroundings of a self-driving car
JP2022164640A (en) System and method for dataset and model management for multi-modal auto-labeling and active learning
Nagrath et al. Understanding new age of intelligent video surveillance and deeper analysis on deep learning techniques for object tracking
EP4341913A2 (en) System for detection and management of uncertainty in perception systems, for new object detection and for situation anticipation
US20230260259A1 (en) Method and device for training a neural network
BOURJA et al. Real time vehicle detection, tracking, and inter-vehicle distance estimation based on stereovision and deep learning using YOLOv3
Jagadish et al. Autonomous vehicle path prediction using conditional variational autoencoder networks
WO2023017317A1 (en) Environmentally aware prediction of human behaviors
EP4281945A1 (en) Static occupancy tracking

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20231215

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR