EP4341913A2 - System for detection and management of uncertainty in perception systems, for detection of new objects and for situation anticipation - Google Patents

System for detection and management of uncertainty in perception systems, for detection of new objects and for situation anticipation

Info

Publication number
EP4341913A2
Authority
EP
European Patent Office
Prior art keywords
uncertainty
neural network
map
class
score
Prior art date
Legal status
Pending
Application number
EP22730378.1A
Other languages
English (en)
French (fr)
Inventor
Ralph Meyfarth
Sven Fuelster
Current Assignee
Deep Safety GmbH
Original Assignee
Deep Safety GmbH
Priority date
Filing date
Publication date
Application filed by Deep Safety GmbH
Publication of EP4341913A2

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/50 - Context or environment of the image
    • G06V 20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Definitions

  • the invention refers to a system for detection and management of uncertainty in perception systems. In further embodiments the invention refers to a system for new object detection and/or for situation anticipation based on detected uncertainty.
  • Computer vision can for instance be used in perception systems.
  • deep neural networks learn to recognize classes of objects in images.
  • Computer vision includes simple classification, in which a whole image is classified with the object class of its most dominant object; object detection, in which a neural network predicts bounding boxes around the objects in the image; semantic segmentation, in which every pixel of an image is classified (labeled) with the object class of the object to which it belongs; optical flow, which predicts a field of vectors of movement for the objects shown; as well as derived methods like instance segmentation, panoptic segmentation, video segmentation and more.
  • the neural network needs to be trained by means of a training dataset in order to learn classes of objects or features to predict.
  • this training dataset comprises pairs of an input image and a corresponding label with the desired result (i.e. the object class represented in the input image).
  • a label can be a simple class name for the whole image in the case of classification; or a pixel-wise hand-labelled image in the case of semantic segmentation.
  • In the following we will most often refer to the case of semantic segmentation.
  • the methods and techniques claimed shall be applicable to all the various computer vision techniques and all the various architectures of neural networks associated with these techniques.
  • the invention relates in particular to semantic segmentation of images by means of a segmenting neural network.
  • Images are typically composed of pixels that represent a picture taken by a camera with a lens that projects an image on an image sensor.
  • the image sensor converts the projected picture into a pixel matrix that represents the picture.
  • a picture or image can for instance be a frame of a video stream.
  • the pixel matrix representing an image can be fed to an input layer of a segmenting neural network and thus be used as an input image pixel matrix.
  • Each input image pixel matrix represents a sample to be processed by the segmenting neural network.
  • Semantic segmentation of an image represented by a pixel matrix serves to assign regions of the image - or more precisely: each pixel in a respective segment of an image - to recognized objects.
  • Semantic segmentation is commonly performed by convolutional neural networks (CNN), in particular fully convolutional neural networks (FCN).
  • Such convolutional neural networks are trained as multi-class classifiers that can detect objects they are trained for in an image.
  • a fully convolutional neural network used for semantic segmentation typically comprises a plurality of convolutional layers for detecting features (i.e. occurrence of objects the CNN is trained for) and pooling layers for down-sampling the output of the convolutional layers in a certain stage. Layers are composed of nodes. Nodes have an input and an output.
  • the input of a node can be connected to some or all outputs of nodes in an anteceding layer thus receiving output values from the outputs of nodes of the anteceding layer.
  • the values a node receives via its inputs are weighted and the weighted inputs are summed up to thus form a weighted sum.
  • the weighted sum is transformed by an activation function of the node into the output of that node.
  • the weights in the nodes of the layers of the neural network are modified until the neural network provides the desired or expected prediction.
  • the neural network thus learns to predict (e.g. recognize) classes of objects or features it is trained for.
  • a trained neural network implements a model.
  • Convolution is performed by means of filter kernel arrays that have a much smaller dimension than the pixel matrix representing the image.
  • the filter kernel arrays are composed of array elements having weight values.
  • a filter kernel array is stepwise moved over the image pixel matrix and each value of a pixel of the image pixel matrix is element wise multiplied with the weight value of the respective element of the filter kernel matrix while the filter kernel matrix “moves over” the image pixel matrix, thus convolving the image pixel matrix.
  • a plurality of filter kernel arrays are used to extract different low level features.
  • the convoluted output from such convolution in a convolutional layer may be fed as input to a next convolutional layer and again be convoluted by means of filter kernel arrays.
  • the convoluted output is called a feature map. For instance, for each color channel of a color image, a feature map can be created.
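  • The stepwise convolution described above can be illustrated with a minimal sketch (plain NumPy; the function and the toy kernel are illustrative assumptions, not part of the disclosed system):

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Slide a small filter kernel over the image and compute the element-wise
    weighted sum at each position ('valid' convolution, as CNN layers do)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)   # weighted sum of the patch
    return feature_map

# toy example: a 3x3 vertical-edge kernel applied to a random 8x8 single-channel image
image = np.random.rand(8, 8)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
print(convolve2d(image, kernel).shape)  # (6, 6) feature map
```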
  • the convoluted output of a convolution layer can be rectified by means of a nonlinear function, for instance a ReLU (Rectified Linear Unit) operator.
  • the ReLU operator is a preferred activation function of the nodes of the convolutional layer, which eliminates all negative values from the output of the convolutional layer.
  • the non-linear function enables the network to learn non-linear relations.
  • ReLU(z) = max(0, z); cf. https://www.deeplearningbook.org/contents/mlp.html, Eq. 6.37.
  • the ReLU operators are part of the respective convolutional layer. In order to reduce the dimension of the feature map by way of down-sampling, pooling layers are used. One way of down-sampling is called Max-Pooling.
  • each 2*2 sub-array of the feature map is replaced by a single value that corresponds to the maximum value of the four elements of the 2*2 sub-array.
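  • A minimal sketch of the ReLU activation and the 2*2 max-pooling described above (NumPy; illustrative only):

```python
import numpy as np

def relu(feature_map: np.ndarray) -> np.ndarray:
    # ReLU(z) = max(0, z): eliminates all negative activations
    return np.maximum(feature_map, 0.0)

def max_pool_2x2(feature_map: np.ndarray) -> np.ndarray:
    # replace each non-overlapping 2x2 sub-array by its maximum value
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))

fm = np.random.randn(6, 6)
print(max_pool_2x2(relu(fm)).shape)  # (3, 3) down-sampled feature map
```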
  • the down-sampled feature map can again be processed in a convolutional layer.
  • the feature map from the final convolutional layer or Pooling-layer is a score map.
  • For each object class the neural network is trained for, a feature score map is generated, and from the feature score maps a segment map is generated. If the neural network is for instance trained for five object classes (e.g. cars, traffic signs, human beings, street markings, street borders), then five feature score maps are generated.
  • the scores in one feature score map represent, for one object class the fully convolutional network is trained for, the likelihood that an object of this object class is represented in the input image matrix.
  • the objects represented in an input image pixel matrix are "recognized" and a feature score map is generated for each object class wherein for each (down-sampled) pixel a score is formed that indicates the likelihood that a pixel represents an object of the object class.
  • the scores correspond to activation levels of elements of the feature score map.
  • the scores of the elements of the feature score maps can be compared with each other in order to assign each element or pixel, respectively to one of the known object classes.
  • a segment map can be generated wherein the elements are labeled with labels indicating to which of the object classes the segmenting neural network is trained for an individual pixel may belong.
  • each feature score map can be normalized on a scale between 0 and 1; cf. https://www.deeplearningbook.org/contents/mlp.html, Eq. 6.29.
  • the normalized scores for corresponding elements of the feature score maps can be compared and the element can be assigned to that object class that corresponds to the score map having the highest score for the respective element. This is done by means of a maximum likelihood estimator using an Argmax function; cf. https://www.deeplearningbook.org/contents/ml.html, Eq. 5.56.
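  • The normalization and maximum likelihood (Argmax) step can be sketched as follows (NumPy; shapes and names are illustrative assumptions):

```python
import numpy as np

def segment_from_score_maps(score_maps: np.ndarray) -> np.ndarray:
    """score_maps: array of shape (num_classes, H, W) with per-class activation levels.
    Returns a segment map of shape (H, W) holding the winning class index per element."""
    # softmax over the class axis normalizes each element's scores to the range 0..1
    exp = np.exp(score_maps - score_maps.max(axis=0, keepdims=True))
    probs = exp / exp.sum(axis=0, keepdims=True)
    # argmax acts as the maximum likelihood estimator: each element is assigned
    # to the object class whose feature score map has the highest normalized score
    return probs.argmax(axis=0)

score_maps = np.random.randn(5, 4, 4)   # e.g. five object classes
segment_map = segment_from_score_maps(score_maps)
```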
  • A segment map can thus be generated from the feature score maps by way of comparing the scores for corresponding elements in the feature score maps.
  • the score for each element of a feature score map for a respective object class represents an activation level of that element.
  • the final feature score maps are up-sampled to assign to each pixel of the input pixel matrix a score that indicates the likelihood that a pixel represents an object of a known object class.
  • Such up-sampling can be achieved by means of bilinear interpolation.
  • up-sampling can be achieved by a decoder.
  • all architectures are encoder-decoder architectures, where down-sampling and up-sampling mean the process of learning simple, more abstract features from pixels in the first convolutional layer, learning complex features from simple features in the next layer, and so on, and learning the same process vice versa in the decoder.
  • the up-sampling by means of bilinear interpolation is done because the size of the feature score maps of the last layer is not necessarily the same as the size of the input image, and - at least in training - the size of the output needs to match the size of the labels which have the same size as the input.
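  • Assuming a PyTorch implementation, the bilinear up-sampling of the coarse feature score maps to the input resolution might look like this (sizes are illustrative):

```python
import torch
import torch.nn.functional as F

# assume coarse score maps of shape (batch, num_classes, h, w) from the last layer
coarse_scores = torch.randn(1, 5, 32, 64)

# up-sample to the input image resolution so that output, labels and input match in size
full_res_scores = F.interpolate(coarse_scores, size=(256, 512),
                                mode="bilinear", align_corners=False)
segment_map = full_res_scores.argmax(dim=1)   # shape (1, 256, 512)
```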
  • a segment map is generated from the feature score maps. To each element of the segment map a label is assigned that indicates for which of the known object classes the highest activation level is found in the feature score maps.
  • the output of a final convolutional layer, ReLU-layer or pooling-layer can be fed into a fully connected layer that produces an output vector wherein the values of the vector elements represent a score that indicates the likelihood that a respective object is present in the analyzed image.
  • the output vector of such classifying CNN is thus a feature vector.
  • a neural network for semantic segmentation does not have such a fully connected layer because this would destroy the information about the location of objects in the input image matrix.
  • the neural network needs to be trained by means of training data sets comprising image data and labels that indicate what is represented by the image data (ground truth).
  • the image data represent the input image matrix and the labels represent the desired output.
  • the weights and the filter kernel arrays are iteratively adapted until the difference between the actual output of the CNN and the desired output is minimized.
  • the difference between the actual output of the CNN and the desired output is calculated by means of a loss function. From a training dataset, containing pairs of an input image and a ground truth image that consists of the correct class labels, the neural network computes the class predictions for each pixel in the output image.
  • a loss function compares the input class labels with the predictions made by the neural network and then pushes the parameters - i.e. the weights in the nodes of the layers - of the neural network in a direction that would have resulted in a better prediction.
  • the neural network will learn the abstract concepts of the given classes.
  • semantic segmentation results in assigning pixels of an image to known objects, i.e. objects the neural network was trained for.
  • Mere semantic segmentation cannot discriminate between several instances of the same object in an image. If, for instance, in an image several cars are visible, all pixels belonging to a car are assigned to the object "car". The individual cars - i.e. instances of the object "car" - are not discriminated. Discriminating instances of objects of an object class requires instance segmentation. Pixels that do not belong to an object that a neural network is trained for will nevertheless be assigned to one of the object classes the neural network was trained for, and the score might even be high. Even with low scores, the object class with the highest relative score would "win", i.e. the pixel will be assigned to the object class with the highest score.
  • a perception system comprising a segmenting neural network and an uncertainty detector.
  • the perception system can be part of a vehicle, for instance a car, to enable autonomous driving.
  • the perception system can also be part of various autonomous machines, for instance robots, cargo systems or other machines that are designed to operate at least in part autonomously.
  • the segmenting neural network is configured and trained for segmentation of an input image pixel matrix to thus generate a segment map composed of elements that correspond to the pixels of the input image pixel matrix.
  • each element of the segment map is assigned to one of a plurality of object classes the segmenting neural network is trained for by way of class prediction.
  • Elements being assigned to the same object class form a segment of the segment map.
  • the uncertainty detector is configured to generate an uncertainty score map (also called "uncertainty map") that is composed of elements that correspond to the pixels of the input image pixel matrix.
  • Each element of the uncertainty map has an uncertainty score that is determined by the uncertainty detector and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map.
  • the uncertainty detector is configured to access feature score maps generated by the segmenting neural network prior to generating the segment map from said feature score maps.
  • the uncertainty detector is configured to determine the amount of uncertainty and thus the uncertainty score for each element of the uncertainty score map by determining a variance of the activation levels of elements of the feature score maps.
  • the uncertainty detector is configured to determine the amount of uncertainty and thus the uncertainty score for each element of the uncertainty score map by determining an inter-sample variance of the activation levels of an element of a feature score map in different samples of the feature score map.
  • the uncertainty detector is configured to generate inter-sample variances of the activation levels for an element of a feature score map in different samples of the feature score map by processing an input image pixel matrix with the segmenting neural network in multiple passes while for each pass the segmenting neural network is randomly modified.
  • the random modification of the segmenting neural network includes at least one of dropping nodes from a convolutional layer of the segmenting neural network, altering activation functions in at least some nodes of at least some layers of the segmenting neural network and/or introducing noise in at least some nodes of at least some layers of the segmenting neural network.
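  • A minimal sketch of such multi-pass processing with random modification (here: Monte-Carlo dropout in PyTorch, one of the options listed above; reducing the inter-sample variance to one score per pixel by averaging over the class axis is an illustrative assumption):

```python
import torch

def mc_dropout_uncertainty(model: torch.nn.Module, image: torch.Tensor,
                           num_passes: int = 20):
    """Process one input image pixel matrix in multiple passes while the network is
    randomly modified (dropout kept active), and derive an uncertainty map from the
    inter-sample variance of the activation levels. Assumes the model outputs
    per-class score maps of shape (..., num_classes, H, W)."""
    model.train()  # keep dropout layers active so each pass drops random nodes
    with torch.no_grad():
        samples = torch.stack([model(image) for _ in range(num_passes)])
    mean_scores = samples.mean(dim=0)
    # variance over the passes, averaged over the class axis -> one score per pixel
    uncertainty_map = samples.var(dim=0).mean(dim=-3)
    segment_map = mean_scores.argmax(dim=-3)
    return segment_map, uncertainty_map
```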
  • the uncertainty detector is configured to determine an inter-sample variance of the activation levels of corresponding elements of feature score maps that are generated from consecutive image pixel matrices corresponding to frames of a video sequence. Since in this case the samples are frames of a video sequence, the inter-sample variances correspond to variances between frames and thus are also called "inter-frame variance" hereinafter.
  • For determining inter-frame variance, one pass of determining inter-sample variance from n consecutive images will process frames a..a+n, and then restart with frames a+n+1...a+2n.
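  • A sketch of computing inter-frame variance over windows of n consecutive frames (NumPy; assumes the frames' score maps have already been aligned pixel-wise):

```python
import numpy as np

def inter_frame_uncertainty(score_map_frames: list[np.ndarray]) -> np.ndarray:
    """score_map_frames: feature score maps of shape (num_classes, H, W) for n
    consecutive frames of a video sequence, with corresponding pixels already
    mapped onto each other. Returns one uncertainty score per pixel as the
    inter-frame variance of the activation levels, averaged over the classes."""
    stack = np.stack(score_map_frames)        # (n, num_classes, H, W)
    return stack.var(axis=0).mean(axis=0)     # (H, W)

def windowed_uncertainty(all_frames: list[np.ndarray], n: int):
    # process one window of n frames, then restart with the next n frames
    return [inter_frame_uncertainty(all_frames[i:i + n])
            for i in range(0, len(all_frames) - n + 1, n)]
```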
  • the uncertainty detector is configured to determine the amount of uncertainty and thus the uncertainty score for each element of the uncertainty score map by determining inter-class variances between activation levels of corresponding elements of feature score maps for different object classes as provided by the segmenting neural network.
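  • One possible (assumed) reduction of the inter-class variance to an uncertainty score is sketched below: if the activation levels of the different classes lie close together, no class clearly wins, so a low inter-class variance is mapped to a high uncertainty score. This mapping is an illustrative choice, not necessarily the one used by the uncertainty detector:

```python
import numpy as np

def inter_class_uncertainty(score_maps: np.ndarray) -> np.ndarray:
    """score_maps: per-class feature score maps, shape (num_classes, H, W).
    Low variance across classes (no clear winner) -> high uncertainty."""
    var_over_classes = score_maps.var(axis=0)      # (H, W)
    return 1.0 / (1.0 + var_over_classes)          # maps high variance to low uncertainty
```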
  • the uncertainty detector comprises a generative neural network, in particular a variational autoencoder that is trained for the same classes or objects, the segmenting neural network is trained for.
  • the uncertainty detector is configured for detecting a region in the uncertainty score map that is composed of elements having a high uncertainty score and for labeling this region as a candidate for representing an object of a yet unknown object class.
  • High uncertainty scores can be uncertainty scores that are higher than an average uncertainty score of all elements of the uncertainty score map.
  • High uncertainty scores can be uncertainty scores that are higher than a median of the uncertainty scores of all elements of the uncertainty score map.
  • High uncertainty scores can be uncertainty scores that exceed an average uncertainty score or a median of the uncertainty scores of all elements of the uncertainty score map by a predetermined amount.
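  • A sketch of extracting candidate regions from the uncertainty score map, using one of the thresholds listed above (mean plus a predetermined margin) and connected-component labeling via SciPy; the minimum region size is an illustrative assumption:

```python
import numpy as np
from scipy import ndimage   # for connected-component labeling

def candidate_unknown_regions(uncertainty_map: np.ndarray,
                              margin: float = 0.0,
                              min_pixels: int = 50):
    """Label regions of 'high' uncertainty, here defined as scores exceeding the
    average uncertainty score by a predetermined margin."""
    threshold = uncertainty_map.mean() + margin
    mask = uncertainty_map > threshold
    labelled, num_regions = ndimage.label(mask)      # connected components
    regions = []
    for region_id in range(1, num_regions + 1):
        region_mask = labelled == region_id
        if region_mask.sum() >= min_pixels:          # ignore tiny speckles
            regions.append(region_mask)              # candidate for an unknown object
    return regions
```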
  • the uncertainty detector is configured for verifying that a region that is labeled as a candidate for representing an object of a yet unknown object class indeed represents such an object, by determining the plausibility that a segment of the input image pixel matrix corresponding to the labeled region represents an unknown object.
  • the perception system is configured to reconfigure the segmenting neural network by adding a model for a new object class for instance by way of introducing or configuring further layers with weights that represent the new object class, wherein the new object class includes the so far unknown object represented in the segment found to represent a so far unknown object.
  • Creating a new segmenting neural network that is configured for predicting new object classes (i.e. one that incorporates a model for a new object class, said model being created by training of the neural network) or reconfiguring the existing segmenting neural network so as to enable it to predict objects of an additional new object class can be part of a domain adaptation that extends an operation domain of the perception system.
  • Reconfiguring the existing segmenting neural network preferably includes the training of the segmenting neural network with input image pixel matrices representing the so far unknown object in combination with a label for the then new object class (ground truth for the so far unknown and then new object class). Labels can be generated automatically for a segment representing an unknown object.
  • the existing segmenting neural network, or preferably a second, similar companion neural network, is trained with the newly determined yet unknown object class, either in the cloud (by uploading the newly detected object class and, after training, downloading the trained neural network or trained similar companion neural network) or, for instance, right on the perception system of an autonomous vehicle.
  • the newly detected object class can for instance be directly shared with other autonomous vehicles, so each vehicle can train its segmenting neural network or similar companion neural network on a larger number of yet unknown object classes without having to depend on an update from the cloud.
  • Such training over multiple autonomous vehicles can be parallelized by means of implementing a distributed data parallel training (as explained in: PyTorch Distributed: Experiences on Accelerating Data Parallel Training, https://arxiv.org/pdf/2006.15704.pdf). Any such training will preferably be conducted by means of few-shot learning (as explained in: An Overview of Deep Learning Architectures in Few-Shot Learning Domain, https://arxiv.org/pdf/2008.06365.pdf).
  • the assignment of individual yet unknown objects to new object classes can be automated by their similarity, preferably determined by means of unsupervised segmentation (as explained in: Unsupervised Learning of Image Segmentation Based on Differentiable Feature Clustering, https://arxiv.org/pdf/2007.09990.pdf).
  • a newly created model for a new object class can be associated automatically to already existing models for known object classes by similarity, by way of determining the similarity by means of one-shot Learning (as explained in: Fully Convolutional One-Shot Object Segmentation for Industrial Robotics, https://arxiv.org/pdf/1903.00683.pdf).
  • In one-shot learning, the elements of the abstract feature vector of the neural network which conducts the one-shot learning have a semantic structure, e.g. the feature vector might consist of textual descriptions of the objects, in which e.g. the description of a rickshaw would be similar to the description of a car; thus, the newly determined object class for a rickshaw can be treated as car-like by the autonomous vehicle without the need of manual association.
  • a method for semantic segmentation of input image pixel matrices by way of class prediction performed by a segmenting neural network, and for determining an amount of uncertainty involved in the class predictions for each pixel of an input image pixel matrix, comprises: segmenting an input image pixel matrix by means of a segmenting neural network that is trained for a plurality of object classes and that generates for each object class a feature score map and therefrom a segment map for the input image pixel matrix, by assigning elements of the segment map to one of a plurality of object classes the segmenting neural network is trained for by way of class prediction, elements being assigned to the same object class forming a segment of the segment map; and generating an uncertainty score map (short: uncertainty map) composed of elements that correspond to the pixels of the input image pixel matrix, each element of the uncertainty map having an uncertainty score that is determined by the uncertainty detector and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map.
  • the method further comprises detecting a region in the uncertainty score map that is composed of elements having a high uncertainty score and labeling this region as a candidate for representing an object of a yet unknown object class, a high uncertainty score being an uncertainty score that is higher than an average uncertainty score of all elements of the uncertainty score map.
  • the method further comprises creating a new object class if a region that is composed of elements having a high uncertainty score is detected, said new object class representing objects as shown in a region of the input image pixel matrix corresponding to the regions in the uncertainty score map that are composed of elements having a high uncertainty score.
  • the method for generating new object classes thus may include: detecting a region in the uncertainty score map that is composed of elements having a high uncertainty score; generating a tentative new object class and automatically generating a label; and recognizing an existing object class or another new object class that is similar to the tentative new object class, for instance by means of few-shot learning.
  • the method may comprise generating a tentative new object class in case an unknown object is detected. If a further unknown object is detected, the method comprises assigning the further unknown object to the existing tentative new object class or to a further new object class, depending on the similarity of the unknown objects. In case a further unknown object can be assigned to a previous unknown object (based on the similarity of the objects) a new object class (that is not tentative anymore) can be generated by one-shot learning or few-shot learning using samples that represent new objects that are similar to each other.
  • the method for generating new object classes may include: generating feature score maps from input image pixel matrices (samples) captured at various instances in time; detecting a region in the uncertainty score map that is composed of elements having a high uncertainty score, said region representing a first unknown object; generating a tentative new object class for said unknown object and automatically generating a label; detecting a further region in a further uncertainty score map that is composed of elements having a high uncertainty score, said region representing a further unknown object; determining a similarity between the first and the further unknown object; and, if the similarity exceeds a predetermined threshold, generating a non-tentative new object class.
  • generating a non-tentative new object class includes one-shot learning or few-shot learning using samples (i.e. input image pixel matrices) that represent the first unknown object and the further unknown object.
  • No explicit labels are needed; only automatically generated labels are used for generating new object classes (for instance by means of a new object detector) and for transfer learning of new object classes (for instance as generated by the new object detector).
  • a relevance of a new object class can be determined based on a user reaction in the context of encountering an unknown, potentially new object. This is explained in more detail further below.
  • User reaction in the context of encountering an unknown, potentially new object can be determined based on optical flow, an inertia sensor (gyroscope) or from signals on the CAN bus of a vehicle.
  • New object classes thus generated fully automatically can be used for training and thus updating the semantic segmentation model implemented by the segmenting neural network of the perception system.
  • a region that is composed of elements having a high uncertainty score can correspond to a part of a segment of the segment map. For instance, if a part of an input image pixel matrix represents a yet unknown object, the pixels of this part will be assigned to a known object class by the segmenting neural network. However, the pixels representing the unknown object (that does not correspond to any of the object classes the segmenting neural network was trained for) will typically exhibit a higher uncertainty score and thus the yet unknown object can be found by means of the uncertainty score map.
  • the uncertainty score is determined by determining an inter-class variance of the activation levels of elements of the feature score maps for different object classes.
  • the uncertainty score is determined by determining an inter-sample variance of the activation levels of elements of different samples of a feature score map for one object class.
  • a further aspect of the invention is the use of the method and its embodiments as illustrated herein in an autonomous machine, in particular in an autonomous vehicle.
  • a system for detection and management of uncertainty in perception systems comprises a data processing unit implementing a segmentation machine comprising a trained segmenting neural network for semantic segmentation of input image matrices.
  • the neural network comprises convolutional layers and is configured to perform a semantic segmentation of an input image pixel matrix.
  • the semantic segmentation is based on multiple object classes the neural network is trained for.
  • the system further comprises an uncertainty detector that is configured to determine an uncertainty score for each pixel, a group of pixels or a segment.
  • the uncertainty score reflects how (un-) certain it is that a pixel of a segment represents the object the segment is assigned to by the segmenting neural network.
  • If a low level of certainty (i.e. a high level of uncertainty) is determined, the reason may be that the segment represents an object of a yet unknown object class the neural network of the segmentation machine was not trained for, i.e. an unknown object or object class, respectively.
  • the uncertainty detector can be configured to generate an uncertainty score map wherein for each pixel of the segment map (and thus, typically, for each pixel of the input image pixel matrix) an uncertainty score value is assigned.
  • the dimension of the uncertainty score map matches the dimension of the segment map generated by the segmenting neural network.
  • each element of the uncertainty score map directly corresponds to an element of the segment map and - preferably - also to one pixel of the input image pixel matrix.
  • the uncertainty detector is connected to the segmenting neural network for semantic segmentation of input image matrices.
  • the system comprises a segmenting neural network that generates a segment map for each input image pixel matrix and an uncertainty detector that generates an uncertainty score for each pixel of an input image pixel matrix and/or for each element of the segment map.
  • the uncertainty detector can be configured to determine uncertainty from the feature score maps provided by the segmenting neural network for different object classes by way of analyzing the inter-class variance of the activation levels (feature scores) of the pixels of a segment.
  • the uncertainty detector can be configured to determine a softmax confidence.
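  • The softmax confidence can, for example, be turned into an uncertainty score as one minus the winning class's softmax probability (a common heuristic, sketched here; not necessarily the exact formulation used by the uncertainty detector):

```python
import numpy as np

def softmax_confidence_uncertainty(score_maps: np.ndarray) -> np.ndarray:
    """score_maps: per-class feature score maps, shape (num_classes, H, W).
    Returns per-pixel uncertainty as 1 minus the maximum softmax probability."""
    exp = np.exp(score_maps - score_maps.max(axis=0, keepdims=True))
    probs = exp / exp.sum(axis=0, keepdims=True)
    return 1.0 - probs.max(axis=0)      # low confidence -> high uncertainty
```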
  • An input image pixel matrix is processed by the segmenting neural network multiple times and each time the segmenting neural network is randomly modified thus providing varying feature score maps. Thus, a number of varying samples of the feature score map are generated.
  • the inter sample variations of the feature score map samples depend on the influence of the variation of the nodes on the scores in the score maps.
  • Processing an input image pixel matrix multiple times with the segmenting neural network while randomly modifying the segmenting neural network renders the segmenting neural network into a Bayesian neural network implementing variational inference.
  • multiple frames of a video sequence can be used as input image pixel matrices. Each frame of a video sequence is an input image pixel matrix.
  • the segmenting neural network is randomly modified.
  • the system is further configured to: use the segmenting neural network to generate samples of semantically segmented image pixel matrices; generate inter-frame variances from a sensor data stream (i.e. a video stream or a video sequence) consisting of frames, i.e. map corresponding pixels of consecutive frames and take each or every n-th consecutive frame as a sample instead of sampling each frame multiple times; analyze inter-class and/or inter-frame variances in the per-pixel activation levels with respect to the individual object classes the segmenting neural network was trained for, i.e. analyze the per-pixel scores for the individual object classes; determine uncertainty scores by assessing a level or amount of uncertainty based on the analysis of the inter-class and/or inter-frame variances in the per-pixel activation levels in the semantically segmented image pixel matrix; and identify segments showing unknown objects from the determined uncertainty.
  • Mapping of corresponding pixels of consecutive frames of a video sequence can include a determination of a displacement of corresponding pixels between two frames, e.g. based on position sensor data that can be gathered for instance by means of an inertia sensor.
  • a movement of a video camera providing a video sequence can be determined, enabling a determination of displacement vectors for the pixels between two frames that shall be used for determining inter-frame variance.
  • the system is configured to use more than one source of uncertainty at the same time, e.g. the inter-class uncertainty and the inter-frame uncertainty, and to merge these uncertainties, e.g. by means of a weighted sum where the weights depend on the number of samples already taken. For instance, using only the first 2 or 3 frames for determining inter-frame variance might not yield a reliable result yet, so for frames 1...m in the sequence only the inter-class variance might be used, and for frames m+1...n the weight on the inter-frame variance can be increased and the weight on the inter-class variance in the weighted sum of both can be decreased.
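  • A sketch of such a weighted merge, with the inter-frame weight ramped up as more frames have been sampled (the linear ramp and the values of m and n are illustrative assumptions):

```python
import numpy as np

def merged_uncertainty(inter_class_u: np.ndarray, inter_frame_u: np.ndarray,
                       frames_seen: int, m: int = 3, n: int = 10) -> np.ndarray:
    """Weighted sum of two per-pixel uncertainty maps; the weight of the
    inter-frame uncertainty grows with the number of frames already taken
    (requires m < n)."""
    if frames_seen <= m:                   # too few frames: inter-frame variance unreliable
        w_frame = 0.0
    else:                                  # ramp the inter-frame weight up towards 1
        w_frame = min(1.0, (frames_seen - m) / float(n - m))
    w_class = 1.0 - w_frame
    return w_class * inter_class_u + w_frame * inter_frame_u
```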
  • a Bayesian filter, e.g. a particle filter, can be used to merge these uncertainties; instead of a weight, a confidence score in the uncertainty scores for each of both types of uncertainty (inter-class uncertainty and inter-frame uncertainty) can be provided, and the Bayesian filter would yield a confidence score in the resulting uncertainty.
  • the system is further configured to discriminate different unknown objects through analyzing uncertainty.
  • the system is further configured to create new object classes based on an analysis of the uncertainty of pixels in segments that are determined to show unknown objects.
  • the segmenting neural network preferably is a fully convolutional network (FCN).
  • the uncertainty can be aleatoric or epistemic (systemic).
  • Aleatoric uncertainty arises from statistical variations in the environment. Aleatoric uncertainty often occurs at the border of segments in a segment map.
  • Epistemic uncertainty (also known as systemic uncertainty) results from a mismatch in a model, for instance a model as implemented by a neural network.
  • Uncertainty can be determined and quantified by analyzing the amount of variance of activation levels (i.e. scores in the feature score maps) over multiple samples of the feature score maps that are generated by the segmenting neural network from one input image pixel matrix or a number of similar input image pixel matrices. For each segmentation (i.e. each pass) the segmenting neural network is modified to create variance in the segmenting neural network so the segmenting neural network becomes a Bayesian neural network that implements variational inference. In a Bayesian neural network the input image matrix is processed in a plurality of passes wherein for each pass some nodes of the CNN are dropped out. This technique is known as variational inference.
  • the activation levels vary from pass to pass.
  • This variance of the scores (i.e. activation levels) reflects the amount of uncertainty.
  • Gaussian noise can be applied to the signals at the input nodes, the weights, or the activation functions for variational inference; see the thesis of Yarin Gal, http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf.
  • Variance of activation levels can also arise from using a sequence of pictures of the same scene taken from different locations or at different points of time.
  • Such sequence can be frames of a video sequence recorded by a video camera of a moving vehicle. This aspect is referred to as "inter-frame variance" as mentioned earlier in this text.
  • If the neural network is not a Bayesian neural network, uncertainty can be determined by determining the amount of variance or a pattern of variance of the activation levels. Uncertainty can also be determined via a reconstruction error of a generative neural network.
  • Unknown objects are objects the neural network is not trained for or that the neural network does not recognize.
  • level or "amount of uncertainty of a pixel” a degree of uncertainty is meant that is determined by analyzing the variance of the activation levels (i.e. scores) of the elements of the score maps for the object classes.
  • the elements (i.e. each single element) of the feature score maps (arrays) for the object classes each relate to one or more pixels of the input image matrix. Accordingly, the amount of uncertainty - and thus the uncertainty scores - that can be determined from the activation levels (i.e. scores) of the elements of the score maps is also the amount of uncertainty of the image pixel(s) that correspond to the respective elements of the score maps.
  • Preferred ways of determining uncertainty are:
  • the uncertainty scores can be determined based on the amount of the variance or of a pattern of the variance, or of the kind of variance (i.e. epistemic or aleatoric) or on the temporal variation of the variance of the activation levels (i.e. scores of the feature score maps) or of a combination of these parameters.
  • the uncertainty determined from the activation levels can be mapped on the input image pixel matrix according to the size of the inputs or the size of the segment map.
  • the existence of a segment that represents an unknown object can be determined based on the distribution of the amount of uncertainty (i.e. the distribution of the uncertainty scores in the uncertainty score map) assigned to pixels of the input image matrix wherein the amount of uncertainty (i.e. the uncertainty score of each element/pixel) is determined based on the variance of the activation levels of elements of the feature score maps that relate to the pixels of the input image matrix.
  • An image segment representing an unknown object can be determined by determining a cluster of pixels with a high uncertainty score (i.e. for which a high amount of uncertainty is found) or by determining contiguous pixels having a similar amount of uncertainty (and thus a similar uncertainty score).
  • the amount of uncertainty and thus the uncertainty score of a pixel is determined by determining the variance of the activation levels of those elements of the feature score maps that correspond to the pixel in the input image pixel matrix.
  • the size of a segment representing an unknown object can be determined from the size of a cluster of pixels with a high uncertainty score or by the length of the outline of a field of contiguous pixels having a similar uncertainty score.
  • the position of a segment representing an unknown object can be determined from the position of a cluster of pixels having a high uncertainty score or by determining the position of the outline of contiguous pixels exhibiting a similar uncertainty score.
  • the relative movement of a segment representing an unknown object can be determined from the temporal change of the position of a cluster of pixels having a high uncertainty score or of the position of an outline of contiguous pixels exhibiting a similar uncertainty score.
  • the relative distance between a camera generating the input pixel matrix and an unknown object represented in a segment can be determined from the temporal change of the variance of the level of activations.
  • a 3D position of the unknown object from a sequence of 2D input image pixel matrices can be computed.
  • the relative distance between a camera generating the input pixel matrix and an unknown object represented in a segment can be determined by an additional depth estimation model.
  • Since the segmenting neural network will assign every pixel to a known object class, unknown objects are not recognized by the segmenting neural network. However, pixels that represent a yet unknown object can be determined by means of analyzing uncertainty.
  • Ways to make it plausible that a segment of the input image matrix represents an unknown object, and thus ways to find a plausible confirmation that segments of pixels with a high amount of uncertainty represent an unknown object, are:
  • the plausibility of a segment representing an unknown object can be derived from the determined outline of the segment and its correlation with the segment prediction, i.e. the semantic segmentation, produced by the neural network. If an entire segment is comprised of pixels with a high uncertainty score (i.e. the majority of pixels within a segment outline) it is likely that the segment represents an unknown object.
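  • The plausibility check described above can be sketched as the fraction of pixels inside a predicted segment outline that carry a high uncertainty score (illustrative only; the example threshold of 0.5 is an assumption):

```python
import numpy as np

def unknown_object_plausibility(segment_mask: np.ndarray,
                                high_uncertainty_mask: np.ndarray) -> float:
    """Fraction of pixels inside a predicted segment outline that carry a high
    uncertainty score. If the majority of pixels within the outline are highly
    uncertain, the segment plausibly represents an unknown object."""
    pixels_in_segment = segment_mask.sum()
    if pixels_in_segment == 0:
        return 0.0
    overlap = np.logical_and(segment_mask, high_uncertainty_mask).sum()
    return overlap / float(pixels_in_segment)

# e.g. treat the segment as a plausible unknown object if the ratio exceeds 0.5
```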
  • the plausibility of a segment representing an unknown object can be determined from the detected outline of the segment and its correlation with an alternative segmentation of the input image matrix, for instance by means of an auto-encoder.
  • the plausibility of a segment representing an unknown object can be determined from the temporal variation of the variance of the activation levels.
  • the plausibility of a segment representing an unknown object can be determined by means of a comparison of that segment with another segment representing an unknown object found by a coexisting, redundant subsystem for semantic segmentation.
  • the plausibility of a segment representing an unknown object can be determined based on a comparison with another segment representing an unknown object found by a different system for semantic segmentation that uses an input image matrix representing the same scenery taken from a different perspective.
  • Such comparison includes a transformation of the two segments representing an unknown object in a common, e.g. global coordinate system.
  • a system comprising a primary segmentation system and a separate autonomous device.
  • the primary segmentation system comprises a primary perception system comprising a segmenting neural network that is configured and trained for segmentation of an input image pixel matrix to thus generate a segment map.
  • Each element of the segment map is assigned to one of a plurality of object classes the segmenting neural network is trained for by way of class prediction. Elements that are assigned to the same object class form a segment of the segment map.
  • the primary segmentation system may be part of an autonomous driving system (ADS) and typically does not comprise an uncertainty detector.
  • the separate, autonomous device comprises a sensor for generating an input image pixel matrix comprised of matrix elements, a segmenting neural network and an uncertainty detector.
  • the uncertainty detector is configured to generate an uncertainty score map composed of matrix elements that correspond to the pixels of the input image pixel matrix. Each matrix element of the uncertainty map has an uncertainty score that is determined by the uncertainty detector and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map.
  • the segmenting neural network of the autonomous device preferably is trained for the same object classes as the segmenting neural network of the primary system.
  • the autonomous device preferably further comprises signaling means that are configured to generate a user perceivable signal in case the uncertainty map generated by the uncertainty detector of the autonomous device comprises regions comprised of matrix elements that exhibit an uncertainty score above a threshold and thus represent edge cases for the object classification and image segmentation. Edge cases may represent yet unknown objects and thus are candidates for new object classes.
  • the autonomous device can determine whether or not the segmentation achieved by the primary perception system is reliable. In particular, the autonomous device can determine whether the segmentation score map generated by the segmenting neural network of the primary segmentation system contains regions that represent edge cases, for instance objects the primary segmentation system was not trained for.
  • the autonomous device preferably comprises a new object detector that is operatively connected to the uncertainty detector of the autonomous device.
  • the new object detector is configured to find a region in the uncertainty score map that is composed of elements having a high uncertainty score and generate a new object class for such found region.
  • the autonomous device can be configured to exchange labeled data sets comprising segments assigned to a newly generated object class with other autonomous devices to thus increase the number of known object classes available for semantic segmentation by the autonomous devices.
  • the label for a newly generated object class may be generated automatically.
  • the autonomous device may be configured to record a reaction in response to a warning signal emitted by the autonomous device.
  • the type of reaction can be used as input data for discriminating relevant unknown objects from less relevant unknown objects. A significant reaction indicates a high relevance of an unknown object.
  • the autonomous device can be configured for learning to discriminate unknown objects by automatically generating new object classes. Furthermore, the autonomous device can be configured for learning a relevance level for each (new) object class and thus can adapt the warning signal level to the object class. If the autonomous device determines the presence of an object (by way of image segmentation) in a region of interest of the input image pixel matrix, the warning signal can be generated in dependence of the relevance of the recognized object.
  • New object classes may be labeled automatically such that the label represents the relevance level. For instance, the relevance level may be used as label for each new object class.
  • the autonomous device can be configured to exchange data sets comprising data representing the relevance level of a known object class or an observed user reaction (observed behavior) with other autonomous devices.
  • the observed user reaction can be used for automatically generating a label for a new object class.
  • the autonomous device may comprise a data interface, in particular a wireless data interface that allows data exchange over e.g. the internet.
  • the autonomous device preferably is a pocketable, mobile device that a user can easily carry and that can easily be mounted, for instance, to a windshield of a vehicle.
  • the autonomous device preferably is mounted in a position where the autonomous device's viewing angle at least in part corresponds to the viewing angle of the sensor 12 or sensors of the autonomous driving system.
  • the autonomous device may comprise a segmenting neural network that implements a semantic segmentation model, which is preferably continuously trained with the output of the new object detector as input to thus enable the neural network to predict new objects it has encountered before.
  • Fig. 1 is a schematic overview of a perception system providing output for other systems, e.g. a sensor fusion system;
  • Fig. 2 is a schematic representation of a neural network suitable for semantic segmentation
  • Fig. 3 is a schematic illustration of how an object class is assigned to a pixel of the input image pixel matrix depending on the scores in the score maps of the respective object classes
  • Fig. 4 is a schematic overview of a data processing unit for use in a perception system according to figure 1 including an uncertainty detector according to the invention
  • Fig. 5 illustrates training of a neural network based on input data sets comprising images and labels (i.e. desired output, ground truth);
  • Fig. 6 illustrates prediction of a neural network based on an input data set
  • Fig. 7 shows an image as represented by an input data set for the semantic segmentation neural network
  • Fig. 8 shows the prediction, i.e. the semantic segmentation provided by the neural network for the image in figure 7;
  • Fig. 9 shows regions and segments wherein the pixels have a high amount of uncertainty for the image of figure 7;
  • Fig. 10 is a schematic illustration of a system comprising a perception subsystem, a planning subsystem and a control subsystem;
  • Fig. 11 illustrates an unknown object plausibility check based on two independent perception subsystems of one vehicle
  • Fig. 12 illustrates an unknown object plausibility check based on two independent perception subsystems of two vehicles
  • Fig. 13 shows the correlation of the shape of uncertainty with an unknown object
  • Fig. 14 illustrates the determination of a segment with an unknown object
  • Fig. 15 illustrates sourcing of a domain adaptation dataset
  • Fig. 16 illustrates uploading and integration of a domain adaptation dataset
  • Fig. 17 illustrates the determination of classes of unknown objects
  • Fig. 18 illustrates training of a neural network with a domain adaptation dataset to generate a new model that can process new object classes
  • Fig. 19 illustrates training and download of a model for a neural network
  • Fig. 20 illustrates an autonomous device that serves as a safety companion
  • Fig. 21 illustrates generation of a training data set for a new object
  • Fig. 22 illustrates a training data set for a new object
  • Fig. 23 illustrates sharing of a training data set for a new object
  • Fig. 24 illustrates training a segmenting neural network with a training data set for a new object
  • Fig. 25 illustrates the identification of false positive new objects
  • Fig. 26 illustrates a perception system with a secondary system and a primary system for object detection
  • Fig. 27 illustrates transfer learning from a segmenting neural network with an unknown structure
  • Fig. 28 illustrates the training of a variational autoencoder
  • Fig. 29 illustrates uncertainty detection by means of variational autoencoder
  • Fig. 30 illustrates one-shot learning of an uncertainty detector
  • Fig. 31 illustrates the use of multiple parallel redundant uncertainty detectors
  • Fig. 32 illustrates an alternative device for object recognition
  • Fig. 33 illustrates a segmentation model that can be part of an object recognition system
  • Fig. 34 illustrates the use case of action recognition
  • Fig. 35 illustrates a video action recognition model
  • Fig. 36 illustrates a temporal shift module;
  • Fig. 37 illustrates a depth estimation model;
  • Fig. 38 illustrates the use case of risk estimation;
  • Fig. 39 illustrates the use case of action anticipation
  • Fig. 40 illustrates an action forecast model
  • Fig. 41 illustrates the use case of edge case recognition
  • Fig. 42 illustrates cascaded monitoring concept
  • Fig. 43 illustrates a situation monitor
  • Fig. 44 illustrates an autoencoder model
  • Fig. 45 illustrates a convolutional submodule
  • Fig. 46 illustrates a convolutional submodule with dropout
  • Fig. 47 illustrates a Bayesian sampling module
  • Fig. 48 illustrates a Lipschitz submodule
  • Fig. 49 illustrates integration of the situation monitor with a Kalman filter
  • Fig. 50 illustrates the use case of similarity prediction
  • Fig. 51 is an overview of a multi-functional model
  • Fig. 52 illustrates an encoder of a multi-functional model
  • Fig. 53 illustrates segmentation by means of a multi-functional model
  • Fig. 54 illustrates video action recognition by means of a multi-functional model
  • Fig. 55 illustrates depth estimation by means of a multi-functional model
  • Fig. 56 illustrates an autoencoder realized with a multi-functional model
  • Fig. 57 illustrates action forecast by means of a multi-functional model
  • Fig. 58 illustrates how the system can be implemented following the Sense-Plan-Act concept
  • Fig. 59 illustrates a preferred sensor configuration.
  • a perception system 10 as shown in figure 1 comprises a camera 12 that records images by means of an image sensor (not shown).
  • the recorded images can be still images or sequences of images that form frames of a video stream.
  • the input data stream can also be from a LiDAR (stream of 3D point clouds) or from a radar.
  • the camera 12 generates an image pixel matrix that is fed to a data processing unit 14.
  • the data processing unit 14 implements a segmenting neural network 40 (see figures 2 and 3) for semantic segmentation of images.
  • the perception system can be integrated in various devices or machines, in particular in vehicles, for instance autonomous vehicles, e.g. cars.
  • One aspect is an integration of a perception system with an autonomous vehicle.
  • the system that implements the segmenting neural network is a perception system 10.
  • the output of the perception system 10 is provided to a sensor fusion system 24.
  • the neural network as implemented by the data processing unit 14 is defined by a structure of layers comprising nodes and connections between the nodes.
  • the neural network comprises an encoder part formed by convolution layers and pooling layers.
  • the convolutional layers generate output arrays that are called feature maps.
  • the elements of these feature maps (i.e. arrays) represent activation levels that correspond to certain features in the input image pixel matrix.
  • Features generated by one layer are fed to a next convolutional layer generating a further feature map corresponding to more complex features.
  • the activation levels of a feature map correspond to objects belonging to an object class the neural network was trained for.
  • the effect of the convolution in the convolutional layers is achieved by convoluting the input array with filter kernel arrays having elements that represent weights that are applied in the process of convolution. These weights are generated during training of the neural network for one or more specific object classes. Training of the neural network is done by means of training data sets comprising image data (input image pixel matrix 66, see figure 5) and labels 68 that indicate what is represented by the image data (ground truth). The image data represent the input image matrix 66 and the labels 68 represent the desired output. In a back propagation process, the weights and the filter kernel arrays are iteratively adapted until the difference between the actual output of the CNN and the desired output is minimized. The difference between the actual output (segment map 72) of the CNN and the desired output (labeled input image matrix 66) is calculated by means of a loss function 70.
  • From a training dataset containing pairs of an input image and a ground truth image that consists of the correct class labels, the neural network 40 computes the class predictions for each pixel in the output image, i.e. the segment map 72.
  • a loss function 70 compares the input class labels 68 with the predictions (i.e. the labels 68' in the segment map 72) made by the neural network 40 and then pushes the parameters - i.e. the weights in the nodes of the layers - of the neural network in a direction that would have resulted in a better prediction.
  • the neural network will learn the abstract concepts of the given classes; cf. figure 5.
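The following minimal sketch (not taken from the patent; the tiny network, the shapes and the optimizer settings are illustrative assumptions) shows how such a training step - prediction, loss between prediction and ground-truth labels, back propagation - is typically expressed in a PyTorch-style framework:

```python
# Illustrative training step for a segmenting network (stand-in for network 40);
# all shapes, the toy architecture and the hyperparameters are assumptions.
import torch
import torch.nn as nn

segnet = nn.Sequential(                     # toy encoder/decoder, 3 object classes
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 1),                    # one score map per object class
)
optimizer = torch.optim.SGD(segnet.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()             # plays the role of loss function 70

image = torch.rand(1, 3, 128, 256)          # input image pixel matrix 66
labels = torch.randint(0, 3, (1, 128, 256)) # per-pixel ground-truth labels 68

logits = segnet(image)                      # per-pixel class scores
loss = loss_fn(logits, labels)              # difference to the desired output
loss.backward()                             # back propagation
optimizer.step()                            # adapt weights towards a better prediction
optimizer.zero_grad()
```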
  • a trained neural network is thus defined by its topology of layers (i.e. the structure of the neural network), by the activation functions of the neural network's nodes, by the weights of the filter kernel arrays and potentially by the weights in summing-up nodes of layers such as fully connected layers (fully connected layers are used in classifiers).
  • the topology and the activation functions of a neural network - and thus the structure of the neural network - are defined by a structure data set.
  • the weights that represent the specific model a neural network is trained for are stored in a model data set.
  • the model data set must fit the structure of the neural network as defined by the structure data set.
  • At least the model data, in particular the weights determined during training, are stored in a file that is called "checkpoint".
  • the model data set and the structure data set are stored in a memory 16 that is part of or is accessible by the data processing unit 14.
  • the data processing unit 14 is further connected to a data communication interface 18 for exchanging data that control the behavior of the neural network, e.g. model data as stored in a model data set or training data as stored in a training data set used for training the neural network.
  • a display unit 20 may be connected to the data processing unit 14.
  • the segmentation will not be shown on a display unit but post-processed into an object list and then provided as input to a sensor fusion system, e.g. a Kalman filter.
  • Information of the absence or presence and - if present - the position of an image representation of an object in the input image pixel matrix can be derived from the segmented input image pixel matrix. This information can be encoded in data and can be used for control or planning purposes of further system components. Accordingly, an output data interface 22 is provided that is configured to pass on data indicating the absence or presence and the position of a recognized object in the input image pixel matrix. To the output data interface 22, a sensor fusion system 24 can be connected. The sensor fusion system 24 typically is connected to further perception systems not shown in the figure.
  • the sensor fusion system 24 receives input from various perception systems 10, each within or associated with one particular sensor such as a camera, a radar or a LiDAR.
  • the sensor fusion system 24 is implemented by a Bayesian filter, e.g. an extended Kalman filter, which processes the inputs of the perception systems 10 as measurements.
  • these inputs are not the segment maps, but post-processed object lists, derived from the segment maps.
  • the extended Kalman filter will associate each measurement with a measurement uncertainty value, which is usually configured statically, e.g. a sensor uncertainty value, configured according to a model of the particular sensor.
  • the uncertainty scores generated by the uncertainty detector 60 can be easily integrated by adding a model uncertainty to the measurements. This is how an uncertainty detector can be integrated with an autonomous vehicle.
Image segmentation
  • the trained segmenting neural network 40 has an encoder - decoder structure as schematically shown in figure 2.
  • the encoder part 42 is a fully convolutional network (FCN), for example a ResNet 101.
  • FCN fully convolutional network
  • Alternative implementations could be VGG-16 or VGG-19.
  • the decoder part 44 can be implemented as a DeepLab ASPP (atrous spatial pyramid pooling) module.
  • the segmenting neural network 40 is a fully convolutional network composed of convolutional layers 46, pooling layers 48, feature maps 50, score maps 52 (i.e. up-sampled feature maps) and one segment map 54.
  • the segmenting neural network 40 is configured and trained for three object classes. Accordingly, on each level, three convolutional layers 46 are provided that each generate a feature map for one object class. Each convolutional layer implements ReLU activation functions in the nodes. Pooling layers 48 reduce the size of the feature maps to thus generate an input for a next convolutional layer 46.
  • Feature maps 50 of the trained segmenting neural network each reflect the likelihood that an object of one object class is represented by according elements of the input image pixel matrix. Typically, the feature maps have a smaller dimension than the input image pixel matrix.
  • the feature maps are up-sampled and up-sampled feature maps 52 (herein also called score maps) are generated wherein the elements have score values that reflect the likelihood that a certain pixel represents an object of the object class the segmenting neural network was trained for.
  • a pixel is assigned to that object class showing the highest score for that pixel or element, respectively.
  • one segment map 54 is generated from three feature score maps 52.
  • the ReLU activation function applied is a customized ReLU function that allows introducing random dropping of nodes so the segmenting neural network 40 can act as Bayesian neural network.
  • Figure 3 illustrates that the feature maps 50 and 52 for each object class A, B or C are acting in parallel (and not in series as figure 2 might suggest). If, for instance, the neural network is trained for three objects, three score maps are generated. By means of a SoftMax function, the scores in each feature score map are normalized on a scale between 0 and 1. The normalized scores for corresponding elements of the feature maps can be compared and the element can be assigned to that object class that corresponds to the feature score map having the highest score for the respective element. This is done by means of a maximum likelihood estimator using an argmax function; see the sketch below.
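A minimal sketch of this normalization and assignment step (illustrative only; tensor shapes are assumptions):

```python
# Per-pixel softmax over the class dimension, then argmax as maximum likelihood estimator.
import torch

score_maps = torch.randn(3, 128, 256)        # score maps 52 for object classes A, B, C
probs = torch.softmax(score_maps, dim=0)      # normalize the scores to [0, 1] per pixel
segment_map = torch.argmax(probs, dim=0)      # per-pixel class index, i.e. segment map 54
```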
  • segment 56 is a segment, where the activation levels of the array elements are higher than the activation levels of corresponding array elements of the score maps for object classes A and B.
  • Segment 58 is a segment, where the activation levels of the array elements are higher than the activation levels of corresponding array elements of the score maps for object classes B and C.
  • each pixel in the segment map 54 is assigned to one object class of all the object classes the segmenting neural network 40 is trained for. Accordingly, there are no non-assigned elements or pixels.
  • Figure 3 might be misleading in that respect because figure 3 does not show that all pixels are assigned to a known object class and thus are part of a segment.
  • Figure 6 illustrates the generation of a segment map 72 from an input image 66 by means of a prediction performed by the segmenting neural network 40.
  • the segmenting neural network 40 is part of the data processing unit 14.
  • Figures 7 and 8 illustrate that all pixels in an input image are assigned to known objects.
  • if the input image 66' - and thus the input image pixel matrix 66' - comprises unknown objects like boxes 90 on the street, these unknown objects 90 are assigned to known objects such as street markings 92.
  • uncertainty scores on the level of pixels (per pixel) of the input image pixel matrix are created or generated. This can be achieved by provoking or determining forms of variance in the activation levels (scores) of the individual pixels with respect to the object classes.
  • the activation levels (i.e. the score) of the elements in the feature score maps for the individual object classes the neural network is trained for may vary, if frames of a video sequence are regarded or if the semantic segmentation is repetitively performed in multiple passes where varying nodes are dropped.
  • the variance can be temporal - i.e. between different passes or between different frames of a video sequence - or spatial, i.e. within an image pixel matrix, for instance at the edges of segments.
  • the variance can be achieved by setting the activation to zero with a random probability.
  • a dropout operation is inserted into the layers to which dropout shall be applied, so the inner architecture of the layer looks like convolution/dropout/non-linearity.
  • amounts of uncertainty can be determined and quantified by analyzing the amount of variance of activation levels (i.e. scores in the feature score maps) over multiple samples.
  • in a Bayesian neural network, i.e. for instance a convolutional neural network that is randomly modified, e.g. by randomly dropping nodes, the input image matrix is processed in a plurality of passes wherein for each pass some nodes of the convolutional neural network are dropped out.
  • This technique is known as variational inference. Due to the dropping of nodes in some passes, the activation levels vary from pass to pass resulting in an inter-sample per pixel variance. The higher this variance is, the higher is the amount of uncertainty and thus the uncertainty score.
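A sketch of this sampling procedure is given below; it assumes a PyTorch-style model that contains dropout layers, and the helper name is hypothetical:

```python
# Monte Carlo dropout sampling: the same input is passed through the network several
# times with dropout kept active, and the spread of the per-pixel class scores over the
# passes serves as an inter-sample variance signal.
import torch

def mc_dropout_samples(model, image, passes=20):
    model.train()                              # keep dropout active during prediction
    with torch.no_grad():
        samples = [torch.softmax(model(image), dim=1) for _ in range(passes)]
    return torch.stack(samples)                # shape: (passes, N, C, H, W)

# usage (a model with dropout layers is assumed):
# samples = mc_dropout_samples(segnet_with_dropout, image)
# per_class_variance = samples.var(dim=0)      # per-pixel, per-class inter-sample variance
```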
  • Inter sample variance of activation levels can also arise from using a sequence of pictures of the same scene taken at different points of time. Such sequence can be a frame of a video sequence recorded by a video camera of a moving vehicle. If the neural network is not a Bayesian neural network, per pixel uncertainty can be determined by determining the amount of inter-class variance or a pattern of variance of the activation levels (spatial variance).
  • Uncertainty can also be determined via a reconstruction error of a generative neural network, for instance by means of a variational autoencoder.
  • Unknown objects, i.e. objects the neural network is not trained for or does not recognize, can be found by means of an uncertainty map; the uncertainty map represents the uncertainty score for each pixel.
  • the uncertainty map is created by analyzing the activation levels of the matrix elements (corresponding to the image pixels) in the feature score maps for the different object classes the segmenting neural network is trained for.
  • the uncertainty score map can be created from analyzing the variance of the activation levels over a plurality of passes when applying e.g. Monte Carlo dropout to the segmenting neural network to thus cause variance between the samples generated with each pass (inter-sample variance).
  • an uncertainty detector may be provided for tracking the variance over the passes and generating an uncertainty map.
  • if unknown objects are represented by pixels of an input image pixel matrix, these pixels and thus the unknown objects represented by these pixels can be "found" by determining the uncertainty scores.
  • This can be used to assign segments of the input image pixel matrix to unknown objects and even further to discriminate between different unknown objects in such segment of the input image pixel matrix.
  • the basic steps the system according to the invention performs are: (i) using a segmenting neural network to generate samples of semantically segmented image pixel matrices; (ii) analyzing inter-class and/or inter-frame variances in the per-pixel activation levels with respect to the individual object classes the neural network was trained for, i.e. analyzing the per-pixel scores for the individual object classes; (iii) determining uncertainty by assessing a level or amount of uncertainty based on the analysis of the inter-class and/or inter-frame variances in the per-pixel activation levels in the semantically segmented image pixel matrix; and (iv) identifying segments showing unknown objects from the determined uncertainty.
  • the system further discriminates different unknown objects through analyzing uncertainty and optionally creates new object classes based on analysis of uncertainty of pixels in segments that are determined as to show unknown objects.
  • the data processing unit 14 comprises an uncertainty detector 60, see figure 4.
  • the uncertainty detector 60 can be configured to determine uncertainty from the feature score map 52 provided by the segmenting neural network 40 by way of analyzing the interclass variance of the values of the activation levels before the maximum likelihood estimator, i.e. before the argmax function is applied and the segment map 54 is generated. Before the maximum likelihood estimator, i.e. before the argmax function is applied and the segment map 54 is generated, each pixel has activation levels for each known class. By way of the argmax function, the pixel typically is assigned to the class with the highest activation level. Prior to applying the argmax function, interclass variances for each pixel can be determined.
  • the uncertainty detector 60 can comprise a generative neural network 62 implementing a generative model that is based on the same training dataset as the segmenting neural network 40 to reproduce the input image.
  • the generative neural network 62 preferably is a variational autoencoder.
  • the per pixel reconstruction loss between the input image and the image reconstructed by the generative neural network 62 corresponds to the uncertainty.
  • a higher reconstruction loss reflects a higher amount of uncertainty.
  • the uncertainty detector 60 is configured to implement a Bayesian neural network as means for generating inter-sample variance and thus an uncertainty score.
  • the uncertainty detector 60 is preferably configured to apply a customized ReLU function that allows introducing random dropping of nodes so the segmenting neural network 40 can act as a Bayesian neural network. Uncertainty can thus be determined by means of repeating the prediction several times (sampling) under variation of the segmenting neural network 40 (e.g. dropout of nodes) or insertion of noise into the weights or activation functions.
  • a tensor library 64 comprises the customized ReLU used for rendering the segmenting neural network 40 into a Bayesian neural network.
  • When in use, the data processing unit 14 receives input image pixel matrices 66 as an input. The data processing unit 14 generates a segment map 72 and an uncertainty score map 74 as outputs.
  • Figure 9 illustrates an uncertainty score map 74 generated from the input image shown in figure 7.
  • Figure 8 illustrates the segment map generated from the input picture shown in figure 7.
  • the pixels that represent the unknown objects 90, i.e. the boxes on the street, are assigned to various known objects. While this is correct in terms of the function of the segmenting neural network, from a user's perspective the segmenting result is wrong as far as the unknown objects are concerned.
  • Such "wrong" assignment can be detected by means of the uncertainty score of the pixels in the segments (prior to assigning the pixels to the known object classes, e.g. prior to applying the function for classifying each pixel).
  • the amount of variance can be determined by provoking and/or analyzing forms of variance in the activation levels (scores) of the individual elements of the feature score maps for the object classes the segmenting neural network is trained for.
  • the amount of variance can be determined between different samples of a feature score map for one object class, thus determining an inter-sample variance.
  • the amount of variance can be determined between feature score maps for different object classes, thus determining an inter-class variance.
  • the uncertainty detector can be configured to determine prediction uncertainty in various ways.
  • the per-pixel uncertainty scores that indicate prediction uncertainty can for instance be determined:
    - by means of the values of the activation levels before the maximum likelihood estimator, where the uncertainty detector compares the relative activation levels of respective pixels in the feature score maps, e.g. by means of a class-agnostic threshold on the activation levels, by means of class-specific thresholds on the activation levels, or by respective thresholds on the difference between the activation levels of classes, e.g. the top-2 activation levels, where a higher difference would indicate lower uncertainty and vice versa;
    - by means of training a generative model on the same training dataset, e.g. a variational autoencoder, to reproduce the input image, taking advantage of the fact that the variational autoencoder will fail to reproduce unknown object classes; a pixel-wise (per pixel) uncertainty score is determined from the reconstruction loss, where a higher reconstruction loss would indicate higher uncertainty;
    - by means of variational inference, e.g. realized by the application of Monte Carlo dropout in training mode and in prediction mode of the segmenting neural network in combination with sampling in prediction mode of the segmenting neural network, where the uncertainty detector takes advantage of the fact that unknown object classes (i.e. unknown objects in the input image pixel matrix) will result in higher pixel-wise inter-sample variance between the samples, and higher variance would indicate higher uncertainty and vice versa; or e.g. realized by the application of Gaussian noise to the values of the weights or the activation functions, with the same effect as Monte Carlo dropout.
  • the uncertainty detector can take e.g. the values of the pixel-wise activation levels in the feature score maps, the reconstruction loss values of the pixel-wise reconstruction loss of a variational autoencoder, or the values of the pixel-wise variance in variational inference, or a combination of these, for determining uncertainty scores for the elements of the uncertainty score map.
  • Lower activation levels indicate higher uncertainty scores.
  • a lower difference between the activation levels of the top-2 classes or a higher variance over all object classes indicates higher uncertainty scores.
  • a higher variance between the samples in variational inference indicates higher uncertainty scores.
  • in variational inference there are variances per pixel per object class, which the uncertainty detector aggregates into one variance per pixel, e.g. by taking the sum or by taking the mean; see the sketch below.
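A short sketch of this aggregation, assuming a tensor of per-pixel, per-class variances as produced e.g. by the sampling sketch further above:

```python
# Aggregate per-class variances into one uncertainty score per pixel.
import torch

per_class_variance = torch.rand(3, 128, 256)           # illustrative stand-in, shape (C, H, W)
uncertainty_by_sum = per_class_variance.sum(dim=0)      # sum over the object classes
uncertainty_by_mean = per_class_variance.mean(dim=0)    # or the mean over the object classes
```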
  • Determining segments that represent an object of an unknown object class is done to determine whether the perception system 10 operates within or outside its operation domain.
  • the domain is defined by the object classes the segmenting neural network 40 is trained for.
  • Determining segments that represent an object of an unknown object class thus is part of an out-of-domain detection.
  • the out-of-domain detection is simply marked as domain detection.
  • the detection of segments representing objects of a yet unknown, new object class can be used for creating models representing the new object class and thus for domain adaptation.
  • the uncertainty, derived from the variance or the activation levels, is a pixel image, so for each pixel in the segment map exactly one corresponding uncertainty score is found in the uncertainty score map.
  • if the uncertainty detector comprises a Bayesian neural network as means for detecting inter-sample uncertainty and thus uncertainty scores for the pixels or elements, respectively, aleatoric uncertainty will show at the edges of segments (see figure 9, the light colored "frame" around a segment).
  • aleatoric uncertainty originates from the fact that, in the labelling process, the segments in the training dataset have been labelled manually or semi-automatically by humans, who will label the edges of objects sometimes more narrowly and sometimes more widely.
  • the neural network will learn this variety of labelling styles, and sometimes predict the edges of segments more narrowly and sometimes more widely. This also happens when the samples for variational inference are taken in prediction, leading to uncertainty at the edges of segments.
  • the uncertainty detector preferably is configured to match the regions in the uncertainty map in which aleatoric uncertainty occurs to the edges of the predicted segments in the segment map by means of standard computer vision techniques for edge detection, corner detection, or region detection (aka area detection). Where the aleatoric uncertainty matches the edges of predicted segments, a correct prediction of the segment is indicated. If aleatoric uncertainty is missing at the edge of a predicted segment in the segment map or if aleatoric uncertainty exists within a segment, an incorrect prediction is indicated.
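A rough sketch of such a consistency check (a simplified stand-in for the edge, corner or region detection techniques mentioned above; array shapes and the threshold are assumptions):

```python
# Check whether high-uncertainty pixels coincide with the edges of predicted segments.
import numpy as np

def segment_edges(segment_map):
    # a pixel counts as an edge pixel if its class differs from its right or lower neighbour
    edge = np.zeros_like(segment_map, dtype=bool)
    edge[:, :-1] |= segment_map[:, :-1] != segment_map[:, 1:]
    edge[:-1, :] |= segment_map[:-1, :] != segment_map[1:, :]
    return edge

def edge_match_ratio(segment_map, uncertainty_map, threshold=0.5):
    uncertain = uncertainty_map > threshold
    matched = uncertain & segment_edges(segment_map)
    # high ratio: uncertainty sits at segment edges (aleatoric, prediction plausible);
    # low ratio: uncertainty sits inside segments (potentially incorrect prediction)
    return matched.sum() / max(uncertain.sum(), 1)
```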
  • if the uncertainty detector comprises a Bayesian neural network as means for detecting inter-sample uncertainty and thus for generating uncertainty scores for the pixels or elements, respectively, epistemic uncertainty will show inside of segments (see figure 9, the light colored segments 90). The other approaches for detecting uncertainty will likewise show uncertainty inside of segments.
  • Epistemic uncertainty directly indicates an incorrect segment prediction, i.e. an incorrect segmentation in the segment map.
  • the uncertainty detector is configured to determine regions of uncertainty e.g. by means of standard computer vision techniques for edge detection, corner detection, or region detection (aka area detection). Besides the fact that there is uncertainty indicated by pixels in the uncertainty score map, and that these pixels gather at the edges of segments of the segment map or within uncertain segments of the segment map, the uncertainty values of these per-pixel uncertainties also vary within segments.
  • the uncertainty detector can comprise a small classifying neural network, i.e. a classifier, that is trained with the uncertainty score map in combination with a labelled training dataset.
  • object classes are used that do not belong to the original training domain (as defined by the object classes the segmenting neural network is originally trained for) of the original segmenting neural network.
  • the uncertainty detector optimizes the matching of the determined regions of uncertainty of the uncertainty score map where the elements of the uncertainty score map have high uncertainty scores to the real shapes of the incorrectly predicted segments in the segment map.
  • the uncertainty score map is segmented in accordance with the segment map.
  • the uncertainty detector plausibilizes the regions determined by this classifier with the regions determined by means of the standard computer vision techniques.
  • Fig. 14 illustrates the prediction by a neural network and the measurement of prediction uncertainty. From the input image on the left side, the segmenting neural network computes the class predictions for each pixel on the right side and the uncertainty detector computes uncertainty value predictions for each pixel. Fig. 13 shows the correlation of the shape of uncertainty with an unknown object.
  • the uncertainty detector can be implemented in different ways. For a Bayesian neural network, prediction uncertainty can be determined by means of repeating the prediction several times (sampling) under variation of the segmenting neural network (e.g. dropout of nodes) or insertion of noise into the weights or activation functions.
  • the activation levels in the score maps (after the softmax function but before the maximum likelihood estimator (argmax)) over all the samples will then have a high inter-sample variance for the parts of the input which do not correspond to the training data and a low variance for the parts of the input which correspond well to the training data.
  • the uncertainty detector interacts with the segmenting neural network and randomly modifies the segmenting neural network and in particular at least some nodes of the convolutional layers of the segmenting neural network.
  • the uncertainty detector can for instance be configured to modify at least some of the ReLU activation functions of the convolutional layers to thus cause dropping of nodes or to modify values of weights in some nodes.
  • the uncertainty values can be determined from only one sample by way of determining the inter-class variance between the softmax scores over the score maps (every pixel has a softmax score in every score map).
  • This inter-class variance can be determined e.g. as the difference between the top-2 softmax scores, i.e. the difference between the largest softmax score and the second largest softmax score, or as the variance between the softmax scores over all object classes.
  • to this inter-class variance, a threshold could be applied; see the sketch below. This embodiment does not require that the uncertainty detector modifies the segmenting neural network.
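A minimal sketch of this single-sample, top-2 based measure (shapes and the threshold value are assumptions):

```python
# Inter-class uncertainty from one prediction: a small top-2 softmax margin means high uncertainty.
import torch

probs = torch.softmax(torch.randn(3, 128, 256), dim=0)  # per-class softmax scores per pixel
top2 = torch.topk(probs, k=2, dim=0).values              # largest and second largest score
margin = top2[0] - top2[1]                                # small margin indicates uncertainty
uncertain_pixels = margin < 0.2                           # illustrative threshold
```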
  • the uncertainty detector can comprise a generative neural network, e.g. a variational autoencoder, to recreate the input image and measure the uncertainty by exactly the same means as described above or by determining a reconstruction loss.
  • the generative neural network e.g. a variational autoencoder, implements the same model (i.e. is trained for the same objects or classes) as the segmenting neural network.
  • the uncertainty detector is configured to implement a Bayesian neural network as means for generating inter-sample uncertainty.
  • the process of sampling consumes a lot of time as a statistically relevant number of samples is needed for a reliable result.
  • the uncertainty detector preferably is configured to compute the inter-sample variance over subsequent instances (for instance subsequent frames of a video sequence), while sampling each instance only one or a few times. If the instances are input image pixel matrices corresponding to frames of a video sequence recorded by a vehicle, this is possible because the pixels of these instances correspond to each other within a variable amount of shift, rotation, and scale due to the movement of the vehicle.
  • the uncertainty detector is configured to match single pixels or regions of pixels of one instance to corresponding pixels or regions of pixels of subsequent instances to determine the inter-sample variance between the feature score values of these pixels.
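An illustrative sketch of this inter-frame variant; the alignment helper `warp_to_current` is hypothetical (it could be based on optical flow or ego-motion) and is not specified by the patent:

```python
# Inter-sample variance computed over score maps of subsequent frames after aligning them
# to the current frame.
import torch

def inter_frame_variance(score_maps, flows, warp_to_current):
    # score_maps: list of (C, H, W) tensors from subsequent frames
    # flows: per-frame alignment information consumed by the hypothetical warp helper
    aligned = [warp_to_current(s, f) for s, f in zip(score_maps, flows)]
    stacked = torch.stack(aligned)            # (T, C, H, W)
    return stacked.var(dim=0).mean(dim=0)     # per-pixel variance, averaged over classes
```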
  • Regions in an uncertainty score map generated by the uncertainty detector that exhibit elements with a high uncertainty score are candidates for segments that represent an unknown object, i.e. not the object the segment is assigned to in the segment map and thus not an object belonging to an object class the segmenting neural network is trained for.
  • the uncertainty detector preferably is configured to further plausibilize a segment determined as representing an unknown object by means of communication with other vehicles in the vicinity of a scene.
  • the uncertainty detector will map the scene to a global coordinate system, which is e.g. the coordinate system of the HD map used by the vehicle for localization.
  • the locations of objects in the global coordinate system are detected that correspond to segments in the segment map corresponding to known or to unknown objects. If segment maps generated from input image pixel matrices recorded from different locations are compared, it is possible to compare segments that are candidates for representing an unknown object.
  • Input image pixel matrices from different locations can originate for instance from two different cameras of one vehicle or from the cameras of two different vehicles, see figures 11, 12 and 13.
  • the uncertainty detector of a first vehicle will send the coordinates of the unknown object to other vehicles in the vicinity of the scene, so another uncertainty detector within a receiving vehicle can match the segment received to the corresponding segment of the first vehicle. A match will increase the probability that the object is indeed of an unknown object class. A missing match will decrease this probability.
  • the uncertainty detector uses this means of plausibilization only if the potential unknown object is sufficiently far away so the entire procedure of vehicle-to-vehicle plausibilization takes no more time than the fault-tolerant-time-interval until it would be too late to initiate the minimal risk maneuver.
  • this means of plausibilization can always be applied.
  • the 3D position of the potential unknown object is required, which the uncertainty detector determines e.g. by means of monocular depth estimation, or by inference from motion.
  • the uncertainty detector within one vehicle will send information identifying the segment of the unknown object to other vehicles which do not have to be in the vicinity of the scene.
  • the uncertainty detector in the receiving vehicle will patch the received segment into an unrelated scene observed by the other vehicle and compute the uncertainty. If the segment has high uncertainty in the unrelated scene, the uncertainty detector increases a probability value indicating the probability that the object is indeed of an unknown object class. If the identified segment has low uncertainty in the unrelated scene, the uncertainty detector will decrease the probability value reflecting the probability that the object is indeed of an unknown object class and instead assign the object class determined with low uncertainty to the object. In the case where this computation would take too much time, the uncertainty detector will perform this computation only locally within one vehicle, and the unrelated scene will be from an unrelated dataset, allowing for the further plausibilization with respect to the labels provided with this dataset.
  • the uncertainty detector preferably will create a dataset of new object classes from the unknown objects identified by the uncertainty detector.
  • the dataset will be composed of the sensor input data - i.e. an input image pixel matrix corresponding to the unknown object - together with a matrix of plausibilized uncertain pixels (in case of video input) or points (in case of LiDAR input) as labels, see figure 14.
  • the uncertainty detector is configured to group instances of yet unknown objects into candidate object classes by means of unsupervised segmentation.
  • the uncertainty detector is also configured to determine possible candidate object classes by means of one-shot learning where a visual input is mapped to a feature vector where each feature has an intrinsic meaning and an intrinsic relationship to other features, with features being e.g. descriptions in natural language; see figure 18.
  • One aspect is a vehicle control system comprising a video camera, a semantic segmentation system connected to the video camera, and vehicle-to-vehicle (V2V) communication means allowing vehicle-to-vehicle communication for exchanging object class definitions/representations.
  • the identification of yet unknown objects will be performed by the uncertainty detector on the device, e.g. a sensor where our technology is integrated with the neural network performing a computer vision task.
  • the uncertainty detector will record the sensor input data corresponding to the unknown object, together with the matrix of plausibilized uncertain pixels or points as labels.
  • the device implementing the uncertainty detector needs vehicle-to-vehicle connectivity, provided by other in-vehicle systems.
  • the creation of the dataset of new object classes can be performed either by the uncertainty detector on the device or in the cloud. Therefore, the device needs vehicle-to-infrastructure communication, see figure 13.
Detailed description of domain adaptation
  • Domain adaption serves for extending the operation domain of the perception system.
  • Domain adaptation preferably includes out-of-domain detection, for instance by way of identification of objects belonging to a yet unknown object class.
  • Domain adaptation preferably includes an adaptation or creation of a segmenting neural network so objects of one or more new object classes can be predicted by the segmenting neural network.
  • Figures 15 to 19 illustrate domain adaptation and in particular the process of enabling a segmenting neural network to predict objects of a new object class.
  • Figures 18 and 19 illustrate a continuous domain adaptation.
  • Domain adaptation can be performed either by the uncertainty detector on the device or in the cloud, see figures 16 and 17. This can happen by means of various well-known techniques, in the simplest case by means of re-training the trained segmenting neural network with the newly determined object classes from the dataset of new object classes.
  • re-training of the segmenting neural network means updating the file containing the values of the weights in a persistent memory. During re-training, this file is updated, and the updated file can be loaded by the device, e.g. in the next on-off-cycle of the device. If re-training happens in the cloud, the updated file will be downloaded from the cloud to the persistent memory of the device by the software-update function of the device.
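A minimal sketch of this checkpoint handling, using PyTorch as an example library; the tiny stand-in architecture and the file name are assumptions:

```python
# Persisting the weights after (re-)training and reloading them in the next on-off-cycle.
import torch
import torch.nn as nn

def make_segnet(num_classes=4):
    # stand-in architecture; the real structure is defined by the structure data set
    return nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, num_classes, 1))

model = make_segnet()
torch.save(model.state_dict(), "segnet_checkpoint.pt")     # write the checkpoint file

model = make_segnet()                                      # next on-off-cycle: same structure
model.load_state_dict(torch.load("segnet_checkpoint.pt"))  # load the (possibly updated) weights
model.eval()
```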
  • the operational design domain is the domain in which the system and in particular the perception subsystem can operate reliably.
  • the operational design domain is inter alia defined by the object classes the perception subsystem can discriminate.
  • Domain adaptation means that for instance the perception subsystem is newly configured or updated to be capable of discriminating further object classes that occur in the environment the perception subsystem is used in.
  • Data configuring the neural network - i.e. data defining the weights and the activation functions of the nodes in the layers and of the filter kernel arrays - define a model that is represented by the neural network.
  • Downloading a model thus means downloading configuration data for the neural network so as to define a new model, e.g. a model that is capable of discriminating more object classes.
Updating a model with configuration data from another model
  • When the program instantiates a neural network, this neural network is usually uninitialized. There are then two modes of operation.
  • When the neural network shall be trained, the program will instantiate a neural network model from a software library; its weights will be initialized with random values, and the training procedure will gradually adapt the values of these weights.
  • When the training is over, the values of the weights of the neural network will be saved in a file that corresponds to the architecture (structure, topology) of the neural network and to a generic storage file format, depending on the software library used for implementing the neural network, such as TensorFlow (by Google), PyTorch (by Facebook), Apache MXNet, or the ONNX format (Open Neural Network Exchange).
  • the checkpoint represents the model the neural network is trained for.
  • the program will again instantiate a neural network model from that software library, but for prediction, its weights will not be initialized with random values, but with the stored values for the weights from the checkpoint file.
  • This neural network can then be used for prediction.
  • the checkpoint file always comprises data, which will be loaded by the program on every start of an on-off-cycle of the system.
  • the checkpoint file comprises data configuring the neural network - i.e. data defining the weights and the activation functions of the nodes in the layers and of the filter kernel arrays - that define a model represented by the neural network.
  • Downloading a model thus means downloading the checkpoint file containing configuration data for the neural network so as to define a new model, e.g. a model that is capable of discriminating more object classes.
  • For domain adaptation, a program instantiates a neural network model, which - this time - is not initialized with random data for training, but with the values of weights from the checkpoint file. Then the training procedure is started with the new domain adaptation dataset. Accordingly, this training does not start from scratch but from a pre-trained state.
  • if the architecture of the neural network has to be extended for the new object classes, the new convolutional kernels will be initialized with random data. If, on the other hand, the new object classes are so similar to the object classes already known that it can be expected that they will generalize to the object classes already known, there is no need to change the architecture of the neural network. For training, in both cases, the values of the weights of most layers in the neural network would be frozen (i.e. not amended) and only the last few layers would be trained for the adaptation; see the sketch below. To update the system with the new segmenting neural network, a new checkpoint file is provided.
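A sketch of this adaptation under the stated assumptions (PyTorch-style model; the backbone and the new head below are illustrative stand-ins):

```python
# Freeze the pre-trained layers and train only a new, randomly initialized output layer
# that covers the additional object class.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())  # pre-trained part (stand-in)
new_head = nn.Conv2d(8, 5, 1)     # e.g. 5 instead of 4 classes, randomly initialized

for p in backbone.parameters():
    p.requires_grad = False       # most layers are frozen, i.e. not amended

model = nn.Sequential(backbone, new_head)
optimizer = torch.optim.Adam(new_head.parameters(), lr=1e-4)  # only the last layers are trained
```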
  • Additional training can be achieved by means of few shot learning (i.e. limited number of training data sets for a new object class).
  • In some applications, the segmenting neural network 40 is not directly accessible, for instance in an autonomous driving system comprising a (primary) segmentation system.
  • a separate autonomous device 10' for instance a mobile device such as a smartphone, is provided in addition to the primary segmentation system.
  • the autonomous device 10' comprises a sensor, for instance a camera 12', for generating an input image pixel matrix, a segmenting neural network 40' (for instance a segmenting Bayesian neural network) and an uncertainty detector 60' that is configured to determine regions with matrix elements (corresponding to pixels of the input image pixel matrix) that exhibit an uncertainty score above a threshold; see figure 20.
  • the uncertainty score map is composed of elements that correspond to the pixels of the input image pixel matrix (66), each element of the uncertainty map having an uncertainty score that is determined by the uncertainty detector 60' and that reflects an amount of uncertainty involved in a class prediction for a corresponding element in the segment map 72' generated by the autonomous device 10'.
  • the uncertainty score map may comprise regions with matrix elements that exhibit an uncertainty score above a threshold and thus represent edge cases for the object classification and image segmentation.
  • the autonomous device 10' can be used stand alone or as a second segmentation system.
  • the autonomous device's 10' segmenting neural network 40' is trained with the same classes as the segmenting neural network 40 of the primary image segmentation system.
  • a (primary) image segmentation system of an autonomous driving system typically does not provide means for uncertainty detection while the autonomous device 10' comprises an uncertainty detector 60'. Accordingly, the autonomous device 10' can act as a safety companion that can generate a user-perceivable warning signal when a region in the uncertainty score map that is composed of elements having a high uncertainty score is found in a region of interest of the input image pixel matrix that represents the street in front of a vehicle.
  • a region in the uncertainty score map that is composed of elements having a high uncertainty score typically represents an unknown object and thus an edge case with respect to object recognition.
  • the autonomous device may comprise a new object detector 80 that is operatively connected to the uncertainty detector 60'.
  • the new object detector 80 is configured to find a region in the uncertainty score map that is composed of elements having a high uncertainty score and generate a new object class for such found region.
  • the autonomous device 10' can be configured to exchange labeled data sets comprising segments assigned to a newly generated object class with other autonomous devices 10' to thus increase the number of known object classes available for semantic segmentation by the autonomous devices.
  • the autonomous device may even record a user's (e.g. a driver's) reaction in response to a warning signal of the autonomous device.
  • the type of reaction (or the absence of any reaction) can be used as input data for discriminating relevant unknown objects from less relevant unknown objects.
  • the type of reaction (or the absence of any reaction) can also be used for automatically generating a label for a new object class.
  • An emergency stop or an evasive maneuver as a driver's reaction indicate a high relevance of the unknown object.
  • An absence of a user's reaction indicates a low relevance.
  • An automatically generated label may reflect the degree of relevance of a new object class.
  • the autonomous device 10' can learn to discriminate unknown objects by automatically generating new object classes. Furthermore, the autonomous device 10' can learn a relevance level for each (new) object class and thus can adapt the warning signal level to the object class. If the autonomous device 10' determines the presence of an object (by way of image segmentation) in a region of interest of the input image pixel matrix, the warning signal can be generated in dependence of the relevance of the recognized object.
  • the autonomous device 10' can be configured to exchange data sets comprising data representing the relevance level of a known object class or an observed user reaction (observed behavior) with other autonomous devices.
  • the autonomous device may comprise a data interface 82, in particular a wireless data interface 82 that allows data exchange over e.g. the internet.
  • the autonomous device 10' preferably is a pocketable, mobile device a user easily can carry and that can easily be mounted for instance to a windshield of a vehicle.
  • the autonomous device 10' is mounted in a position where the autonomous device's viewing angle at least in part corresponds to the viewing angle of the sensor 12 or sensors of the autonomous driving system.
  • the autonomous device may comprise a segmenting neural network 40' that implements a semantic segmentation model, which is preferably continuously trained with the output of the new object detector 80 as input to thus enable the neural network to predict new objects encountered before.
  • the output of the new object detector 80 is saved as a label of a dataset that further comprises the corresponding input image pixel matrix and thus suits as a training data set, see figures 21, 22 and 23.
  • the training data set can be transmitted to other autonomous devices over the internet, using the mobile data interface 82 of the autonomous device 10', i.e. the mobile phone.
  • the autonomous device 10' preferably is adapted for training and thus updating the semantic segmentation model implemented by the segmenting neural network 40' using training data sets comprising new objects as received from other autonomous devices.
  • the autonomous device 10' can be enabled to predict edge cases encountered before by other, similar autonomous devices, see figure 24.
  • the segmenting neural network 40' of the autonomous device 10' implements a model which is trained with the output (i.e. the training dataset generated by the new object detector) of the new object detector 80 as input, and with the recorded user reaction as secondary input, in order to learn to predict the user's reaction when encountering the particular new object.
  • a source 84 provides the data representing the action performed by the user, i.e. the driver.
  • a user's reaction when encountering the particular new object can be used for generating a new object class and a label for the new object class.
  • the training data set together with data representing the action performed by the user can be transmitted to other autonomous devices over the internet, using the mobile data interface 82 of the autonomous device 10'.
  • the autonomous device 10' can be mounted behind the windshield of a vehicle with the autonomous device's camera facing in the direction of driving while executing a program that comprises an autonomous driving stack, connected to the output of the camera of the autonomous device as input, and providing a trajectory determined by the autonomous driving stack to the autonomous driving system of the vehicle over a connection between the autonomous device and the vehicle communications bus.
  • the autonomous device 10' can be mounted behind the windshield of a vehicle with the camera facing in the direction of driving, executing a program that implements a prediction system of an autonomous driving stack, connected to the output of the camera of the autonomous device as input, and providing an object list determined by a perception system to the autonomous driving system of the vehicle over a connection between the autonomous device and the vehicle communications bus.
  • the output of the new object detector 80 is a label, with which the unknown object is labelled as a new object, and everything else is labelled as background and assigned to an ignore class.
  • the sensor input together with the label is a training dataset.
  • Elements assigned to the ignore class will not cause a loss when the loss function is applied during training of the semantic segmentation model with the training data set comprising the new object; see the sketch below.
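A small sketch of how such an ignore class is typically realized (PyTorch's `ignore_index` is used here as an example mechanism; label values and shapes are assumptions):

```python
# Pixels labelled with the ignore class contribute no loss during training.
import torch
import torch.nn as nn

IGNORE = 255                                        # label value of the ignore class
loss_fn = nn.CrossEntropyLoss(ignore_index=IGNORE)

logits = torch.randn(1, 4, 64, 64)                  # per-pixel scores for 4 classes
labels = torch.full((1, 64, 64), IGNORE)            # background: assigned to the ignore class
labels[:, 20:40, 20:40] = 3                         # only the new object class causes a loss
loss = loss_fn(logits, labels)
```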
  • the training data set generated by the autonomous device 10' can be easily shared with other devices, without having to shift a massive amount of data since the training data set only comprises data that is relevant with respect to the new object class.
  • the autonomous device 10' can easily receive training data sets for new object classes from other autonomous devices as an input, see figure 25.
  • False positive new objects, i.e. objects that are not new but already known and that may have caused a high uncertainty score on the pixel level due to other circumstances, can be avoided due to context by means of determining the uncertainty of the new object when inserted into a known context by the autonomous driving system or another autonomous device.
  • a perception system 90 is provided where a secondary system 92 and a primary system 94 of a vehicle are combined to form a system that can determine regions with elements exhibiting a high uncertainty level even in matrices provided by the primary system.
  • This is important because the primary system can be a proprietary system and thus a black box that is inaccessible from the outside. Accordingly, the primary system may not be modified for generating variance e.g. by means of Monte Carlo dropout to thus determine uncertainty scores.
  • the additional secondary system may, however, be configured similarly to the autonomous device 10' disclosed herein before and thus is capable of identifying regions with elements exhibiting a high uncertainty score.
  • the primary system 94 comprises a segmenting neural network 40 and an object list generator 96 that generates a list of objects corresponding to the segments generated by the segmenting neural network 40.
  • the objects are aligned, associated, fused and managed in a sensor fusion system 98.
  • the sensor fusion system 98 is implemented by a Bayesian filter, which is a probabilistic robotics algorithm. This algorithm indicates its level of uncertainty to its successor, e.g. by means of the Kalman Gain. Based on this uncertainty, the successor will choose either the sensor input or the prediction made by the Bayesian filter.
  • the primary system 94 is not Bayesian, so the Bayesian filter in the sensor Fusion system 98 assumes a static uncertainty, calibrated only once for each sensor. With the secondary system 92, the perception system 90 is enabled to provide uncertainty, see figure 26.
  • the secondary system 92 preferably comprises multiple parallel redundant uncertainty detectors 60, e.g. a Bayesian neural network, a variational autoencoder, an object detector 80 trained with edge cases already encountered, and a new object detector 80 trained with new objects encountered but to be suppressed.
  • the output of multiple parallel redundant uncertainty detectors in a parallel redundant architecture is matched on a per-pixel basis, e.g. by the per-pixel sum or by the per-pixel maximum over the uncertainties or by means of a Bayesian filter, e.g. a particle filter; see the sketch below.
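A minimal sketch of the per-pixel matching by sum or maximum (the number of detectors and the map shapes are assumptions; a Bayesian/particle filter variant is not shown):

```python
# Fuse the uncertainty maps of several parallel redundant uncertainty detectors per pixel.
import torch

detector_maps = torch.stack([torch.rand(128, 256) for _ in range(3)])  # one map per detector
fused_sum = detector_maps.sum(dim=0)          # per-pixel sum over the detectors
fused_max = detector_maps.max(dim=0).values   # or the per-pixel maximum
```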
  • aleatoric uncertainty with the segment boundaries in a semantic segmentation is matched by means of a rule-based system.
  • pixel uncertainties are aggregated into segment uncertainties, e.g. by means of a threshold over a sum or a variance over pixel uncertainties.
  • a semantic segmentation model is trained with a new object training dataset as input to the semantic segmentation model, and the output of the semantic segmentation model and the outputs of one or multiple parallel redundant uncertainty detectors are fed as input to a supervisor model that is trained to predict a semantic segmentation with two classes, indicating if an entire segment is considered correctly or incorrectly predicted.
  • a system as shown in figure 27 may be provided.
  • the system is configured to determine the loss between a segment map 54A generated by the unknown segmenting neural network from the input image pixel matrix of a training data set and a segment map 54B provided with the training dataset.
  • the segment map 54B provides the labels for the input image matrix as generated by the segmenting neural network that has generated the training dataset.
  • the system shown in figure 27 is configured to determine the loss (determined by loss function 76) between the segment map 54A generated by the unknown segmenting neural network and the segment map 54B belonging to the training dataset.
  • a high loss indicates where the labels provided with the training dataset differ from the labels (i.e. the predictions in the segment map 54A) generated by the unknown segmenting neural network.
  • Elements or pixels that exhibit a high loss can be assigned to an ignore class to thus avoid that these image parts affect the training of the Bayesian segmenting neural network by way of transfer learning with a training dataset as for instance generated by an autonomous device as disclosed above.
  • elements assigned to the ignore class by means of loss function 76 will be ignored by the loss function 78 that is used for training the Bayesian segmenting neural network 40.
  • the Bayesian segmenting neural network 40 can, for instance, be the segmenting neural network of the autonomous device 10' while the unknown trained segmenting neural network can be the segmenting neural network of an autonomous driving system.
  • Figure 28 illustrates the training of a variational autoencoder 62', by way of transfer learning with a dataset as input to a (known or unknown) trained segmenting neural network.
  • the dataset is used as input to the variational autoencoder.
  • the dataset used as input for the variational autoencoder is modified by computing a loss for the pixels predicted by the trained segmenting neural network and assigning pixels with a high loss to the ignore class when training the variational autoencoder.
  • the variational autoencoder 62' may implement a known model of a trained segmenting neural network.
  • Figure 29 illustrates uncertainty detection by means of variational autoencoder 62' (as a generative neural network) in order to determine pixels with a high loss due to uncertainty.
  • if an input dataset, for instance an input image pixel matrix, which is fed to a trained neural network, e.g. for semantic segmentation, comprises an unknown object, the pixels representing the unknown object should exhibit a high uncertainty score.
  • the input dataset 66 is fed as an input to the variational autoencoder 62' and to a loss function 80.
  • the prediction 82 of the variational autoencoder 62' is also fed to the loss function 80 in order to determine the loss between the input image pixel matrix data set 66 that may comprise data representing an unknown object and the prediction 82 (output data set) of the variational autoencoder 62'.
  • a loss for the pixels predicted by the trained neural network can be computed and an uncertainty score map 74' can be generated accordingly.
  • the uncertainty score map 74' comprises a segment with pixels having a high uncertainty score where pixels of the input image pixel matrix 66 represent an unknown object (yellow box in figure 29); see the sketch below.
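A sketch of this reconstruction-loss based uncertainty map (the toy autoencoder stands in for the variational autoencoder 62'; shapes are assumptions):

```python
# Per-pixel reconstruction loss between input and reconstruction used as uncertainty score.
import torch
import torch.nn as nn

autoencoder = nn.Sequential(                 # toy stand-in for the generative model 62'
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 3, 3, padding=1),
)

image = torch.rand(1, 3, 128, 256)           # input image pixel matrix 66
with torch.no_grad():
    reconstruction = autoencoder(image)      # prediction 82
uncertainty_map = (image - reconstruction).abs().mean(dim=1)  # per-pixel loss, cf. map 74'
```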
  • For training the variational autoencoder (see figure 28), pixels with a high loss can be ignored.
  • the variational autoencoder used for determining uncertainty (i.e. pixels with a high loss) is configured as a Bayesian neural network that applies variation, for instance by way of Monte Carlo dropout.
  • the reliability of the loss determined with the help of the variational autoencoder can be determined by way of variational inference.
  • Figure 30 illustrates the use of one-shot learning in order for an uncertainty detector to suggest the most likely class for a pixel belonging to an unknown object by similarity to the known classes. This approach can also be used for automatically generating labels for new object classes.
  • Figure 31 illustrates that multiple parallel redundant uncertainty detectors 60 can be executed in a parallel redundant architecture, with all uncertainty detectors 60 sharing one common encoder network.
  • the device 100 can be a smartphone that can be mounted behind the windshield of a vehicle or held in hand.
  • the device 100 can be equipped with one or more cameras 102 for generating a video stream 104 that is fed to a neural engine 106 or a similar processing unit.
  • the neural engine 106 is configured to generate an object list 108.
  • an output terminal 110 is provided.
  • the output terminal can for example be a USB terminal (Universal Serial Bus) or an Ethernet terminal.
  • the output terminal 110 can also be a wireless terminal using wireless local area network (WLAN, WiFi) protocol or a Bluetooth interface.
  • input terminals may be provided in addition or as an alternative to the camera or cameras 102.
  • the input terminal may be a universal serial bus terminal (USB terminal), an Ethernet terminal, a wireless local area network terminal or a Bluetooth terminal.
  • Such input terminal enables the device 100 to receive a video stream from one or more external cameras, devices or smartphones connected to the device 100.
  • the device 100 can be configured to generate (by means of neural engine 106) and providing the object list to further devices.
  • the object list 108 may comprise positions with two-dimensional or three-dimensional coordinates of objects; preferably, the coordinates localize an object in the camera coordinate system. Further, each position is preferably annotated with a class, e.g. the class of a recognized object.
  • the position may be annotated with a time interval and the position may be further annotated with an uncertainty score.
  • the list of objects and positions 108 comprises for each recognized object an identifier for the object, coordinates that identify the position of the recognized object in the camera coordinate system, a time stamp providing information about the time interval when the recognized object was at the position identified by the coordinates and an uncertainty score that provides information about how reliable the object recognition is with respect to the particular recognized object.
  • the position of the recognized object can for instance be encoded by a polygon around the object in polar coordinates. Since the objects are recognized in frames of a video stream, it is also possible to determine how the position of an object changes from frame to frame. This makes it possible to generate trajectories that combine positions of two-dimensional or three-dimensional coordinates with the optional annotations and a time in the future. Such a trajectory describes where a certain object is expected to be at that time in the future.
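For illustration only, the annotated object list 108 and the derived trajectories could be represented by data structures along the following lines; all field names are assumptions, not part of the disclosure.

```python
# Illustrative data structures for the annotated list of objects and positions 108.
from dataclasses import dataclass, field

@dataclass
class RecognizedObject:
    object_id: int
    object_class: str                   # e.g. "bicyclist"
    position: tuple[float, ...]         # 2D or 3D coordinates in the camera coordinate system
    polygon: list[tuple[float, float]]  # polygon around the object, e.g. in polar coordinates
    timestamp: float                    # time (interval) at which the object was at `position`
    uncertainty_score: float            # reliability of the recognition

@dataclass
class Trajectory:
    object_id: int
    object_class: str
    # each entry: (time in the future, expected coordinates at that time)
    future_positions: list[tuple[float, tuple[float, ...]]] = field(default_factory=list)
```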
  • the annotated list of objects and positions can be forwarded to another, connected device by any one of the interfaces mentioned earlier, i.e. USB, Ethernet, WLAN and/or Bluetooth.
  • the device 100 further may be adapted to provide a video output 112 and/or to provide acoustic signals or messages via an audio output 116.
  • the device may further be configured to trigger start or stop of recording.
  • Figure 32 illustrates a configuration of the neural engine that is suitable for object recognition.
  • Neural engine 106 implements a segmenting neural network 120 with an encoder module 122 and a decoder module 124.
  • the encoder 122 comprises a down-sampling module 126 including an input layer for down-sampling an input image pixel matrix (for instance a frame of a video stream).
  • the layers for down-sampling the input image pixel matrix may implement variational inference (as described herein further above) with partial model sampling and/or with consecutive video sampling and/or with a Lipschitz constraint for spectral normalization of the softmax values.
  • the layers for down-sampling may be configured to process simultaneous inputs for a number of different points in time. This can be achieved by a temporal shift of video frames or input image pixel matrices, respectively.
  • the encoder 122 also provides a feature extraction module 128.
  • the encoder layers for feature extraction may implement temporal shift and/or variational inference, optionally with partial model sampling, and/or with consecutive video sampling and/or with a Lipschitz constraint for spectral normalization of the softmax values.
  • the results of the feature extraction are provided to a feature fusion module 130 of the decoder 124 of the segmenting neural network.
  • Feature fusion preferably is achieved by means of a Kalman filter, wherein feature matching is achieved via the feature position as determined by the feature extraction module 128 of the encoder module 122.
  • the feature tensors generated by the feature extraction module 128 of the encoder 122 capture spatial details of the input image pixel matrixes.
  • the classifier module 132 of the decoder classifies each pixel of the input image pixel matrix by assigning each pixel to one of the object classes the segmenting neural network is trained for.
  • the segmenting neural network as shown in figure 32 can be used for object recognition and is capable of recognizing objects (i.e. determining segments with pixels that belong to a certain object class the segmenting neural network was trained for) and of encoding the segment with a polygon around the recognized object.
  • the polygon is annotated with a class of the recognized object.
  • the segmenting neural network of figure 32 comprises an encoder- decoder architecture and receives an image (i.e. a frame) from an input video stream.
  • the segmenting neural network generates one feature map per object class with a softmax value of pixel-wise segment predictions.
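The module structure described above (down-sampling 126, feature extraction 128, feature fusion 130, classification 132) could be sketched as follows; the layer sizes and operations are placeholder assumptions and not the actual patented architecture, only an outline of the data flow.

```python
# Skeleton of the described encoder-decoder segmentation network (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationNet(nn.Module):
    def __init__(self, num_classes: int, channels: int = 64):
        super().__init__()
        # down-sampling module 126
        self.down = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU())
        # feature extraction module 128
        self.features = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        # feature fusion module 130: fuses the down-sampling output with the extracted features
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        # classification module 132: one feature map per object class
        self.classifier = nn.Conv2d(channels, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d = self.down(x)
        f = self.features(d)
        fused = self.fuse(torch.cat([d, f], dim=1))
        logits = self.classifier(fused)
        # upsample back to input resolution and return pixel-wise softmax per class
        logits = F.interpolate(logits, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return logits.softmax(dim=1)
```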
  • the encoder 122 of the semantic segmentation model of figure 33 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix and a feature extraction module 128 for feature extraction. As indicated by the dashed lines, the feature extraction module 128 may be configured to output information regarding the context and/or information regarding spatial detail. Information regarding context and/or spatial detail can be fed to the feature fusion module 130 of the decoder to thus allow position and/or context based feature fusion.
  • the feature fusion module 130 of the decoder is optional and preferably is provided in case of multiple inputs from different layers of the encoder. Preferably, at least one input is directly provided from the last layer of the down-sampling module 126 of the encoder 122. Further, the feature fusion module may be provided with one or more inputs from the inner layers of the feature extraction module.
  • the semantic segmentation module as shown in figure 33 may receive a secondary input from a depth estimation module as illustrated in figure 37.
  • the semantic segmentation module may also implement polygon regression.
  • the annotated list of objects may comprise an annotation with a relative uncertainty score indicating the reliability of the classification and potentially hinting to a yet unknown class of object in case the uncertainty score is high.
  • the object list generated by the segmenting neural network as shown in figure 33 may comprise annotations that identify a proposed similar class of object in case of a high degree of uncertainty.
  • neural engine 106 is configured to extract features from one or more input image pixel matrixes, i.e. frames of video streams.
  • the models implemented by the neural networks of the neural engine 106 may be configured to extract one or more of the following features:
    - classes of objects (figures 32 and 33)
    - classes of gestures (figures 34, 35, 36 and 37)
    - classes of actions
    - classes of awareness (figure 38)
    - and/or forecasts of classes of action (figures 39 and 40)
  • the neural engine 106 implements one or more neural networks that preferably have an encoder-decoder architecture.
  • the input data set fed to a respective input layer of a down-sampling module of the encoder of the respective neural network depends on the feature to be extracted.
  • the input data sets are input image pixel matrixes (frames) of one or more video input streams provided by one or more cameras.
  • if the features to be extracted are classes of gestures signaled by a person or a vehicle, preferably encoded by a polygon around the object, a neural network implementing a video action recognition model as illustrated in figure 35, optionally receiving a secondary input from a depth estimation model as illustrated in figure 37, can be used.
  • if the feature to be extracted is a class of awareness of a person or a vehicle, preferably encoded by a polygon around the object, a neural network implementing a semantic segmentation model as illustrated in figure 33 and/or a video action recognition model as illustrated in figure 35 may be used.
  • if the feature to be extracted is a forecast of a class of action, a neural network implementing an action forecast model as illustrated in figure 40 may be used.
  • the model preferably is configured to receive a video input stream.
  • the output of the respective model depends on the feature to be extracted:
    - class of object recognition: the semantic segmentation model preferably provides a list with classes of objects and positions; see figure 33,
    - class of gesture recognition: the video action recognition model preferably provides one anchor per recognized object encoding a polygon around the object and the class of gesture or action performed by that object,
    - class of action recognition: the video action recognition model as illustrated in figure 35, preferably receiving a secondary input from a depth estimation model as illustrated in figure 37, provides an annotation of the class of action performed by a recognized object, the recognized object preferably being a person or a vehicle,
    - recognition of a class of awareness of a person or a vehicle: the neural network preferably provides an annotation of a class of awareness for persons or vehicles recognized by means of the semantic segmentation model of figure 33 and/or the video action recognition model of figure 35, and
    - a forecast of a class of action to be performed by a person or vehicle, preferably encoded by a polygon around the object: by means of an action forecast model as illustrated in figure 40, one anchor per object is generated encoding the polygon around the object, the class of action to be performed by that object and a time interval for which the class of action is anticipated.
  • the models may share identical encoders having identical down-sampling modules and identical feature extraction modules. Further, the models may share identical feature fusion modules of the respective decoder. However, the classification module of the respective decoder varies depending on the feature to be extracted.
  • For object recognition, preferably a semantic segmentation model according to figure 33 is provided.
  • the model preferably implements an encoder-decoder architecture and receives an image from a video input stream.
  • the semantic segmentation model generates one feature map per class with the softmax values of pixel-wise segment predictions.
  • the encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix.
  • the encoder 122 further comprises a feature extraction module 128 for feature extraction.
  • the encoder preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion.
  • the decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
  • the feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/ or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
  • the decoder 124 preferably includes a classification module 132 for pixel-wise classification.
  • the semantic segmentation model may receive a secondary input from a depth estimation model as illustrated in figure 37, with their combination outlined in figure 51 and following.
  • the semantic segmentation model preferably implements polygon regression.
  • For class of gesture recognition, preferably a video action recognition model according to figure 35 is provided.
  • the video action recognition model preferably implements an encoder-decoder architecture and receives a video input stream.
  • the video action recognition model generates one anchor per recognized object encoding a polygon around the recognized object and the class of gesture or action performed by that recognized object.
  • the encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix.
  • the encoder 122 further comprises a feature extraction module 128 for feature extraction.
  • the encoder 122 preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion.
  • the encoder 122 preferably is configured for polygon-wise regression of the polygon and/or classification of objects and/or classification of gesture or action.
  • the encoder 122 preferably comprises one or more temporal shift modules as illustrated in figure 36. Such temporal shift module 140 may be inserted into the down-sampling module 126, and/or inserted into the feature extraction module 128 and/or inserted into the feature fusion module 130.
  • the decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
  • the feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/ or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
  • the decoder 124 preferably includes a classification module 132 for generating one anchor per object encoding a polygon around the object and the class of gesture or action performed.
  • the video action recognition model may receive a secondary input from a depth estimation model as illustrated in figure 37, with their combination outlined in figure 51 and following.
  • the classification module 132 is based on the head of Poly-YOLO, https://arxiv.org/abs/2005.13243.
  • Optional temporal shift modules 140 for detection of actions over time by means of providing simultaneous input of m different points in time (10) are added. Further preferred features are variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint for spectral normalization of the softmax values (14).
  • the classification modules 132 of the video action recognition model as illustrated in figure 35 and the action forecast model as illustrated in figure 40 are differently trained, i.e. trained with different training data sets and thus are different.
  • the depth estimation model as illustrated in figure 37 preferably implements an encoder- decoder architecture.
  • the encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix.
  • the encoder 122 further comprises a feature extraction module 128 for feature extraction.
  • the encoder 122 preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion.
  • the encoder 122 preferably comprises one or more temporal shift modules as illustrated in figure 36.
  • Such temporal shift module 140 may be inserted into the down-sampling module 126, and/or inserted into the feature extraction module 128 and/or inserted into the feature fusion module 130.
  • the decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
  • the feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/ or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
  • the decoder 124 preferably includes a classification module 132 for depth estimation.
  • the depth estimation model may receive a secondary input from a semantic segmentation model as illustrated in figure 33, with their combination outlined in figure 51 and following.
  • the video action recognition model may further generate annotations with respect to relative value of uncertainty indicating an unknown class of gesture or action and/or indicating the case where another object is mistaken for a person or vehicle.
  • all dashed connections are optional.
  • Optional connectors 7, 8, 9 were added for parallel instantiation of context and spatial detail.
  • the classification module 132 is based on the head of Poly-YOLO, https://arxiv.org/abs/2005.13243.
  • Optional temporal shift modules 140 for detection of actions over time by means of providing simultaneous input of m different points in time (10) are added. Further preferred features are variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint for spectral normalization of the softmax values (14).
  • For action forecasting, preferably an action forecast model according to figure 40 is provided.
  • the action forecast model preferably implements an encoder-decoder architecture and receives a video input stream.
  • the video action forecast model generates one anchor per recognized object encoding a polygon around the recognized object, the class of action to be performed by that object, and a time interval for which the class of action is anticipated.
  • the encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix.
  • the encoder 122 further comprises a feature extraction module 128 for feature extraction.
  • the encoder 122 preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion.
  • the encoder 122 preferably is configured for polygon-wise regression of the polygon and/or classification of objects and/or classification of gesture or action.
  • the encoder 122 preferably comprises one or more temporal shift modules as illustrated in figure 36. Such temporal shift module 140 may be inserted into the down-sampling module 126, and/or inserted into the feature extraction module 128 and/or inserted into the feature fusion module 130.
  • the decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
  • the feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/ or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
  • the decoder 124 preferably includes a classification module 132 for generating one anchor per recognized object encoding a polygon around the recognized object, the class of action to be performed by that object, and a time interval for which the class of action is anticipated.
  • the action forecast model may receive a secondary input from a semantic segmentation model as illustrated in figure 33, with their combination outlined in figure 51 and following.
  • the action forecast model may receive a secondary input from a depth estimation model as illustrated in figure 37, with their combination outlined in figure 51 and following.
  • the action forecast model may receive a secondary input from a video action recognition model as illustrated in figure 35, with their combination outlined in figure 51 and following.
  • the video action recognition model may further generate annotations with respect to relative value of uncertainty indicating the case where another object is mistaken for a person or vehicle.
  • all dashed connections are optional.
  • Optional connectors 7, 8, 9 were added for parallel instantiation of context and spatial detail.
  • the classification module 132 is based on the head of Poly-YOLO, https://arxiv.org/abs/2005.13243.
  • Optional temporal shift modules 140 for detection of actions over time by means of providing simultaneous input of m different points in time (10) are added. Further preferred features are variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint for spectral normalization of the softmax values (14).
  • the classification modules 132 of the video action recognition model as illustrated in figure 35 and the action forecast model as illustrated in figure 40 are differently trained, i.e. trained with different training data sets and thus are different. Training of the classification module 132 of the action forecast model as illustrated in figure 40 can be performed by unsupervised learning using labels for training data sets that are generated by the video action recognition model as illustrated in figure 35. For doing so, time shift is applied to learn anticipated actions for action forecasting from a sequence of previously recognized actions as recognized by the video action recognition model as illustrated in figure 35.
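A sketch of the time-shift labeling idea described above, assuming a hypothetical mapping `recognized` from frame index to the action class output by the video action recognition model of figure 35; the forecast label for time t is the action recognized a fixed horizon later.

```python
# Sketch of time-shifted label generation for training the action forecast head.
def forecast_training_pairs(recognized: dict[int, str], horizon: int) -> list[tuple[int, str]]:
    """Pair each frame t with the action recognized `horizon` frames later as its forecast label."""
    pairs = []
    for t in sorted(recognized):
        future = t + horizon
        if future in recognized:
            pairs.append((t, recognized[future]))   # input sequence up to t, label = action at t + horizon
    return pairs
```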
  • temporal shift is implemented for detection of actions over time by means of providing simultaneous input of different points in time.
  • a temporal shift module 140 as illustrated in figure 36 can be provided.
  • the temporal shift module 140 is based on https://arxiv.org/abs/1811.08383.
  • Optional support for m features in the temporal dimension instead of only one is added.
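A minimal sketch of the temporal-shift idea in the spirit of the cited reference; the shift fraction and the (batch, time, channels, height, width) tensor layout are assumptions, not the patented module.

```python
# Shift part of the channels along the temporal dimension so that each frame sees
# information from its neighbouring frames (cf. https://arxiv.org/abs/1811.08383).
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """x has shape (batch, time, channels, H, W)."""
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift one channel group forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift another group backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # leave the remaining channels unshifted
    return out
```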
  • Figure 34 illustrates the use case of gesture recognition.
  • the gestures to be recognized are “turning of head” of the recognized object “bicyclist” or “signaling turn” of the recognized object “bicyclist”.
  • Figure 38 illustrates awareness recognition. Depending on the orientation of a bicyclist's head, the recognized object "bicyclist" is provided with an annotation representing a class of awareness, i.e. "unaware" or "aware". As in all other cases illustrated herein, for each class annotation (e.g. "unaware" or "aware") an uncertainty score can be determined and assigned to the class annotation.
  • Figure 39 illustrates action forecast for the recognized object "bicyclist". Depending on the situation (recognized lanes and recognized non-moving object "car"), a forecast for the position of the object "bicyclist" is generated.
  • the models described above can be used to implement functions, e.g.
    a. to derive controllability of a situation from the awareness of persons or vehicles of the ego vehicle
    b. to derive a risk estimation for each person or vehicle
       i. with severity
  • the models described above can also be used to implement use cases, e.g.
    a. for any action performed by a person or vehicle which the model has been trained to recognize
       i. generate a warning
       ii. and/or annotate the video recording with the license plate of the vehicle
       iii. and/or provide this information to an autonomous driving system
       iv. when a person, bicyclist or motorcyclist is about to cross the street
  • the cascaded monitoring concept 150 may comprise two parallel situation monitors 152 to recognize edge cases with respect to the dataset, deriving segments from pixel-wise uncertainty, encoded by a polygon around the segment.
  • a primary situation monitor 152 implements a model to match the pixel-wise reconstruction loss of an autoencoder model to segments by means of a model with the same topology as the classification module of the decoder of the action recognition model.
  • the model may be configured for matching pixel-wise epistemic uncertainty to segment boundaries with pixel-wise epistemic uncertainty provided by the segmentation model and/ or by the depth estimation model and/ or optionally by the autoencoder model.
  • the model may be configured for matching pixel-wise aleatoric uncertainty to segment boundaries to recognize the case where multiple overlapping segments have the same class and there is not enough epistemic uncertainty.
  • Figure 43 illustrates a situation monitor 154 based on the head of Poly-YOLO, https://arxiv.org/abs/2005.13243.
  • the situation monitor 154 implements an autoencoder model 160 to provide the reconstruction loss.
  • the model preferably implements an encoder-decoder architecture and receives an image from a video input stream.
  • the semantic segmentation model generates one feature map per class with the softmax values of pixel-wise segment predictions.
  • the encoder 122 comprises a down-sampling module 126 for down-sampling of the input image pixel matrix.
  • the encoder 122 further comprises a feature extraction module 128 for feature extraction.
  • the encoder preferably provides parallel instantiation for context and/or for spatial detail (see dashed lines). This means that information regarding the context or spatial detail is fed to the feature fusion module 130 to facilitate feature fusion.
  • the decoder 124 preferably includes a feature fusion module 130 in case of multiple inputs from different layers of the encoder 122.
  • the feature fusion module 130 preferably receives one input directly from the last layer of the down-sampling module 126 and/or preferably one input from the last layer of the feature extraction module 128 and/ or one or multiple inputs from the inner layers of the feature extraction module 128 (instantiation of context and spatial detail).
  • the decoder 124 preferably includes a classification module 132 for reconstruction of the input image and computation of a pixel-wise reconstruction loss.
  • Figure 44 illustrates an autoencoder model 160. All dashed connections are optional. Optional connectors 7, 8, 9 were added for parallel instantiation of context and spatial detail. Depth estimation module 132 is independently designed with symmetry to the classification module. Optional temporal shift modules 140 for detection of actions over time by means of providing simultaneous input of m different points in time (10) are added. Further preferred features are variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint for spectral normalization of the softmax values (14).
  • the other situation monitor 152 implements a model to match the pixel-wise reconstruction loss of an autoencoder model to segments.
  • the other situation monitor 152 comprises two parallel validity monitors 156 and 158.
  • One validity monitor is a Bayesian validity monitor 158 comprising
    - dropout layers, inserted into convolutional submodules, for variational inference, deriving uncertainty by means of sampling under dropout, and
    - a sampler for sampling the model with partial model sampling and/or with consecutive video sampling.
    Matching the edge cases found to the segments can be achieved by means of a probabilistic filter (e.g. a Kalman filter).
  • Figure 45 illustrates a convolutional submodule
  • Figure 46 illustrates a convolutional submodule with dropout. This module can be based on https://arxiv.org/abs/1703.04977 and https://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf.
  • Figure 47 illustrates a Bayesian sampling module
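Sampling under dropout, as used by the Bayesian validity monitor, could look roughly as follows; the model, the number of samples and the output layout are assumptions.

```python
# Sketch of Monte Carlo dropout sampling: run the model several times with dropout
# active and use the variance of the predictions as pixel-wise epistemic uncertainty.
import torch

def mc_dropout_uncertainty(model: torch.nn.Module, image: torch.Tensor, samples: int = 8):
    model.train()   # keeps dropout active at inference time (in practice only the dropout
                    # layers would be switched to train mode, not e.g. batch norm)
    with torch.no_grad():
        preds = torch.stack([model(image) for _ in range(samples)])  # (samples, 1, classes, H, W)
    mean = preds.mean(dim=0)                  # averaged class probabilities
    variance = preds.var(dim=0).mean(dim=1)   # pixel-wise epistemic uncertainty, shape (1, H, W)
    return mean, variance
```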
  • the other validity monitor 156 can be a Lipschitz validity monitor, inserted into convolutional submodules, normalizing the softmax to correlate with uncertainty; the result is optionally provided to the situation monitor by means of the softmax value and/or the softmax variance.
  • Figure 48 illustrates a Lipschitz submodule. This module can be based on https://arxiv.org/abs/2102.11582 and https://arxiv.org/abs/2106.02469.
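A sketch of a convolutional submodule with a Lipschitz constraint enforced via spectral normalization; whether the patented Lipschitz submodule uses exactly this normalization is not stated here, so this is only an illustration in the spirit of the cited references.

```python
# Lipschitz-constrained convolutional submodule via spectral normalization.
import torch.nn as nn
from torch.nn.utils import spectral_norm

class LipschitzConvBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # spectral_norm bounds the layer's spectral norm, which enforces a Lipschitz constraint
        # on the layer and thereby keeps the downstream softmax values better calibrated.
        self.conv = spectral_norm(nn.Conv2d(in_ch, out_ch, 3, padding=1))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv(x))
```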
  • Uncertainty can be provided to probabilistic filters.
  • Figure 49 illustrates an integration of the situation monitor with a Kalman filter. Preferably, similar classes are predicted in case of uncertainty.
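As an illustration of how an uncertainty score could enter such a probabilistic filter, a toy one-dimensional Kalman update is sketched below; the mapping from uncertainty score to measurement noise is an assumption, chosen only to show that uncertain detections are down-weighted.

```python
# Toy Kalman update: a high uncertainty score inflates the measurement noise,
# which reduces the Kalman gain and thus the influence of that detection.
def kalman_update(state: float, state_var: float,
                  measurement: float, uncertainty_score: float) -> tuple[float, float]:
    measurement_var = 1.0 + 10.0 * uncertainty_score    # more uncertain detections get more noise
    gain = state_var / (state_var + measurement_var)
    new_state = state + gain * (measurement - state)
    new_var = (1.0 - gain) * state_var
    return new_state, new_var
```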
  • Figure 50 illustrates a use case of similarity prediction
  • all or a subset of the models used in the extraction of features are integrated within a single (multi-head) model, consisting of one or more shared encoders.
  • the multi-head model may comprise a segmentation head and/or a depth estimation head and/or an autoencoder head and/or a video action recognition head and/or an action forecast head.
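The shared-encoder multi-head structure could be sketched as follows; the head names and the plain dictionary interface are assumptions used only to illustrate how one encoder can feed several task heads.

```python
# Sketch of a multi-head model with one shared encoder (cf. figures 51 and 52).
import torch
import torch.nn as nn

class MultiHeadModel(nn.Module):
    def __init__(self, encoder: nn.Module, heads: dict[str, nn.Module]):
        super().__init__()
        self.encoder = encoder                # shared encoder
        self.heads = nn.ModuleDict(heads)     # e.g. segmentation, depth, autoencoder, action heads

    def forward(self, x: torch.Tensor) -> dict[str, torch.Tensor]:
        features = self.encoder(x)            # encode once
        return {name: head(features) for name, head in self.heads.items()}  # run all heads
```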
  • Figure 51 gives an overview of a multi-functional multi-head model.
  • Figure 52 illustrates an encoder of a multi-functional model. All dashed connections are optional. Temporal shift (10), variational inference (13) with partial model sampling (12), and with consecutive video sampling (11) and a Lipschitz constraint (14), have been added, each one optionally.
  • Figure 53 illustrates a segmentation head (i.e. classification module) of a multi-functional model. All dashed connections are optional. A temporal shift, a Lipschitz constraint and dropout for variational inference have been added to all layers of the decoder, and a Lipschitz constraint has been added to all layers of the situation monitor (each one optionally). The situation monitor processes input from the Bayesian variance of variational inference (optionally), and/or the softmax values normalized by the Bi-Lipschitz constraint (optionally), and/or the softmax variance normalized by the Bi-Lipschitz constraint (optionally).
  • Figure 54 illustrates a video action recognition head (i.e. classification module) of a multi-functional model. All dashed connections are optional. A temporal shift, a Lipschitz constraint and dropout for variational inference have been added to all layers (each one optionally).
  • Figure 55 illustrates a depth estimation head (i.e. classification module) of a multi-functional model. All dashed connections are optional. A temporal shift, a Lipschitz constraint and dropout for variational inference have been added to all layers of the decoder, and a Lipschitz constraint has been added to all layers of the situation monitor (each one optionally).
  • the situation monitor processes input from the Bayesian variance of variational inference (optionally), and/ or the softmax values normalized by the Bi-Lipschitz constraint (optionally), and/ or the softmax variance normalized by the Bi-Lipschitz constraint (optionally).
  • Figure 56 illustrates an autoencoder head (i.e. classification module) of a multi-functional model. All dashed connections are optional.
  • a Lipschitz constraint is added to all layers of the decoder and to all layers of the situation monitor (each one optionally).
  • the situation monitor processes input from the pixel-wise reconstruction loss of the autoencoder head (optionally), and/or the softmax values normalized by the Bi-Lipschitz constraint (optionally), and/or the softmax variance normalized by the Bi-Lipschitz constraint (optionally).
  • Figure 57 illustrates an action forecast head (i.e. classification module) of a multi-functional model. All dashed connections are optional. A temporal shift, a Lipschitz constraint and dropout for variational inference to all layers have been added (each one optionally).
  • Figure 58 illustrates how the system can be implemented following the Sense-Plan-Act concept. Perception, planning and control subsystems shall each have their own hardware. The perception systems, in particular, shall be integrated with their sensor hardware. The map shall be instantiated on the hardware of the planning subsystem.
  • the sensors preferably comprise six vision sensors for the near range, each 120 degrees field of view, each 40 degrees overlapping and each instantiating one separate perception subsystem; see figure 59.
  • the sensors preferably comprise one vision sensor with 60 degrees field of view, instantiating one separate perception subsystem for the medium range.