CN117716395A - System for detecting and managing uncertainty, new object detection and situation expectations in a perception system - Google Patents

Info

Publication number
CN117716395A
Authority
CN
China
Prior art keywords
uncertainty
neural network
class
score
map
Prior art date
Legal status
Pending
Application number
CN202280050415.2A
Other languages
Chinese (zh)
Inventor
R·梅福斯
S·福尔斯特
Current Assignee
Deep Security Co
Original Assignee
Deep Security Co
Priority date
Filing date
Publication date
Application filed by Deep Security Co
Priority claimed from PCT/EP2022/063359 (WO2022243337A2)
Publication of CN117716395A

Landscapes

  • Image Analysis (AREA)

Abstract

According to the invention, a perception system is provided, the perception system comprising a segmented neural network (40) and an uncertainty detector (60). The segmented neural network (40) is configured and trained for segmentation of an input image pixel matrix, thereby generating a segment map composed of elements corresponding to pixels in the input image pixel matrix. By class prediction, each element in the segment map is assigned to one of a plurality of object classes for which the segmented neural network is trained. Elements assigned to the same object class form segments in the segment map. The uncertainty detector (60) is configured to generate an uncertainty score map made up of elements corresponding to pixels in the input image pixel matrix. Each element in the uncertainty map has an uncertainty score that is determined by the uncertainty detector and reflects the amount of uncertainty involved in the class prediction for the corresponding element in the segment map.

Description

System for detecting and managing uncertainty, new object detection and situation expectations in a perception system
The present invention relates to a system for detecting and managing uncertainty in a perception system. In other embodiments, the present invention relates to systems for new object detection and/or situational awareness based on detected uncertainty.
One particular application area of machine learning or deep learning is, for example, computer vision, which may be used in a perception system. In computer vision, deep neural networks learn to recognize object classes in images. Computer vision includes:
-simple classification, wherein the whole image is classified by the object class of its most dominant object;
-object detection, wherein the neural network predicts a bounding box around the object in the image;
semantic segmentation, in which each pixel of an image is classified (tagged) with the object class of the object to which it belongs;
-an optical flow predicting a vector field of movement of the object shown;
and derived methods such as instance segmentation, panoptic segmentation, video segmentation, etc. For all these methods, the neural network needs to be trained by means of a training dataset in order to learn the classes of objects or features to be predicted. For supervised learning, the training dataset consists of pairs of input images and corresponding labels with the desired results (i.e., the object classes represented in the input images). The label may be a simple class name for the whole image in the case of classification, or a pixel-by-pixel manually tagged image in the case of semantic segmentation. In this context, we will most often refer to the case of semantic segmentation. However, the claimed methods and techniques apply to all of the various computer vision techniques and to all of the various architectures of the neural networks associated with these techniques.
The invention relates in particular to semantic segmentation of images by means of a segmented neural network. An image is typically made up of pixels representing a picture taken by a camera having lenses that project the image onto an image sensor. The image sensor converts the projected picture into a matrix of pixels representing the picture. The picture or image may be, for example, a frame of a video stream.
The matrix of pixels representing the image may be fed to the input layer of the segmented neural network and thereby used as the matrix of input image pixels. Each input image pixel matrix represents a sample to be processed through the segmented neural network.
Semantic segmentation of an image represented by a matrix of pixels is used to assign regions in the image (or more precisely, the individual pixels in the corresponding sections of the image) to identified objects. To identify objects in an image, a convolutional neural network (CNN), for example a fully convolutional network (FCN), is generally used. Such convolutional neural networks are trained as multi-class classifiers that can detect the objects in the image for which they have been trained. A fully convolutional network used for semantic segmentation typically includes multiple convolutional layers for detecting the appearance of features (i.e., the objects for which the CNN is trained) and pooling layers for downsampling the output of the convolutional layers at particular stages. Each layer is made up of nodes. A node has inputs and an output. The inputs of a node may be connected to some or all of the outputs of the nodes of the previous layer, thereby receiving output values from those nodes. The values received by a node via its inputs are weighted, and the weighted inputs are added to form a weighted sum. The weighted sum is transformed into the output of the node by the activation function of the node.
The structure (topology or architecture thereof) defined by the layers of the neural network is defined prior to training of the neural network.
During training of the neural network, the weights in the nodes of the layers of the neural network are modified until the neural network provides a desired or expected prediction. Thus, the neural network learns to predict (e.g., identify) the class of objects or features for which the neural network is trained. Parameters created during training of the neural network, in particular the weights of the inputs of the nodes, form a model. Thus, the trained neural network implements the model.
As regards the convolutional layers, it is known to the skilled person that these use filter kernel arrays whose size is much smaller than the pixel matrix representing the image. A filter kernel array is composed of array elements with weight values. The filter kernel array is moved stepwise over the image pixel matrix, and as the filter kernel array is "moved" over the image pixel matrix, the individual values of the pixels in the image pixel matrix are multiplied element by element with the weight values of the corresponding elements of the filter kernel array, thereby convolving the image pixel matrix. Typically, multiple filter kernel arrays are used to extract different low-level features.
The convolved output from such convolutions in the convolution layer may be fed as input to the next convolution layer and convolved again by means of the filter kernel array. The convolved output is referred to as a feature map. For example, a feature map may be created for each color channel of a color image.
Finally, the convolved output of the convolutional layer may be rectified by means of a nonlinear function (e.g., the ReLU (rectified linear unit) operator). The ReLU operator is a preferred activation function for the nodes of the convolutional layer; it removes all negative values from the output of the convolutional layer. The purpose of the nonlinear function is that without it the network could only learn linear relationships, since the convolution operation itself is linear: for the learned weights w and inputs x, s(t) = (x * w)(t); see https://www.deeplearningbook.org/contents/convnets.html, Eq. 9.2. The nonlinear function enables the network to learn nonlinear relationships. The elimination of negative values is only a side effect of ReLU, which is unnecessary for learning and may even be undesirable; "leaky ReLU", for example, does not have this side effect. Most architectures choose ReLU due to its simplicity: ReLU(z) = max(0, z); see https://www.deeplearningbook.org/contents/mlp.html, Eq. 6.37. The ReLU operator is part of the corresponding convolutional layer.
In order to reduce the size of the feature map by downsampling, a pooling layer is used. One way of downsampling is called Max-Pooling. When maximum pooling is used, for example, each 2 x 2 sub-array of the feature map is replaced with a single value corresponding to the maximum of four elements of the 2 x 2 sub-array.
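The following is a minimal sketch, not the patent's implementation, of one encoder stage combining the convolution, ReLU activation and 2 x 2 max-pooling described above; the channel counts and input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

# One encoder stage of a fully convolutional network (illustrative sizes):
# convolution with learned filter kernels, ReLU activation, 2x2 max-pooling.
encoder_stage = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(1, 3, 256, 256)   # dummy input image pixel matrix (RGB)
feature_map = encoder_stage(x)    # shape: (1, 64, 128, 128), a downsampled feature map
```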
The downsampled feature map may be processed again in a convolutional layer. Finally, the feature map from the final convolutional layer or pooling layer is a score map.
For each object class for which the neural network is trained, a feature score map is generated, and from the feature score maps a segment map is generated. If, for example, the neural network is trained for five object classes (e.g., cars, traffic signs, humans, street signs, street boundaries), five feature score maps are generated.
For an object class, the score in the feature score map represents the likelihood that a pixel represents an object of that object class, i.e., an object for which the fully convolutional network is trained and which is represented in the input image pixel matrix. Thus, the objects represented in the input image pixel matrix are "identified", and a feature score map is generated for each object class, wherein a score is formed for each (downsampled) pixel, which score indicates the likelihood that the pixel represents an object of that object class. The score corresponds to the activation level of the element in the feature score map. The scores of corresponding elements in the different feature score maps may be compared with each other in order to assign each element or pixel to one of the known object classes. Thus, a segment map may be generated in which each element is labeled with a label indicating to which of the object classes for which the segmented neural network is trained the individual pixel most likely belongs.
For example, if the neural network is trained for five object classes, five feature score maps are generated. With the aid of the Softmax function, the scores in the individual feature score maps can be normalized to a scale between 0 and 1; see https://www.deeplearningbook.org/contents/mlp.html, Eq. 6.29. The normalized scores of corresponding elements of the feature score maps may be compared, and the element may be assigned to the object class corresponding to the score map having the highest score for that element. This is done by means of a maximum likelihood estimator using the Argmax function; see https://www.deeplearningbook.org/contents/ml.html, Eq. 5.56. Thus, by comparing the scores of the corresponding elements in the feature score maps, a segment map can be generated from the feature score maps.
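A minimal sketch of this Softmax/Argmax step, assuming a tensor of per-class feature score maps of shape (num_classes, H, W); the class count and resolution are illustrative.

```python
import torch
import torch.nn.functional as F

scores = torch.randn(5, 128, 128)          # five feature score maps (one per object class)
probs = F.softmax(scores, dim=0)           # normalize scores to a scale between 0 and 1
segment_map = torch.argmax(probs, dim=0)   # per-element maximum-likelihood class label
```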
The score of each element in the feature score map for the corresponding object class represents the activation level of that element. The higher the activation level (and thus the higher the score), the higher the chance that the element in the score map corresponds to a pixel (or pixels) belonging to an object of the corresponding object class in the input image pixel matrix; e.g., if the score map is for the object class "car", the higher the chance that the pixel belongs to a car in the image.
The fully convolutional network (FCN) acts as an encoder that detects objects in an image matrix and generates low-resolution tensors that contain high-level information (i.e., complex features).
The final feature score map is upsampled to assign a score to each pixel in the input pixel matrix that indicates the likelihood that the pixel represents an object of the known object class. Such upsampling may be achieved by means of bilinear interpolation.
Alternatively, upsampling may be implemented by a decoder.
For semantic segmentation, the architectures are encoder-decoder architectures, where downsampling refers to the process of learning simple, more abstract features from pixels in the first convolutional layers, more complex features from simple features in the next layers, and so on; the decoder learns the converse. Since the size of the feature score map of the last layer is not necessarily the same as the size of the input image, and since at least during training the size of the output needs to match the size of the label (which has the same size as the input), upsampling is done by means of bilinear interpolation.
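A sketch of such bilinear upsampling of the final feature score maps to the input resolution; the 512 x 512 input size is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

low_res_scores = torch.randn(1, 5, 64, 64)      # final low-resolution feature score maps
upsampled = F.interpolate(low_res_scores, size=(512, 512),
                          mode="bilinear", align_corners=False)
# upsampled now has one score per class for every pixel of the input image pixel matrix
```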
For semantic segmentation, a segment map is generated from the feature score map. Each element in the segment map is assigned a label indicating for which of the known object classes the highest level of activation was found in the feature score map.
If the convolutional neural network is not used for semantic segmentation of the image, but is used only to identify the presence of objects in the image (i.e., simple classification), the output of the final convolutional layer, the ReLU layer, or the pooling layer may be fed into a fully connected layer that generates an output vector, where the values of the vector elements represent scores indicating the likelihood that the corresponding object is present in the analyzed image. The output vector of such a classification CNN is therefore a feature vector. Neural networks for semantic segmentation (further referred to as segmented neural networks) do not have such fully connected layers, as this would destroy information about the position of the object in the input image matrix.
In any case, it is necessary to train the neural network by means of a training dataset comprising image data and labels (ground truth) indicating the content represented by the image data. The image data represent the input image matrices and the labels represent the desired outputs. In the backpropagation process, the weights and filter kernel arrays are iteratively adapted until the difference between the actual output of the CNN and the desired output is minimized. The difference between the actual output of the CNN and the desired output is calculated by means of a loss function.
The neural network calculates class predictions for individual pixels in the output image based on a training dataset comprising pairs of input images and truth images consisting of correct class labels. In training, the loss function compares the input class labels to predictions made by the neural network, and then pushes parameters of the neural network (i.e., weights in nodes of the layer) in a direction that will result in better predictions. When such training is repeated with a large number of pairs of input images and truth images, the neural network will learn the abstract concepts of a given class.
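A hedged sketch of such a supervised training step with a pixel-wise loss; `model` and `train_loader` are assumed to exist, and the optimizer and learning rate are illustrative choices, not the patent's prescription.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                  # pixel-wise loss over class scores
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for images, labels in train_loader:                # pairs of input images and truth images
    scores = model(images)                         # (N, num_classes, H, W) feature score maps
    loss = criterion(scores, labels)               # labels: (N, H, W) with class indices
    optimizer.zero_grad()
    loss.backward()                                # backpropagation of the loss
    optimizer.step()                               # adapt weights and filter kernels
```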
As indicated above, semantic segmentation results in assigning pixels of an image to a known object, i.e., an object for which a neural network is trained. Simple semantic segmentation does not distinguish between several instances of the same object in an image. For example, if multiple cars are visible in an image, then all pixels belonging to a car are assigned to the object "car". No individual car (i.e., an instance of the object "car") is identified. Discriminating object instances of an object class requires instance segmentation.
Nevertheless, pixels that do not belong to the object for which the neural network is trained will still be assigned to one of the object classes for which the neural network is trained, and the score may even be high. Even with a low score, the object class with the highest relative score will "win out", i.e., the pixel will be assigned to the object class with the highest score.
When using computer vision and semantic segmentation to facilitate autonomous driving in a real-world environment (e.g., in a vehicle), the uncertainty involved in computer vision needs to be considered.
Alex Kendall et al., "Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding", 10 October 2016, https://arxiv.org/pdf/1511.02680v2.pdf, presents a deep learning framework for probabilistic pixel-wise semantic segmentation, as well as a system that can use a model uncertainty measure to predict pixel-wise class labels.
It is an object of the present invention to provide a system that can assess uncertainty in a perception system.
According to the present invention, a perception system is provided, comprising a segmented neural network and an uncertainty detector.
The perception system may be part of a vehicle (e.g., an automobile) to enable autonomous travel. In various embodiments, the perception system may also be part of various autonomous machines (e.g., robots, cargo systems, or other machines designed to operate at least partially autonomously).
The segmented neural network is configured and trained for segmentation of an input image pixel matrix, thereby generating a segment map composed of elements corresponding to pixels in the input image pixel matrix. By class prediction, each element in the segment map is assigned to one of a plurality of object classes for which the segmented neural network is trained. Elements assigned to the same object class form segments in the segment map.
The uncertainty detector is configured to generate an uncertainty score map (also referred to as an "uncertainty map") that is made up of elements corresponding to pixels in the input image pixel matrix. Each element in the uncertainty map has an uncertainty score that is determined by the uncertainty detector and reflects the amount of uncertainty involved in the class prediction for the corresponding element in the segment map.
Preferably, the uncertainty detector is configured to access the feature score maps generated by the segmented neural network before the segment map is generated from them. The uncertainty detector is configured to determine the amount of uncertainty for each element in the uncertainty score map, and thereby the uncertainty score, by determining the variance of the activation levels of the elements of the feature score maps.
Preferably, the uncertainty detector is configured to determine the amount of uncertainty of each element in the uncertainty score map, and thereby the uncertainty score, by determining the inter-sample variance of the activation level of the corresponding element across different samples of the feature score map.
Preferably, the uncertainty detector is configured to generate the inter-sample variance of the activation levels of the elements in the feature score map across different samples of the feature score map by processing the input image pixel matrix with the segmented neural network in multiple passes while randomly modifying the segmented neural network for each pass.
Preferably, the random modification of the segmented neural network comprises at least one of: deactivating (dropping out) nodes of convolutional layers of the segmented neural network, altering the activation function in at least some nodes of at least some layers of the segmented neural network, and/or introducing noise in at least some nodes of at least some layers of the segmented neural network.
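A minimal sketch of this multi-pass sampling using Monte Carlo dropout as the random modification, assuming the segmented neural network contains dropout layers; the number of passes is an illustrative assumption.

```python
import torch

def mc_dropout_uncertainty(model, image, passes=10):
    """Per-element variance of the feature score maps over several randomized passes."""
    model.train()                          # keep dropout layers active during inference
    with torch.no_grad():
        samples = torch.stack([model(image) for _ in range(passes)])  # (T, N, C, H, W)
    variance = samples.var(dim=0)          # inter-sample variance per class and element
    return variance.mean(dim=1)            # uncertainty score map, one value per element
```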
Preferably, the uncertainty detector is configured to determine the inter-sample variance of the activation level of corresponding elements in the feature score maps generated from consecutive input image pixel matrices corresponding to frames of a video sequence. Since in this case the samples are frames of the video sequence, the inter-sample variance corresponds to the variance between frames and is thus also referred to as "inter-frame variance" in the following. In a potential implementation of this embodiment, one pass for determining the inter-sample variance from n consecutive images would process images a..a+n and then restart with frames a+n+1..a+2n. To increase the likelihood that an unknown object o that appears somewhere in the sequence a..a+i..a+n is not missed, it is preferred to run a number of such passes in parallel, each pass starting with an offset b; for example, a second parallel pass processes a+b..a+b+n and then restarts with a+b+n+1..a+b+2n. In this second parallel pass, object o will be closer to the beginning of the pass than in the first parallel pass, and the process will establish a better variance with respect to object o.
To improve the efficiency of the determination of the inter-frame variance, only every second, every third, or every nth frame in the corresponding sequence may be used, depending on the frame rate. Typical frame rates will be between 25 frames per second (fps) and 60 fps.
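An illustrative sketch of the inter-frame variance over a video sequence, taking only every k-th frame as a sample; compensation of pixel displacement between frames (discussed further below) is omitted here for brevity.

```python
import torch

def inter_frame_variance(model, frames, stride=2):
    """Variance of corresponding feature-score elements across consecutive frames."""
    model.eval()
    with torch.no_grad():
        scores = torch.stack([model(f) for f in frames[::stride]])  # (T, N, C, H, W)
    return scores.var(dim=0).mean(dim=1)    # per-element variance across the sampled frames
```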
Preferably, the uncertainty detector is configured to determine the amount of uncertainty of each element in the uncertainty score map, and thereby the uncertainty score, by determining the inter-class variance between the activation levels of the corresponding elements in the feature score maps for the different object classes as provided by the segmented neural network.
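A sketch of such an inter-class variance computed across the per-class feature score maps; how the variance is mapped to an uncertainty score (e.g., treating low variance between classes as high ambiguity) is an assumption here, not specified by this passage.

```python
import torch

scores = torch.randn(5, 128, 128)            # per-class feature score maps (C, H, W)
inter_class_variance = scores.var(dim=0)     # (H, W): variance across the class axis
# One possible reading: where no class clearly dominates, the variance between class
# scores is low, which can be interpreted as high class-prediction uncertainty.
```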
Preferably, the uncertainty detector comprises a generative neural network, in particular a variational autoencoder, trained for the same classes or objects for which the segmented neural network is trained.
Preferably, the uncertainty detector is configured to:
-detecting a region in the uncertainty score map consisting of elements with high uncertainty scores; and
-tagging the region consisting of elements with high uncertainty scores as a candidate for representing an object of an as yet unknown object class.
A high uncertainty score may be an uncertainty score that is higher than the average uncertainty score of all elements of the uncertainty score map. A high uncertainty score may be an uncertainty score that is higher than the median of the uncertainty scores of all elements in the uncertainty score map. A high uncertainty score may also be an uncertainty score that exceeds the average or the median of the uncertainty scores of all elements in the uncertainty score map by a predetermined amount.
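A sketch of such a thresholding rule; the margin value and the choice between mean and median are illustrative assumptions.

```python
import torch

def high_uncertainty_mask(uncertainty_map, margin=0.1, use_median=False):
    """Boolean mask of elements whose uncertainty exceeds the mean/median by a margin."""
    reference = uncertainty_map.median() if use_median else uncertainty_map.mean()
    return uncertainty_map > (reference + margin)
```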
It is further preferred that the uncertainty detector is configured for:
-verifying that a region tagged as a candidate for representing an object of an as yet unknown object class does indeed represent an object of an as yet unknown object class, by determining the plausibility that the segment in the input image pixel matrix corresponding to the tagged region represents an unknown object.
Potential means for determining such plausibility are disclosed in the section "Preferred ways of checking the plausibility that a segment represents an unknown object" below.
preferably, the perception system is configured to reconfigure the segmented neural network by adding a model for a new object class (e.g., by introducing or configuring other layers having weights representing the new object class, wherein the new object class includes hitherto unknown objects represented in the segments found to represent the hitherto unknown objects).
Creating a new segmented neural network configured to predict the new object class (i.e., incorporating a model for the new object class, the model being created by training the neural network), or reconfiguring an existing segmented neural network so that it can predict objects of the additional new object class, is part of a domain adaptation that extends the operating domain of the perception system.
Reconfiguring an existing segmented neural network preferably comprises: training the segmented neural network using input image pixel matrices representing the hitherto unknown object in combination with labels for the then new object class (i.e., ground truth for the hitherto unknown object class that becomes the new object class). The labels may be generated automatically for the segments representing the unknown objects.
Preferably, the existing segmented neural network, or preferably a second, similar partner neural network, is trained with the most recently determined unknown object class, either locally or in the cloud, by uploading the most recently detected object class and downloading the trained neural network or the trained partner neural network after training (e.g., back to the perception system on the autonomous vehicle).
For example, recently detected object classes may be shared directly with other autonomous vehicles, so that each vehicle may train their segmented neural network or similar partner neural network over a large number of unknown object classes without having to rely on updates from the cloud.
Such training on multiple autonomous vehicles may be parallelized by implementing distributed data parallel training (as explained in: PyTorch Distributed: Experiences on Accelerating Data Parallel Training, https://arxiv.org/pdf/2006.15704.pdf). Any such training will preferably be performed by means of few-shot learning (as explained in: An Overview of Deep Learning Architectures in Few-Shot Learning Domain, https://arxiv.org/pdf/2008.06365.pdf).
The assignment of individual, as yet unknown, objects to new object classes can be automated based on their similarity, preferably determined by means of unsupervised segmentation (as explained in: Unsupervised Learning of Image Segmentation Based on Differentiable Feature Clustering, https://arxiv.org/pdf/2007.09990.pdf).
The newly created model for the new object class may be automatically associated with already existing models for known object classes by means of similarity, determined by means of one-shot learning (as explained in: Fully Convolutional One-Shot Object Segmentation for Industrial Robotics, https://arxiv.org/pdf/1903.00683.pdf). In one-shot learning, the elements of the abstract feature vector of the neural network that performs the one-shot learning have a semantic structure; e.g., the feature vector may be composed of a textual description of an object, where, for example, the description of a rickshaw will be similar to that of an automobile, so that a recently determined object class for a rickshaw may be considered similar to an automobile by the autonomous vehicle without requiring manual association.
According to the present invention, there is provided a method for semantic segmentation of an input image pixel matrix by class prediction performed by a segmented neural network and for determining the amount of uncertainty involved in the class prediction for each pixel in the input image pixel matrix. The method comprises the following steps:
-segmenting the input image pixel matrix by means of a segmented neural network trained for a plurality of object classes, generating a feature score map for each object class, and generating a segment map by assigning, by class prediction and according to the feature score maps, each element in the segment map of the input image pixel matrix to one of the plurality of object classes, the elements assigned to the same object class forming segments in the segment map; and
-generating an uncertainty score map (abbreviated as uncertainty map) consisting of elements corresponding to pixels in the input image pixel matrix, each element in the uncertainty map having an uncertainty score determined by an uncertainty detector and reflecting the amount of uncertainty involved in class prediction for the corresponding element in the segment map.
Preferably, the method further comprises the steps of:
-detecting a region in the uncertainty score map consisting of elements with high uncertainty scores; and
-tagging a region consisting of elements with a high uncertainty score, which is a higher uncertainty score than the average uncertainty score of all elements in the uncertainty score map, as a candidate for representing an object of the object class that is not yet known.
Preferably, the method further comprises the steps of:
-if a region consisting of elements with high uncertainty score is detected, creating a new object class representing the object as shown in the region in the input image pixel matrix corresponding to the region in the uncertainty score map consisting of elements with high uncertainty score.
The method for generating a new object class may thus comprise the steps of:
-detecting a region in the uncertainty score map consisting of elements with high uncertainty scores;
-generating tentative new object classes and automatically generating tags; and
-identifying an existing object class or another new object class similar to the tentative new object class, for example by means of few-shot learning.
The method may comprise the step of generating a tentative new object class in case an unknown object is detected. If another unknown object is detected, the method comprises the step of assigning the other unknown object, based on its similarity, either to the existing tentative new object class or to another new object class for that other unknown object. In the case where the other unknown object can be assigned to the previously detected unknown object (based on the similarity of the objects), a new object class may be generated (i.e., one that is no longer tentative) by one-shot or few-shot learning using the samples representing the new objects that are similar to each other.
Alternatively, the method for generating a new object class may comprise the steps of:
-generating a feature score map from a matrix of input image pixels (samples) captured at different time instances;
-detecting a region in the uncertainty score map consisting of elements with high uncertainty scores, the region representing a first unknown object;
-generating a tentative new object class for said unknown object and automatically generating a label;
-detecting a further region in a further uncertainty score map consisting of elements with high uncertainty scores, the region representing a further unknown object;
-determining a similarity between the first unknown object and the further unknown object; and
-generating a new object class that is non-tentative if the similarity exceeds a predetermined threshold.
Preferably, the step of generating a non-tentative new object class comprises: performing one-shot or few-shot learning using the samples (i.e., the input image pixel matrices) representing the first unknown object and the further unknown object.
Explicit manual labels are not required; only automatically generated labels are used for generating new object classes (e.g., by means of a new object detector) and for transfer learning of new object classes (e.g., as generated by the new object detector).
The relevance of a new object class may be determined based on the user's reaction in the context of encountering an unknown potential new object. This is explained in further detail below. The user's reaction in the context of encountering an unknown potential new object may be determined based on optical flow, inertial sensors (gyroscopes), or from signals on the CAN bus of the vehicle.
The new object class thus fully automatically generated may be used for training and thus updating the semantic segmentation model implemented by the segmentation neural network of the perception system.
A region composed of elements with high uncertainty scores may correspond to a portion of a section in a section graph. For example, if a portion of the input image pixel matrix represents an object that is not already known, then the pixels of that portion will be assigned to a known object class by the segmented neural network. However, pixels representing unknown objects (which do not correspond to any of the object classes for which the segmented neural network is trained) will typically exhibit a higher uncertainty score, so objects that are not yet known can be found by means of the uncertainty score map.
Preferably, the uncertainty score is determined for different classes of objects by determining an inter-class variance of the activation level of the elements in the feature score map.
Preferably, the uncertainty score is determined for one object class by determining the inter-sample variance of the activation level of pixels of different samples in the feature score map.
Another aspect of the invention is the use of the method and its embodiments as exemplified herein in an autonomous machine, in particular in an autonomous vehicle.
Accordingly, a system for detecting and managing uncertainty in a perception system is provided, the system comprising a data processing unit implementing a segmenter comprising a trained segmented neural network for semantic segmentation of an input image matrix. The neural network includes a convolutional layer and is configured to perform semantic segmentation of the input image pixel matrix. Semantic segmentation is based on a plurality of object classes for which the neural network is trained.
The system also includes an uncertainty detector configured to determine an uncertainty score for each pixel, group of pixels, or segment. The uncertainty score reflects how certain (or uncertain) it is that a pixel of a segment represents the object class assigned to it by the segmented neural network. A low level of certainty (i.e., a high level of uncertainty) may mean that a pixel of a segment (and thus the segment) is erroneously assigned to an object. The reason may be that the segment represents an object for which the neural network of the segmenter has not been trained (i.e., an object of an as yet unknown object class), i.e., an unknown object or object class.
The uncertainty detector may be configured to generate an uncertainty score map in which each pixel in the segment map (and thus, typically, each pixel in the input image pixel matrix) is assigned an uncertainty score. Preferably, the size of the uncertainty score map matches the size of the segment map generated by the segmented neural network. Each element of the uncertainty score map then corresponds directly to an element in the segment map and preferably also to one pixel in the input image pixel matrix.
The uncertainty detector is connected to a segmented neural network for semantic segmentation of the input image matrix.
Thus, the system comprises:
-a segmented neural network generating a segment map of the respective input image pixel matrix; and
-an uncertainty detector that generates an uncertainty score for each pixel in the input image pixel matrix and/or each element in the section map.
In an alternative embodiment, the uncertainty detector
-may be configured to determine the uncertainty from the feature score maps for the different object classes provided by the segmented neural network by analyzing the inter-class variance of the activation levels (feature scores) of the pixels in a segment. In particular, the uncertainty detector may be configured to determine a softmax confidence (see the sketch after this list).
-may be configured to cause changes in the nodes of the layers of the segmented neural network. These changes may be achieved by randomly deactivating nodes (Monte Carlo dropout) or by randomly modifying the activation function of some of the nodes. The input image pixel matrix is processed multiple times by the segmented neural network, and the segmented neural network is randomly modified each time, providing a modified feature score map. Thus, a plurality of varying samples of the feature score map are generated. The inter-sample variation of the feature score map samples depends on the effect of the variation of the nodes on the scores in the score map. Processing the input image pixel matrix multiple times with the segmented neural network while randomly modifying it causes the segmented neural network to become a Bayesian neural network that implements variational inference. Preferably, instead of processing the same input image pixel matrix multiple times, multiple frames of a video sequence may be used as input image pixel matrices. Each frame of the video sequence is an input image pixel matrix. If the frame rate of the video sequence is high enough, the inter-frame variation (i.e., the variation from frame to frame) is small enough, because the recorded images do not change much from frame to frame. The segmented neural network is randomly modified for the processing of the individual frames (i.e., the individual image pixel matrices of the video sequence).
-may comprise a generative neural network, in particular a variational autoencoder trained for the same classes or objects for which the segmented neural network is trained.
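A minimal sketch of the softmax confidence mentioned in the first alternative above: the per-element maximum of the softmax-normalized class scores, with one minus that value serving as a simple per-element uncertainty score (the latter mapping is an assumption for illustration).

```python
import torch
import torch.nn.functional as F

scores = torch.randn(5, 128, 128)            # per-class feature score maps
probs = F.softmax(scores, dim=0)             # normalized class probabilities per element
softmax_confidence, _ = probs.max(dim=0)     # (H, W): confidence of the winning class
uncertainty = 1.0 - softmax_confidence       # simple per-element uncertainty score
```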
The system is further configured to:
-generating semantically segmented image pixel matrix samples using a segmented neural network;
-generating inter-frame variances from a sensor data stream (i.e., a video stream or video sequence) consisting of frames, i.e., mapping corresponding pixels of consecutive frames and taking each consecutive frame, or every n-th consecutive frame, as a sample instead of sampling each frame multiple times;
-analyzing inter-class variances and/or inter-frame variances in per-pixel activation levels of individual object classes for which the neural segmentation network is trained, i.e. analyzing per-pixel scores of individual object classes;
-determining an uncertainty score by assessing an uncertainty level or an amount of uncertainty based on an analysis of inter-class variances and/or inter-frame variances in per-pixel activation levels in the semantically segmented image pixel matrix; and
-identifying a section showing the unknown object according to the determined uncertainty.
The mapping of corresponding pixels of successive frames of the video sequence may comprise determining the displacement of the corresponding pixels between two frames, e.g. based on position sensor data, which may be collected, e.g. by means of inertial sensors. Thus, the movement of the video camera providing the video sequence may be determined such that displacement vectors of pixels between frames can be determined, which displacement vectors are to be used for determining the inter-frame variance.
Preferably, the system is configured to use more than one source of uncertainty, e.g., inter-class uncertainty and inter-frame uncertainty, simultaneously and to combine these uncertainties, e.g., by means of a weighted sum, wherein the weights depend on the number of samples that have been taken. For example, using only the first 2 or 3 frames to determine the inter-frame variance does not yet yield truly good results, so for frames 1 ... m in the sequence, only the inter-class variance may be used, while for frames m+1 ... n, the weight on the inter-frame variance can be increased and the weight of the inter-class variance in the weighted sum of the two can be reduced. Instead of a weighted sum, a Bayesian filter (e.g., a particle filter) may be used to combine these uncertainties; instead of weights, confidence scores in the uncertainty scores for each of the two types of uncertainty (inter-class uncertainty and inter-frame uncertainty) may be provided, and the Bayesian filter will generate a confidence score in the resulting uncertainty.
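A sketch of the weighted-sum variant of this combination; the ramp between frames m and n and the linear weighting schedule are illustrative assumptions.

```python
def combine_uncertainties(inter_class, inter_frame, frames_seen, m=3, n=30):
    """Weighted sum of two uncertainty maps; trust inter-frame variance more over time."""
    if frames_seen <= m:
        w = 0.0                                    # too few frames for a useful inter-frame variance
    else:
        w = min(1.0, (frames_seen - m) / (n - m))  # gradually shift weight to inter-frame variance
    return (1.0 - w) * inter_class + w * inter_frame
```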
Effectively, the system is further configured to:
-discriminating between different unknown objects by analyzing the uncertainty.
In a preferred embodiment, the system is configured to
-creating a new object class based on an analysis of the uncertainties of pixels in the section determined to display the unknown object.
The segmented neural network is preferably a fully convolutional network (FCN).
The uncertainty may be either aleatoric or epistemic. Aleatoric uncertainty is caused by statistical variations in the environment and often occurs at the boundaries of segments in the segment map. Epistemic uncertainty (also referred to as systematic uncertainty) results from a mismatch of the model (e.g., the model implemented by the neural network), for example because the neural network has not been trained for a particular object class that occurs in the environment; see Kendall and Gal, https://arxiv.org/abs/1703.04977.
The uncertainty may be determined and quantified by analyzing the amount of variance of the activation levels (i.e., the scores in the feature score map) over a plurality of samples of the feature score map generated by the segmented neural network, the plurality of samples being generated from one input image pixel matrix or from a plurality of similar input image pixel matrices. For each pass, the segmented neural network is modified to create a variance in the segmented neural network, and thus the segmented neural network becomes a Bayesian neural network that implements variational inference. In the Bayesian neural network, the input image matrix is processed in multiple passes, with some nodes of the CNN being randomly deactivated for each pass. This technique is known as variational inference. Because of the deactivation of nodes in some passes, the activation levels vary from pass to pass. The higher this variance of the scores (i.e., activation levels) in the feature score map, the higher the uncertainty. Alternatively, Gaussian noise may be applied to the signal at the input nodes, to the weights, or to the activation functions for variational inference; see Yarin Gal's thesis, http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf.
To increase the efficiency of the variational inference, random deactivation (dropout) may be applied to only some layers or even to only one layer.
The variance of the activation levels may also be caused by the use of a sequence of pictures of the same scene taken from different locations or at different points in time. Such a sequence may be the frames of a video sequence recorded by a video camera of a moving vehicle. This aspect is referred to as the "inter-frame variance" mentioned earlier herein.
If the neural network is not a Bayesian neural network, the uncertainty can be determined by determining the amount or the pattern of variance of the activation levels. Uncertainty can also be determined via the reconstruction error of a generative neural network.
Unknown objects (i.e., objects for which the neural network is not trained or objects for which the neural network is not identified) result in a high level of uncertainty.
Thus, if the unknown object is represented by pixels in the input image pixel matrix, these pixels, and thus the unknown object represented by these pixels, can be "found" by determining the uncertainty. This may be used to assign segments in the input image pixel matrix to unknown objects and even further discern different unknown objects in such segments in the input image pixel matrix.
Hereinafter, the level or "amount of uncertainty of a pixel" refers to the uncertainty determined by analyzing the variance of the activation level (i.e., the score) of an element in a score map of an object class. The elements in the feature score map (array) of an object class each relate to one or more pixels in the input image matrix. Thus, the amount of uncertainty (and thus the uncertainty score) that can be determined from the activation level (i.e., the score) of an element in the score map is also the amount of uncertainty of the image pixel or pixels corresponding to that element in the score map.
Preferred ways of determining the uncertainty are:
the uncertainty score may be determined based on the amount of variance or the pattern of variances, or the type of variances (i.e., cognitive or yet to be affirmative), or based on the temporal variation of variances of the activation level (i.e., the score of a feature score map), or a combination of these parameters.
The uncertainty determined from the activation levels may be mapped onto the input image pixel matrix according to the size of the input or the size of the segment map. This means that the uncertainty score map, which holds the uncertainty scores, initially has the same size as the feature score map/segment map and may have to be scaled up by means of bilinear interpolation, e.g., to the size of the input image pixel matrix, just as the feature score map has to be scaled up to the size of the input image pixel matrix.
The presence of a segment representing an unknown object may be determined based on a distribution of uncertainty amounts assigned to pixels in the input image matrix (i.e., a distribution of uncertainty scores in an uncertainty score graph), wherein the uncertainty amounts (i.e., the uncertainty scores of individual elements/pixels) are determined based on a variance of activation levels of elements in a feature score graph that are related to pixels in the input image matrix.
Image segments representing unknown objects may be determined by determining clusters of pixels with a high uncertainty score (i.e., for which a high amount of uncertainty is found) or by determining contiguous pixels with a similar amount of uncertainty (and thus a similar uncertainty score). The amount of uncertainty, and thus the uncertainty score, for a pixel is determined by determining the variance of the activation levels of those elements in the feature score map that correspond to the pixel in the input image pixel matrix.
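A sketch of finding such clusters of high-uncertainty pixels as candidate segments of unknown objects, using connected-component labelling as one possible implementation; the threshold and minimum cluster size are illustrative assumptions, and the uncertainty map is assumed to be a 2D NumPy array.

```python
import numpy as np
from scipy import ndimage

def unknown_object_regions(uncertainty_map, threshold, min_size=50):
    """Return pixel-coordinate clusters whose uncertainty score exceeds the threshold."""
    mask = uncertainty_map > threshold              # pixels with a high uncertainty score
    labeled, num = ndimage.label(mask)              # connected clusters of such pixels
    regions = []
    for region_id in range(1, num + 1):
        coords = np.argwhere(labeled == region_id)
        if len(coords) >= min_size:                 # ignore tiny clusters (likely noise)
            regions.append(coords)                  # one candidate unknown-object segment
    return regions
```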
The size of the section representing the unknown object may be determined from the size of the cluster of pixels with high uncertainty scores or by the length of the contour of the field of contiguous pixels with similar uncertainty scores.
The location of the segment representing the unknown object may be determined from the location of the cluster of pixels with high uncertainty scores or by determining the location of the contour of the neighboring pixels exhibiting similar uncertainty scores.
The relative movement of the segments representing the unknown object may be determined from the temporal change in the position of the cluster of pixels with high uncertainty scores or the position of the contour of the neighboring pixels exhibiting similar uncertainty scores.
The relative distance between the camera generating the input pixel matrix and the segment representing the unknown object may be determined from the temporal variation of the variance of the activation levels. Thus, the 3D position of an unknown object can be calculated from the sequence of 2D input image pixel matrices.
Moreover, the relative distance between the camera generating the input pixel matrix and the unknown object represented in the segment may be determined by an additional depth estimation model.
Preferred ways of checking the plausibility that a segment represents an unknown object
Since the segmented neural network assigns each pixel to a known object class, the segmented neural network does not identify an unknown object as such. However, pixels representing objects that are not yet known may be determined by means of analyzing the uncertainty.
Ways of checking the plausibility that a segment in the input image matrix represents an unknown object, and thus of confirming that a segment of pixels with a high amount of uncertainty plausibly represents an unknown object, are:
the rationality of the segment representing the unknown object may be obtained from the determined contour of the segment and its correlation with the segment predictions generated by the neural network (i.e. semantic segmentation). If the entire segment is composed of pixels with high uncertainty scores (i.e., most of the pixels within the contour of the segment), then the segment is likely to represent an unknown object.
The plausibility that a segment represents an unknown object can be determined from the detected contour of the segment and its correlation with an alternative segmentation of the input image matrix (e.g., by means of an autoencoder).
The plausibility that a segment represents an unknown object can be determined from the temporal variation of the variance of the activation levels.
The plausibility that a segment represents an unknown object may be determined by comparing the segment with another segment representing an unknown object found by a coexisting redundant subsystem for semantic segmentation.
The plausibility that a segment represents an unknown object may be determined based on a comparison with another segment representing an unknown object, found by a different system for semantic segmentation using input image matrices representing the same scene taken from different perspectives. Such a comparison includes a transformation of the two segments representing the unknown object into a common coordinate system (e.g., a global coordinate system).
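As announced above, here is a sketch of the first plausibility check: a segment counts as a plausible unknown object if most of the pixels inside its contour carry a high uncertainty score; the 0.8 ratio is an illustrative assumption, and both inputs are assumed to be 2D boolean NumPy arrays.

```python
import numpy as np

def segment_is_plausible_unknown(segment_mask, high_uncertainty_mask, ratio=0.8):
    """True if most pixels of the predicted segment also have a high uncertainty score."""
    inside = high_uncertainty_mask[segment_mask]    # uncertainty flags inside the segment
    return inside.mean() >= ratio                   # "most" pixels are highly uncertain
```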
According to another aspect, a system is provided that includes a main segmentation system and a separate autonomous device.
The primary segmentation system includes a primary perception system including a segmented neural network configured and trained to segment an input image pixel matrix to generate a segment map. Each element in the segment map is assigned to one of a plurality of object classes for which the segmented neural network is trained by class prediction. Elements assigned to the same object class form segments in the segment map. The primary segmentation system may be part of an Autonomous Driving System (ADS) and typically does not include an uncertainty detector.
The individual autonomous device includes: a sensor for generating an input image pixel matrix made up of matrix elements, a segmented neural network, and an uncertainty detector. The uncertainty detector is configured to generate an uncertainty score map consisting of matrix elements corresponding to pixels in the input image pixel matrix. Each matrix element in the uncertainty score map has an uncertainty score that is determined by the uncertainty detector and reflects the amount of uncertainty involved in the class prediction for the corresponding element in the segment map. The segmented neural network of the autonomous device is preferably trained for the same object classes as the segmented neural network of the main segmentation system.
Preferably, the autonomous device further comprises signaling means configured to generate a user-perceivable signal in case the uncertainty score map generated by the uncertainty detector of the autonomous device comprises an area consisting of matrix elements exhibiting an uncertainty score above a threshold value and thereby representing an edge case of the object classification and the image segmentation. The edge case may represent an object that is not yet known and is thus a candidate for a new object class.
The autonomous device may determine whether the segmentation implemented by the primary perception system is reliable. In particular, the autonomous device may determine whether the segmentation score map generated by the segmentation neural network of the primary segmentation system contains regions (e.g., objects) representing edge cases for which the primary segmentation system is not trained.
The autonomous device preferably comprises a new object detector operatively connected to the uncertainty detector of the autonomous device. The new object detector is configured to find a region consisting of elements with high uncertainty scores in the uncertainty score map and to generate a new object class for such found region.
The autonomous device may be configured to exchange tagged data sets including segments assigned to recently generated object classes with other autonomous devices, thereby increasing the number of known object classes available for semantic segmentation by the autonomous device. Labels for recently generated object classes may be automatically generated.
The autonomous device may be configured to record a user reaction in response to the warning signal emitted by the autonomous device. The type of reaction may be used as input data for distinguishing between relevant unknown objects and less relevant unknown objects. A salient reaction indicates a high relevance of the unknown object. No reaction indicates that the relevance of the unknown object is low.
The autonomous device may be configured to learn to discern unknown objects by automatically generating new object classes. Moreover, the autonomous device may be configured to learn the relevance level of the respective (new) object class and may thereby adopt a warning signal level for that object class. If the autonomous device determines (by image segmentation) that an object is present in the region of interest of the input image pixel matrix, a warning signal may be generated according to the identified relevance of the object. The new object class may be automatically tagged such that the label represents a level of relevance. For example, the level of relevance may be used as a label for each new object class.
The autonomous device may be configured to exchange data sets comprising data representing the relevance level of known object classes or observed reactions (observed behavior) of the user with other autonomous devices. The observed user reactions can be used to automatically generate labels for new object classes. For exchanging data sets with other autonomous devices, the autonomous device may comprise a data interface, in particular a wireless data interface allowing data exchange, for example, via the Internet.
The autonomous device is preferably a pocket-type mobile device that can be easily carried by a user and can be easily mounted to, for example, a windshield of a vehicle. In use, the autonomous device is preferably mounted in a position wherein the perspective of the autonomous device corresponds at least in part to the perspective of the sensor 12 or sensors of the autonomous driving system.
The autonomous device may comprise a segmented neural network implementing a semantic segmentation model, the segmented neural network being continuously trained, preferably with the output of the new object detector as input, enabling the neural network to predict new objects that have previously been encountered.
The invention will now be further illustrated by way of example with reference to the accompanying drawings. In the figure:
FIG. 1 is a schematic overview of a sensing system that provides output for other systems (e.g., a sensor fusion system);
FIG. 2 is a schematic representation of a neural network suitable for semantic segmentation;
FIG. 3 is a schematic illustration of how an object class is assigned to a pixel in an input image pixel matrix according to a score in a score graph of the respective object class;
FIG. 4 is a schematic overview of a data processing unit for use in the perception system according to FIG. 1 comprising an uncertainty detector according to the present invention;
FIG. 5 illustrates training of a neural network based on an input dataset (i.e., desired output, true values) including images and labels;
FIG. 6 illustrates predictions of a neural network based on an input dataset;
FIG. 7 illustrates an image represented by an input dataset of a semantically segmented neural network;
FIG. 8 illustrates predictions (i.e., semantic segmentation) provided by a neural network for the image of FIG. 5;
FIG. 9 shows regions and sections for the image of FIG. 5 where the pixels have a high amount of uncertainty;
FIG. 10 is a schematic illustration of a system including a perception subsystem, a planning subsystem, and a control subsystem;
FIG. 11 illustrates unknown object rationality checking based on two independent perception subsystems of a vehicle;
FIG. 12 illustrates unknown object rationality checking based on two independent perception subsystems of two vehicles;
FIG. 13 shows the correlation of uncertainty shapes to unknown objects;
FIG. 14 illustrates determination of a section with an unknown object;
FIG. 15 illustrates the source of a domain adaptive dataset;
FIG. 16 illustrates the uploading and integration of domain adaptive datasets;
FIG. 17 illustrates the determination of unknown object classes;
FIG. 18 illustrates training a neural network with a domain adaptive dataset to generate a new model that can handle new object classes;
FIG. 19 illustrates training and downloading of models for neural networks;
FIG. 20 illustrates an autonomous device acting as a security companion;
FIG. 21 illustrates the generation of a training data set for a new object;
FIG. 22 illustrates a training dataset of a new object;
FIG. 23 illustrates sharing of training data sets for new objects;
FIG. 24 illustrates training a segmented neural network with a training dataset of new objects;
FIG. 25 illustrates the identification of false positive object classes;
FIG. 26 illustrates a sensing system with secondary and primary systems for object detection;
FIG. 27 illustrates a transfer learning from a segmented neural network with an unknown structure;
FIG. 28 illustrates training of a variation automatic encoder;
FIG. 29 illustrates uncertainty detection by means of a variational automatic encoder;
FIG. 30 illustrates single sample learning of an uncertainty detector;
FIG. 31 illustrates the use of multiple parallel redundant uncertainty detectors;
FIG. 32 illustrates an alternative apparatus for object recognition;
FIG. 33 illustrates a segmentation model that may be part of an object recognition system;
FIG. 34 illustrates use cases of action recognition;
FIG. 35 illustrates a video action recognition model;
FIG. 36 illustrates a temporal shift module;
FIG. 37 illustrates a depth estimation model;
FIG. 38 illustrates a use case of risk estimation;
FIG. 39 illustrates use cases of action expectations;
FIG. 40 illustrates a motion forecast model;
FIG. 41 illustrates use cases for edge condition identification;
FIG. 42 illustrates a cascade monitoring concept;
FIG. 43 illustrates a situation monitor;
FIG. 44 illustrates an automatic encoder model;
FIG. 45 illustrates a convolution sub-module;
FIG. 46 illustrates a convolution sub-module with random deactivation;
FIG. 47 illustrates a Bayesian sampling module;
FIG. 48 illustrates a Lipschitz sub-module;
FIG. 49 illustrates the integration of a situation monitor with a Kalman filter;
FIG. 50 illustrates use cases of similarity prediction;
FIG. 51 is an overview of a multi-functional model;
FIG. 52 illustrates an encoder of a multi-function model;
FIG. 53 illustrates segmentation by means of a multi-functional model;
FIG. 54 illustrates video action recognition by means of a multi-functional model;
FIG. 55 illustrates depth estimation by means of a multi-functional model;
FIG. 56 illustrates an automatic encoder implemented using a multi-function model;
FIG. 57 illustrates motion forecast by means of a multi-functional model;
FIG. 58 illustrates how the system may be implemented following the perception-planning-action concept; and
Fig. 59 illustrates a preferred sensor configuration.
The perception system 10 as shown in fig. 1 comprises a camera 12 which records images by means of an image sensor (not shown). The recorded images may be still images or a sequence of images forming frames of a video stream. The input data stream may also be from a LiDAR (3D point cloud stream) or from radar.
The camera 12 generates a matrix of image pixels which are fed to a data processing unit 14. The data processing unit 14 implements a segmentation neural network 40 (see fig. 2 and 3) for image semantic segmentation.
The perception system may be integrated in various devices or machines, in particular in a vehicle, for example in an autonomous vehicle (e.g. an automobile).
One aspect is the integration of the perception system with the autonomous vehicle. In such autonomous vehicles, the system implementing the split neural network is the perception system 10. The output of the sensing system 10 is provided to a sensor fusion system 24.
The neural network implemented by the data processing unit 14 is defined by a layer structure comprising nodes and connections between the nodes. In particular, the neural network includes an encoder portion formed of a convolutional layer and a pooling layer. The convolution layer generates an output array called a feature map. The elements of these feature maps (i.e., arrays) represent the activation levels corresponding to certain features in the input image pixel matrix. Features generated by one layer are fed to the next convolutional layer, which generates another feature map corresponding to the more complex features. Finally, the activation level of the feature map corresponds to an object belonging to the class of objects for which the neural network is trained.
The effect of convolution in the convolution layer is achieved by convolving the input array with an array of filter kernels having elements representing weights applied in the convolution process. These weights are generated during neural network training of one or more specific object classes.
Training of the neural network is performed by means of a training dataset comprising image data (input image pixel matrix 66, see fig. 5) and labels 68 indicating the content (truth values) represented by the image data. The image data represents the input image matrix 66 and the label 68 represents the desired output. In the backpropagation process, the weights of the filter kernel arrays are iteratively adapted until the difference between the actual output of the CNN and the desired output is minimized. The difference between the actual output of the CNN (segment map 72) and the desired output (labeled input image matrix 66) is calculated by means of a loss function 70.
From the training dataset containing pairs of input images and truth images composed of correct class labels, neural network 40 calculates class predictions for individual pixels in the output image (i.e., segment map 72). In training, the loss function compares the input class labels 68 to predictions made by the neural network 40 (i.e., labels 68' in the segment map 72) and then pushes parameters of the neural network (i.e., weights in nodes of the layer) in a direction that will result in better predictions. When such training is repeated with a large number of pairs of input images and truth images, the neural network will learn the abstract concepts of a given class; see fig. 5.
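For illustration only, a minimal PyTorch sketch of such a training loop is given below; the use of a cross-entropy loss as loss function 70 and the optimizer settings are assumptions, not taken from the original description.

```python
import torch
import torch.nn as nn

def train(segmentation_net, dataloader, num_epochs=10, lr=1e-3):
    """dataloader yields pairs (image, label): image (B, 3, H, W),
    label (B, H, W) with one class index per pixel (the truth values)."""
    loss_fn = nn.CrossEntropyLoss()                      # assumed loss function 70
    optimizer = torch.optim.Adam(segmentation_net.parameters(), lr=lr)
    for epoch in range(num_epochs):
        for image, label in dataloader:
            prediction = segmentation_net(image)         # (B, num_classes, H, W)
            loss = loss_fn(prediction, label)            # compare with desired output
            optimizer.zero_grad()
            loss.backward()                              # backpropagation
            optimizer.step()                             # adapt the weights
    return segmentation_net
```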
Thus, a trained neural network is defined by the topology of its layers (i.e., the structure of the neural network), the activation function of the nodes through the neural network, and the weights through the filter kernel array and the potential weights in summing the nodes of layers such as fully connected layers (using fully connected layers in a classifier).
The topology and activation functions of the neural network (and thus the structure of the neural network) are defined by a structure dataset.
Weights representing the particular model for which the neural network is trained are stored in the model dataset. Of course, the model dataset must fit the structure of the neural network defined by the structure dataset. At least the model data (in particular the weights determined during training) is stored in a file called a "checkpoint".
The model dataset and the structure dataset are stored in a memory 16 that is part of the data processing unit 14 or accessible by the data processing unit 14.
The data processing unit 14 is further connected to a data communication interface 18 for exchanging data controlling the behaviour of the neural network, e.g. model data stored in a model dataset or training data stored in a training dataset for training the neural network.
To visualize the segmentation provided by the neural network, a display unit 20 may be connected to the data processing unit 14. However, in an autonomous vehicle, the segmentation will not be displayed on the display unit, but will be post-processed into a list of objects, which is then provided as input to a sensor fusion system, e.g. a kalman filter.
The location of the non-existing or existing information (if any) image representation of the object in the input image pixel matrix may be obtained from the segmented input image pixel matrix. This information may be encoded in data and may be used for control or planning purposes of other system components. Thus, an output data interface 22 is provided that is configured to communicate data indicative of the absence or presence and location of an object identified in the input image pixel matrix. The sensor fusion system 24 may be connected to the output data interface 22. The sensor fusion system 24 is typically connected to other sensing systems not shown in the figures.
The sensor fusion system 24 receives input from various sensing systems 10, each within or associated with a particular sensor (e.g., camera, radar or lidar). The sensor fusion system 24 is implemented by a Bayesian filter (e.g., an extended Kalman filter) that processes the input of the perception system 10 as a measurement. For an extended Kalman filter, these inputs are not segment maps but a list of post-processed objects obtained from the segment maps. The extended Kalman filter associates individual measurement results with measurement uncertainty values (e.g., sensor uncertainty values) configured according to a model of the particular sensor, which are typically statically configured. In an extended Kalman filter, multiple sources of such measurement uncertainty are added and normalized. Here, by adding the model uncertainty to the measurement, the uncertainty score generated by the uncertainty detector 60 (see fig. 4) can be easily integrated. This is how the uncertainty detector is integrated with an autonomous vehicle.
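For illustration only, the following sketch shows one possible way of folding the uncertainty score into the measurement update of a (linearized) Kalman filter by inflating the statically calibrated sensor noise; the function and the linear scaling with the factor alpha are assumptions, not the method prescribed by the original description.

```python
import numpy as np

def kalman_update(x, P, z, H, R_sensor, uncertainty_score, alpha=4.0):
    """Single measurement update where the statically calibrated sensor noise
    R_sensor is inflated by the model uncertainty reported by the uncertainty
    detector (score assumed to lie in [0, 1]); alpha is an assumed tuning factor."""
    # Add model uncertainty to the measurement: higher score -> noisier measurement.
    R = R_sensor * (1.0 + alpha * uncertainty_score)

    y = z - H @ x                        # innovation
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new
```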
Image segmentation
The trained segmented neural network 40 has an encoder-decoder structure as schematically shown in fig. 2. The encoder section 42 is a fully convolutional network (FCN), such as ResNet-101. Alternative implementations may be VGG-16 or VGG-19.
The decoder portion 44 may be implemented as a DeepLab ASPP (atrous spatial pyramid pooling) decoder.
The segmented neural network 40 is a fully convolutional network consisting of convolution layers 46, pooling layers 48, feature maps 50, score maps 52 (i.e., upsampled feature maps), and a segment map 54.
In the example of fig. 2, the segmented neural network 40 is configured and trained for three object classes. Thus, at each level, three convolution layers 46 are provided, each of which generates a feature map for one object class. Each convolution layer implements a ReLU activation function in its nodes. The pooling layer 48 reduces the size of the feature maps to generate the input for the next convolution layer 46. The feature maps 50 of the trained segmented neural network each reflect the likelihood that an object of an object class is represented by the corresponding elements in the input image pixel matrix. Typically, a feature map has a smaller size than the input image pixel matrix. In the decoder portion 44 of the segmented neural network, the feature maps are upsampled and upsampled feature maps 52 (also referred to herein as score maps) are generated, wherein the elements have a score value reflecting the likelihood that a certain pixel represents an object of one of the object classes for which the segmented neural network is trained. To segment the input image pixel matrix, each element (i.e., each pixel) is assigned to one of the object classes for which the segmented neural network is trained. This may be done by using an argmax function (maximum likelihood estimator), which compares the scores of the corresponding elements in the feature score maps of the respective object classes. A pixel is assigned to the object class that shows the highest score for that pixel or element, respectively. Thus, one segment map 54 is generated from the three feature score maps 52.
According to a preferred embodiment, the applied ReLU activation function is a custom ReLU function that allows the random inactivation (dropout) of incoming nodes to be introduced, so that the segmented neural network 40 can act as a Bayesian neural network.
Fig. 3 illustrates that feature maps 50 and 52 of respective object classes A, B or C function in parallel (rather than in series as may be implied by fig. 2). For example, if the neural network is trained for three objects, three score graphs are generated. The scores in the individual feature score graphs are normalized on a scale between 0 and 1 by means of a SoftMax function. Normalized scores of corresponding elements of the feature graphs may be compared and the element may be assigned to the object class corresponding to the feature score graph having the highest score of the corresponding element. This is done by means of a maximum likelihood estimator using an argmax function.
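A minimal sketch of this normalization and assignment, assuming the per-class score maps are stacked into a single array, could look as follows:

```python
import numpy as np

def segment(score_maps):
    """score_maps: array of shape (num_classes, H, W) with raw per-class scores.
    Returns the segment map (H, W) with one class index per pixel and the
    normalized per-class scores."""
    # SoftMax over the class dimension normalizes the scores to [0, 1] per pixel.
    exp = np.exp(score_maps - score_maps.max(axis=0, keepdims=True))
    probs = exp / exp.sum(axis=0, keepdims=True)
    # argmax acts as maximum likelihood estimator: each pixel is assigned to the
    # object class whose score map has the highest normalized score for that pixel.
    segment_map = probs.argmax(axis=0)
    return segment_map, probs
```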
In fig. 3, section 56 is a section where the activation level of an array element is higher than the activation level of the corresponding array element in the score maps of object classes A and B. Section 58 is a section where the activation level of an array element is higher than the activation level of the corresponding array element in the score maps of object classes B and C. Note that each pixel in the segment map 54 is assigned to one of all the object classes for which the segmented neural network 40 is trained. Thus, there are no unassigned elements or pixels. Fig. 3 may be misleading in this respect, as it does not show that all pixels are assigned to a known object class and are thus part of a section.
Fig. 6 shows the generation of a section map 72 from the input image 66 by means of predictions performed by the segmented neural network 40.
The segmented neural network 40 is part of the data processing unit 14.
Fig. 7 and 8 illustrate the assignment of all pixels in an input image to known objects. Where the input image 66' (and thus the input image pixel matrix 66') includes unknown objects such as a box 90 on the street, these unknown objects 90 are assigned to known objects such as a street sign 92.
Uncertainty detection
For planning and control purposes, it is helpful to have an indication of the reliability of the data provided by the output data interface 22.
It is therefore an object to determine the level (or amount) of uncertainty involved in the data (i.e. in the semantic image segmentation provided by the neural network).
Another object is to identify image segments or parts of image segments that cannot be reliably assigned to an object. According to one concept of the invention, image areas that cannot be reliably assigned to an object can be identified by means of the determined uncertainty. Uncertainty here refers to model uncertainty according to Bayesian statistics; see Kendall and Gal, https://arxiv.org/abs/1703.04977, and one of their sources, https://www.repository.cam.ac.uk/bitstream/handle/1810/248538/Ghahramani%202015%20Nature.pdf;jsessionid=0492F830B1DAA2A6A1AE004766B0E064sequence=1.
In a first step, an uncertainty score is created or generated at the pixel level (per pixel) of the input image pixel matrix. This may be achieved by inducing and/or analyzing some form of variance in the activation levels (scores) of the individual pixels with respect to the object classes.
If frames of the video sequence are considered or if semantic segmentation is repeatedly performed in multiple passes in which the changed nodes are deactivated, the level of activation (i.e., score) of elements in the feature score graph of the individual object class for which the neural network is trained may change. The variance may be temporal (i.e., between different passes or between different frames of the video sequence) or spatial (i.e., within the image pixel matrix, e.g., at the edges of the segments).
The variance may be induced by setting activations to zero with a random probability. The random inactivation (dropout) operation is inserted into the layer to which it is to be applied, so that the internal architecture of the layer looks like convolution/random inactivation/nonlinearity.
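A minimal PyTorch sketch of such a layer layout (convolution/random inactivation/nonlinearity) is given below; the dropout probability and the use of channel-wise dropout are assumptions for illustration.

```python
import torch.nn as nn

class ConvDropoutReLU(nn.Module):
    """Convolution followed by random inactivation (dropout) and a ReLU,
    mirroring the layer architecture convolution/random inactivation/nonlinearity."""
    def __init__(self, in_ch, out_ch, p=0.2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.drop = nn.Dropout2d(p)   # sets activations to zero with probability p
        self.relu = nn.ReLU()

    def forward(self, x):
        # The dropout layer is kept active at prediction time as well, so that
        # repeated passes over the same input yield inter-sample variance.
        return self.relu(self.drop(self.conv(x)))
```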
Specifically, the amount of uncertainty can be determined and quantified by analyzing the amount of variance of the activation level (i.e., the score in the feature score map) over multiple samples. In bayesian neural networks (i.e., convolutional neural networks that are randomly modified, for example, by randomly inactivating nodes), an input image matrix is processed in multiple passes, with some nodes of the convolutional neural network being randomly inactivated for each pass. This technique is known as variational reasoning. Because of the deactivation of the nodes in some passes, the level of activation varies from pass to pass, resulting in an inter-sample variance per pixel. The higher the variance, the higher the amount of uncertainty and therefore the higher the uncertainty score.
The inter-sample variance of the activation level may also be caused by a sequence of pictures using the same scene taken at different points in time. Such a sequence may be a frame of a video sequence recorded by a video camera of a moving vehicle.
If the neural network is not a Bayesian neural network, the per-pixel uncertainty can be determined by determining an amount of inter-class variance or variance pattern (spatial variance) of the activation level.
The uncertainty can also be determined via the reconstruction error of the generated neural network, for example by means of a variational automatic encoder.
An unknown object (i.e., an object for which the neural network is not trained or an object for which the neural network is not identified) results in a high level of uncertainty and thus in a high uncertainty score for pixels representing a segment of the unknown object.
The uncertainty map represents the uncertainty score of each pixel. An uncertainty map is created by analyzing the activation levels of the matrix elements (corresponding to image pixels) in the feature score maps of the different object classes for which the segmented neural network is trained. When Monte Carlo random inactivation is applied to the segmented neural network, thereby producing variance between the samples generated in each pass (inter-sample variance), an uncertainty score map may be created by analyzing the variance of the activation levels over multiple passes. To track the variance over time and generate the uncertainty map, an uncertainty detector may be provided.
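The following sketch, which assumes that the segmented neural network keeps its random inactivation active at prediction time, illustrates how several passes can be aggregated into a per-pixel uncertainty score map; the number of samples and the averaging over classes are assumptions.

```python
import torch

def uncertainty_map(model, image, num_samples=10):
    """image: tensor (1, C, H, W).  The model is assumed to return per-class
    score maps of shape (1, num_classes, H, W) and to keep its random
    inactivation (dropout) layers active in Bayesian mode."""
    model.train()  # keeps the dropout layers active during prediction
    with torch.no_grad():
        samples = torch.stack(
            [torch.softmax(model(image), dim=1) for _ in range(num_samples)]
        )                                  # (num_samples, 1, num_classes, H, W)
    # Inter-sample variance per pixel and per object class ...
    var_per_class = samples.var(dim=0)     # (1, num_classes, H, W)
    # ... aggregated into one uncertainty score per pixel (here: mean over classes).
    return var_per_class.mean(dim=1)       # (1, H, W)
```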
Thus, if the unknown object is represented by pixels in the input image pixel matrix, these pixels, and thus the unknown object represented by these pixels, can be "found" by determining an uncertainty score. This may be used to assign segments in the input image pixel matrix to unknown objects and even further discern different unknown objects in such segments in the input image pixel matrix.
The basic steps performed by the system according to the invention are:
-generating semantically segmented image pixel matrix samples using a segmented neural network;
-analyzing inter-class variances and/or inter-frame variances in the per-pixel activation levels of the individual object classes for which the neural network is trained, i.e. analyzing the per-pixel scores of the individual object classes;
-determining uncertainty by assessing an uncertainty level or an amount of uncertainty based on an analysis of inter-class variances and/or inter-frame variances in per-pixel activation levels in the semantically segmented image pixel matrix; and
-identifying a section showing the unknown object according to the determined uncertainty.
Preferably, the system further:
-discriminating between different unknown objects by analyzing the uncertainty; and optionally,
-creating a new object class based on an analysis of the uncertainties of pixels in the section determined to display the unknown object.
To perform uncertainty detection, the data processing unit 14 includes an uncertainty detector 60, see fig. 4.
The uncertainty detector 60 may be configured to determine the uncertainty from the feature score map 52 provided by the segmented neural network 40 by analyzing the inter-class variance of the values of the activation level prior to applying the maximum likelihood estimator (i.e., prior to applying the argmax function and generating the segment map 54). Before the maximum likelihood estimator (i.e., before applying the argmax function and generating the segment map 54), each pixel has an activation level of each known class. Pixels are typically assigned to the class with the highest activation level by the argmax function. The inter-class variance of each pixel may be determined before the argmax function is applied.
Alternatively or additionally, the uncertainty detector 60 may include a generating neural network 62 that implements a generating model that reproduces the input image based on the same training data set as the segmented neural network 40. The generating neural network 62 is preferably a variational automatic encoder. The per-pixel reconstruction loss between the input image and the image reconstructed by the generating neural network 62 corresponds to an uncertainty. Higher reconstruction losses reflect a higher amount of uncertainty.
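A minimal sketch of this reconstruction-based measure is given below; it assumes that the generating neural network 62 returns a reconstructed image of the same size as the input and uses a per-pixel squared error as reconstruction loss.

```python
import torch

def reconstruction_uncertainty(vae, image):
    """image: tensor (1, 3, H, W), normalized like the training data.
    Returns a per-pixel uncertainty score map (1, H, W): the higher the
    reconstruction loss, the higher the uncertainty."""
    vae.eval()
    with torch.no_grad():
        reconstruction = vae(image)   # assumed to return a tensor (1, 3, H, W)
    # Per-pixel squared reconstruction error, averaged over the color channels.
    per_pixel_loss = ((image - reconstruction) ** 2).mean(dim=1)
    return per_pixel_loss
```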
Preferably, the uncertainty detector 60 is configured to implement a Bayesian neural network as the means for generating the inter-sample variance and thereby the uncertainty scores. In this embodiment, the uncertainty detector 60 is preferably configured to apply a custom ReLU function that allows random inactivation of incoming nodes, so that the segmented neural network 40 can act as a Bayesian neural network. Thus, when the segmented neural network 40 is varied (e.g., by random inactivation of nodes) or noise is inserted into the weights or activation functions, the uncertainty can be determined by means of several repeated predictions (samples). The activation levels in the score maps over all samples (after the softmax function but before the maximum likelihood estimator (argmax)) will have a high inter-sample variance for the parts of the input that do not correspond to the training data, and a low variance for the parts of the input that correspond well to the training data. The tensor library 64 includes the custom ReLU that is used to make the segmented neural network 40 a Bayesian neural network.
In use, data processing unit 14 receives as input a matrix 66 of input image pixels. The data processing unit 14 generates as output a section map 72 and an uncertainty score map 74.
Fig. 9 illustrates the uncertainty score map 74 generated from the input image shown in fig. 7. Fig. 8 illustrates the segment map generated from the input image shown in fig. 7. As can be seen from fig. 9, the unknown object 90 (i.e., the box on the street) is represented by a segment in which the pixels exhibit a high amount of uncertainty and thus have a high uncertainty score, reflected by the lighter color in fig. 9. In the segment map 72 as shown in fig. 8, the pixels representing the unknown object (the box on the street) are assigned to various known objects. While this is correct in terms of the function of the segmented neural network, the segmentation result is erroneous from the user's point of view with respect to the unknown object involved. Such "false" assignments may be detected by means of the uncertainty scores of the pixels in a segment (prior to assigning the pixels to a known object class, e.g., prior to applying the function used for classifying the individual pixels).
Detailed description of the variance amount determination
The amount of variance may be determined by inducing and/or analyzing some form of variance in the activation levels (scores) of individual elements of the feature score maps of the object classes for which the segmented neural network is trained.
The amount of variance between different samples of a feature score graph for an object class may be determined to determine the inter-sample variance.
The amount of variance between feature score graphs for different object classes may be determined, thereby determining the inter-class variance.
These methods may be combined.
Detailed description of the determination of the uncertainty amount (uncertainty score) based on the variance, as disclosed above, for the corresponding elements of the score maps (i.e., the activation level matrices of the respective object classes)
The uncertainty detector may be configured to determine the prediction uncertainty in various ways. The per-pixel uncertainty score indicative of the prediction uncertainty may be determined, for example, by:
-comparing the relative activation levels of the respective pixels in the feature score maps, using the values of the activation levels before the maximum likelihood estimator; the uncertainty detector may, for example, apply a class-agnostic threshold to the activation levels, a class-specific threshold to the activation levels, or a threshold on the difference between the activation levels of the classes (e.g., the first 2 activation levels), wherein a higher difference indicates a lower uncertainty, and vice versa.
-rendering the input image by means of a generative model (e.g., a variational automatic encoder) trained on the same training dataset, and exploiting the fact that the variational automatic encoder fails to render unknown object classes. By calculating the reconstruction loss in prediction mode in the same way as in training mode, a pixel-by-pixel (per-pixel) uncertainty score is determined, wherein a higher reconstruction loss indicates a higher uncertainty.
-by implementing a Bayesian neural network by means of variational inference, for example by applying Monte Carlo random inactivation in the training mode and the prediction mode of the segmented neural network, in combination with sampling in the prediction mode, wherein the uncertainty detector exploits the fact that an unknown object class (i.e., an unknown object in the input image pixel matrix) will result in a higher pixel-by-pixel inter-sample variance between the samples, and a higher variance indicates a higher uncertainty, and vice versa; or, for example, by applying Gaussian noise to the weight values or activation functions, with the same effect as Monte Carlo random inactivation.
Detailed description of how to assign uncertainty amounts (uncertainty levels) to individual pixels in an input image pixel matrix
The uncertainty detector may use, for example, the values of the pixel-by-pixel activation levels in the feature score maps, the values of the pixel-by-pixel reconstruction loss of the variational automatic encoder, or the values of the pixel-by-pixel variance in the variational inference, or a combination of these, for determining the uncertainty score of an element in the uncertainty score map. A lower activation level indicates a higher uncertainty score. A lower variance between the activation levels of the first 2 classes or a higher variance across all object classes indicates a higher uncertainty score. A higher variance between samples in the variational inference indicates a higher uncertainty score. In the case of variational inference, with a per-pixel variance per object class, the uncertainty detector aggregates these variances into one variance per pixel, e.g. by summing or averaging.
Detailed description of determining a section representing an unknown object (a section in the input image pixel matrix) based on the uncertainty scores of the individual pixels
Determining segments that represent objects of unknown object classes is performed in order to determine whether the perception system 10 is operating within its operating domain or outside of it. The domain is defined by the object classes for which the segmented neural network 40 is trained.
Thus, determining sections that represent objects of unknown object classes is part of the out-of-domain detection. In fig. 10 to 13, the out-of-domain detection is simply labeled as domain detection. The detection of sections representing objects of not yet known object classes may be used to create a model representing the new object classes and can thus be used for domain adaptation.
Just like the prediction of a segmented neural network (i.e., a feature score map reflecting the activation levels of the elements of the respective object classes and a segment map derived therefrom), the uncertainty obtained from the variance or activation levels is a pixel image, so that for each pixel in the segment map, exactly one corresponding uncertainty score is found in the uncertainty score map.
If the uncertainty detector comprises a Bayesian neural network as the means for detecting inter-sample uncertainty and thereby for generating the uncertainty scores of the pixels or elements, respectively, aleatoric uncertainty (uncertainty that cannot be eliminated by the model) will show up at the edges of the segments (see fig. 9, the light "frames" surrounding the segments). The origin of this aleatoric uncertainty lies in the fact that, in the tagging process, the segments in the training dataset have been tagged manually or semi-automatically by persons who label object edges sometimes narrower and sometimes wider. The neural network will learn the diversity of such tagging patterns and predict segment edges that are sometimes narrower and sometimes wider. The same happens when samples for variational inference are taken in the prediction, gradually leading to uncertainty at the edges of the segments. The uncertainty detector is preferably configured to match the regions in the uncertainty map in which aleatoric uncertainty occurs with the edges of the predicted segments in the segment map by means of standard computer vision techniques for edge detection, corner detection or region detection (also known as area detection). If the aleatoric uncertainty matches the edges of a predicted segment, a correct prediction of the segment is indicated. If there is no aleatoric uncertainty at the edges of a predicted segment in the segment map, or if there is aleatoric uncertainty within a segment, an incorrect prediction is indicated.
If the uncertainty detector comprises a Bayesian neural network as the means for detecting inter-sample uncertainty and thereby for generating the uncertainty scores of the pixels or elements, respectively, epistemic (cognitive) uncertainty will show up inside the segments (see fig. 9, the light colored segment 90). The other methods for detecting uncertainty will likewise show uncertainty inside the segments. The epistemic uncertainty directly indicates incorrect segment predictions, i.e., incorrect segmentations in the segment map. The uncertainty detector is configured to determine the uncertain regions, for example by means of standard computer vision techniques for edge detection, corner detection or region detection (also known as area detection). Apart from the fact that the uncertainty is indicated by pixels in the uncertainty score map that cluster at the edges of segments in the segment map or within uncertain segments in the segment map, the uncertainty values of these per-pixel uncertainties also vary within a segment. Such uncertainty may extend into other segments in the segment map and corresponding regions in the uncertainty score map where the segment prediction is actually correct (false positives), or may not completely fill a segment where the segment prediction is actually incorrect (false negatives). To capture these false positives and false negatives of the uncertainty score itself, the uncertainty detector may include a small classification neural network (i.e., a classifier) that is trained with uncertainty score maps in combination with a labeled training dataset. For this training, object classes are used that do not belong to the original training domain of the original segmented neural network (as defined by the object classes for which the segmented neural network was originally trained). By means of the classifier, the uncertainty detector optimizes the matching of the determined uncertainty regions in the uncertainty score map, whose elements have a high uncertainty score, with the true shape of the incorrectly predicted segments in the segment map. In this way, the uncertainty score map is partitioned in correspondence with the segment map. The uncertainty detector rationalizes (plausibilizes) the regions determined by the classifier against the regions determined by means of the standard computer vision techniques.
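The following OpenCV-based sketch illustrates, under assumed thresholds, how high-uncertainty pixels can be split into those lying on predicted segment edges (expected, aleatoric) and those forming regions inside segments (candidates for unknown objects); the thresholds and the morphological edge extraction are assumptions, and the classifier mentioned above is not included.

```python
import cv2
import numpy as np

def split_uncertainty_regions(uncertainty_map, segment_map,
                              unc_threshold=0.5, edge_width=3):
    """uncertainty_map: float array (H, W); segment_map: int array (H, W),
    assuming fewer than 256 object classes."""
    high_unc = uncertainty_map > unc_threshold                     # bool (H, W)

    # Edges of the predicted segments via a morphological gradient on the class ids.
    kernel = np.ones((edge_width, edge_width), np.uint8)
    seg8 = segment_map.astype(np.uint8)
    edges = cv2.morphologyEx(seg8, cv2.MORPH_GRADIENT, kernel) > 0

    edge_uncertainty = high_unc & edges        # expected at segment borders
    interior_uncertainty = high_unc & ~edges   # indicates incorrectly predicted segments

    # Connected regions of interior uncertainty are candidates for unknown objects.
    num, labels = cv2.connectedComponents(interior_uncertainty.astype(np.uint8))
    return edge_uncertainty, interior_uncertainty, num - 1, labels
```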
Fig. 14 illustrates predictions through a neural network and measurement of prediction uncertainty. From the input image on the left, the split neural network calculates class predictions for each pixel on the right, and the uncertainty detector calculates uncertainty value predictions for each pixel. FIG. 13 shows the correlation of an uncertainty shape to an unknown object.
The uncertainty detector may be implemented in different ways.
For Bayesian neural networks, where variation is introduced into the segmented neural network (e.g., by random inactivation of nodes) or noise is inserted into the weights or activation functions, the prediction uncertainty may be determined by means of several repeated predictions (samples). The activation levels in the score maps over all samples (after the softmax function but before the maximum likelihood estimator (argmax)) will have a high inter-sample variance for the parts of the input that do not correspond to the training data, and a low variance for the parts of the input that correspond well to the training data. This embodiment requires that the uncertainty detector interacts with the segmented neural network and randomly modifies it, in particular at least some nodes of the convolutional layers of the segmented neural network. The uncertainty detector may be configured, for example, to modify at least some of the ReLU activation functions of the convolutional layers, resulting in the deactivation of nodes, or to modify the weight values in some nodes.
For non-Bayesian neural networks, the uncertainty value may be determined from only one sample by determining the inter-class variance between the softmax scores of the score maps (each pixel has a softmax score in each score map). The inter-class variance may be determined, for example, as the difference between the first 2 softmax scores, i.e., the difference between the largest softmax score and the second-largest softmax score, or as the variance between the softmax scores across all object classes. Instead of the variance, a threshold may be applied. This embodiment does not require the uncertainty detector to modify the segmented neural network.
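A minimal sketch of this single-sample variant, using the difference between the two largest softmax scores (a small difference meaning high uncertainty), could look as follows:

```python
import numpy as np

def top2_margin_uncertainty(probs):
    """probs: softmax scores of shape (num_classes, H, W) from a single pass.
    Returns a per-pixel uncertainty score in [0, 1]."""
    sorted_probs = np.sort(probs, axis=0)          # ascending along the class axis
    margin = sorted_probs[-1] - sorted_probs[-2]   # largest minus second-largest score
    return 1.0 - margin                            # small margin -> high uncertainty
```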
As a third alternative embodiment, the uncertainty detector may include generating a neural network (e.g., a variational automatic encoder) to recreate the input image and measure the uncertainty in exactly the same way as described above or by determining a reconstruction loss. The generating neural network (e.g., a variational automatic encoder) implements the same model as the segmenting neural network (i.e., is trained for the same object or class).
Preferably, the uncertainty detector is configured to implement a Bayesian neural network as the means for generating the inter-sample variance. However, the sampling process is time-consuming, since a statistically relevant number of samples is required for a reliable result. Instead of sampling each instance of the data sequence received from the sensor many times, as in the simplest case, the uncertainty detector is preferably configured to calculate the inter-sample variance over subsequent instances (e.g., subsequent frames of a video sequence) while sampling each instance only once or a few times. This is possible if the instances are input image pixel matrices corresponding to frames of a video sequence recorded by the vehicle, since the pixels of these instances correspond to each other up to a variable amount of shift, rotation and scaling due to the movement of the vehicle. The uncertainty detector is configured to match individual pixels or pixel regions of one instance with the corresponding pixels or pixel regions of a subsequent instance in order to determine the inter-sample variance between the feature scores of the pixels.
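For illustration only, the following sketch aligns the score maps of two consecutive frames by dense optical flow and computes the inter-sample variance between them; the use of Farneback optical flow, backward warping and a two-frame window are assumptions, not the alignment method prescribed by the original description.

```python
import cv2
import numpy as np

def interframe_uncertainty(prev_frame, curr_frame, prev_probs, curr_probs):
    """prev_probs/curr_probs: softmax score maps (num_classes, H, W) of two
    consecutive frames; prev_frame/curr_frame: the corresponding BGR images."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    # Dense optical flow from the current frame back to the previous frame.
    flow = cv2.calcOpticalFlowFarneback(curr_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = curr_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    # Align the previous score maps with the current frame (backward warping).
    aligned_prev = np.stack(
        [cv2.remap(p.astype(np.float32), map_x, map_y, cv2.INTER_LINEAR)
         for p in prev_probs])
    # Inter-sample variance between the aligned previous and the current scores,
    # aggregated over the object classes into one score per pixel.
    return np.var(np.stack([aligned_prev, curr_probs]), axis=0).mean(axis=0)
```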
Regions in the uncertainty score map whose elements exhibit a high uncertainty score generated by the uncertainty detector are candidates for segments representing an unknown object, i.e., an object other than the one assigned to that segment in the segment map, and therefore an object not belonging to the object classes for which the segmented neural network is trained.
Detailed description of the plausibility check confirming that a section determined to represent an unknown object does represent an unknown object
The uncertainty detector is preferably configured to further rationalize segments determined to represent unknown objects by means of communication with other vehicles in the vicinity of the scene. The uncertainty detector maps the scene to a global coordinate system, such as the coordinate system of an HD map used by the vehicle for localization. In this way, the position in the global coordinate system of an object corresponding to a segment in the segment map, whether a known or an unknown object, is determined.
If segment maps generated from input image pixel matrices recorded from different locations are compared, segments that are candidates for representing unknown objects may be compared. The matrix of input image pixels from different locations may originate from, for example, two different cameras of one vehicle or from two different cameras of two different vehicles, see fig. 11, 12 and 13.
The uncertainty detector of the first vehicle sends the coordinates of the unknown object to other vehicles in the vicinity of the scene, so that another uncertainty detector within a receiving vehicle can match the received segment with the corresponding segment of the first vehicle. A match increases the probability that the object indeed belongs to an unknown object class. A missing match reduces this probability. For the case where the vehicle uses the uncertainty detector output for initiating a minimum risk maneuver, the uncertainty detector uses this rationalization approach only insofar as the entire process of vehicle-to-vehicle rationalization takes no longer than the fault-tolerant time interval, so that the result does not arrive too late. For the case of recording potentially unknown objects in a dataset for domain adaptation, this rationalization approach can always be applied. In order to determine the fault-tolerant time interval, the 3D position of the potentially unknown object is required, which is determined by the uncertainty detector, for example by means of monocular depth estimation or by inference from motion.
As a second method of vehicle-to-vehicle rationalization, the uncertainty detector within one vehicle sends the information identifying a segment of an unknown object to other vehicles, which do not have to be in the vicinity of the scene. The uncertainty detector in a receiving vehicle stitches (patches) the received segment into an uncorrelated scene observed by the other vehicle and calculates the uncertainty. If the segment has a high uncertainty in the uncorrelated scene, the uncertainty detector increases the probability value indicating the probability that the object indeed belongs to an unknown object class. If the identified segment has a low uncertainty in the uncorrelated scene, the uncertainty detector decreases this probability value and instead assigns the object class determined with low uncertainty to the object. In case the computation takes too much time, the uncertainty detector performs the computation locally within only one vehicle, and the uncorrelated scene is taken from an uncorrelated dataset, allowing further rationalization with respect to the labels provided with that dataset.
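The data structure and probability update below are purely illustrative assumptions of how the outcome of such vehicle-to-vehicle rationalization could be tracked; neither the field names nor the update factor are taken from the original description.

```python
from dataclasses import dataclass

@dataclass
class UnknownObjectReport:
    object_id: str
    position_xyz: tuple       # 3D position in the global (HD map) coordinate system
    uncertainty_score: float  # aggregated uncertainty score of the segment
    probability: float        # current probability of an unknown object class

def update_probability(report, confirmed_by_other_vehicle, factor=0.2):
    """Increase the probability when another vehicle confirms high uncertainty for
    the same segment or position, decrease it when the confirmation is missing."""
    if confirmed_by_other_vehicle:
        report.probability = min(1.0, report.probability + factor)
    else:
        report.probability = max(0.0, report.probability - factor)
    return report
```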
Detailed description of automatically generating a new object class based on the analysis of segments representing not yet known objects
The uncertainty detector will preferably create a dataset of new object classes from the unknown objects identified by the uncertainty detector. The dataset will include sensor input data (i.e., a matrix of input image pixels corresponding to an unknown object) along with a matrix of rationalized uncertainty pixels (in the case of video input) or points (in the case of lidar input) as labels, see fig. 14. However, a set of real object classes needs to be obtained from all instances of objects that are not yet known, along with their labels created by an uncertainty detector that identifies the locations of these objects that are not yet known, but not their real object classes. Preferably, the uncertainty detector is configured to group instances of objects that are not yet known into candidate object classes by means of unsupervised segmentation. Preferably, the uncertainty detector is further configured to determine possible candidate object classes by means of single sample learning in which the visual input is mapped to feature vectors, wherein each feature has an inherent meaning and an inherent relation to other features, and the features are for example descriptions in natural language; see fig. 18.
Detailed description of a vehicle control system including a video camera, a semantic segmentation system connected to the video camera, and a vehicle-to-vehicle (V2V) communication device allowing vehicle-to-vehicle communication to exchange object class definitions/representations
The identification of not yet known objects (out-of-domain detection) will be performed by an uncertainty detector on the device (e.g., a sensor that integrates the present technology with a neural network performing computer vision tasks). There, the uncertainty detector will record the sensor input data corresponding to the unknown object, together with the rationalized uncertainty pixel matrix or points as labels. For vehicle-to-vehicle rationalization, the device implementing the uncertainty detector requires vehicle-to-vehicle connectivity provided by other on-board systems. The creation of the dataset of the new object class may be performed by the uncertainty detector on the device or in the cloud. For the latter, the device requires vehicle-to-infrastructure communication, see fig. 13.
Detailed description of Domain adaptation
Domain adaptation is used to extend the operating domain of the perception system. Domain adaptation preferably includes detection outside the domain, for example, by identifying objects belonging to object classes that are not yet known. Domain adaptation preferably includes adaptation or creation of a segmented neural network, whereby objects of one or more new object classes can be predicted by the segmented neural network.
Fig. 15 to 19 illustrate a process of domain adaptation, in particular enabling a separating neural network to predict objects of a new object class. Fig. 18 and 19 illustrate continuous domain adaptation.
Domain adaptation may be performed by an uncertainty detector on the device or in the cloud, see fig. 16 and 17. This can be done by means of various well known techniques, in the simplest case by means of retraining the trained segmented neural network with the most recently determined object class from the dataset of the new object class. In this case, retraining of the segmented neural network means updating the file containing the weight values in the persistent memory. During retraining, the file is updated and the updated file may be loaded by the device, for example, in the next on-off cycle of the device. If retraining is performed in the cloud, the updated file is downloaded from the cloud to the device's persistent memory through the device's software update function.
An operational design domain is a domain in which a system (particularly a perception subsystem) can operate reliably. The operational design domain is defined by, among other things, classes of objects that the perception subsystem can discern.
Domain adaptation means that, for example, the sensing subsystem is recently configured or updated to be able to discern other object classes that are present in the environment in which the sensing subsystem is used.
The data configuring the neural network (i.e., the data defining the weights and activation functions of the nodes and filter kernel arrays in the layer) defines the model represented by the neural network. Thus, "downloading a model" means downloading configuration data for a neural network in order to define a new model, e.g., a model that is able to discern more object classes.
Updating a model with configuration data from another model
When a program instantiates a neural network, the neural network is typically uninitialized. There are then two modes of operation. When the neural network is to be trained, the program will instantiate the neural network model from a software library, the weights of the neural network model will be initialized with random values, and the training process will gradually adapt the values of these weights. When training is completed, the weights of the neural network are saved in a file corresponding to the architecture (structure, topology) of the neural network and to a generic storage file format, which depends on the software library used to implement the neural network, such as TensorFlow (Google), PyTorch (Facebook), Apache MXNet, or the ONNX format (Open Neural Network Exchange). The term used in this document is "checkpoint". The checkpoint file represents the model for which the neural network is trained. When the neural network is then to be used for prediction, the program will again instantiate the neural network model from the software library, but for prediction the weights of the neural network model will not be initialized with random values, but with the stored weight values from the checkpoint file. The neural network can then be used for prediction. The checkpoint file always contains the data that is loaded by the program at the beginning of each on-off cycle of the system.
The checkpoint file includes data configuring the neural network (i.e., data defining the weights and activation functions of the nodes and filter kernel arrays in the layer) that defines the model represented by the neural network. Thus, "downloading a model" means downloading a checkpoint file that contains configuration data for the neural network in order to define a new model (e.g., a model that is able to discern more object classes).
When the neural network is adaptive, the program instantiates a neural network model, which is then initialized not with random data for training, but with weight values from the checkpoint file. The training process is then started using the new domain adaptive dataset. Thus, the training is not started from the beginning, but from a state before the training. There are two modes of adaptation. If the new object class from the domain adaptation dataset is not similar to the object class already known to the segmented neural network, then space must be provided for the new object class to learn, which means that the architecture of the neural network must be changed in order to provide the last layer with as many additional feature score graphs (and thus additional convolution kernels with their weights) as necessary to learn the new object class. In this case, only weights that were previously present in the architecture can be initialized from the checkpoints, and the new convolution kernel will be initialized with random data. On the other hand, if the new object classes are very similar to the known object classes to which they are expected to be generalized, then no change in the architecture of the neural network is required. For training, in both cases, the values of the weights of most layers in the neural network will be frozen (i.e., not modified), and only the last few layers will be trained for adaptation. To update the system with the new segmented neural network, a new checkpoint file is provided. In most cases, no updates to the software libraries will be required, as most libraries can instantiate the neural network model in a variable manner (e.g., with respect to the number of object classes). The software library stores these parameters in the checkpoint file and thus, no changes to the program are required.
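The following PyTorch sketch illustrates the first adaptation mode (new, dissimilar object classes): the checkpoint is loaded, the last layer is enlarged so that additional feature score maps are produced, only the new kernels are initialized randomly, and all layers except the last are frozen. The attribute name model.classifier, the assumption that it is a final 1x1 convolution, and the optimizer settings are assumptions for illustration.

```python
import torch
import torch.nn as nn

def adapt_model(build_model, checkpoint_path, old_num_classes, new_num_classes):
    """build_model(num_classes) is assumed to instantiate the segmented neural
    network; the checkpoint holds the weights of the previously trained model."""
    model = build_model(old_num_classes)
    model.load_state_dict(torch.load(checkpoint_path))

    # Replace the final 1x1 convolution so that it produces additional feature
    # score maps for the new object classes; only the new kernels are random.
    old_head = model.classifier                     # assumed name of the last layer
    new_head = nn.Conv2d(old_head.in_channels, new_num_classes, kernel_size=1)
    with torch.no_grad():
        new_head.weight[:old_num_classes] = old_head.weight
        new_head.bias[:old_num_classes] = old_head.bias
    model.classifier = new_head

    # Freeze all weights except the last layer(s) used for domain adaptation.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("classifier")

    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)
    return model, optimizer
```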
Additional training may be achieved by means of small sample learning (i.e. a limited number of training data sets for new object classes).
There are cases where the segmented neural network 40 is not directly accessible, e.g. in an autonomous driving system comprising a (main) segmentation system.
In such a case, a separate autonomous device 10' is provided in addition to the main segmentation system, for example a mobile device such as a smartphone.
The autonomous device 10' includes: a sensor (e.g., camera 12') for generating an input image pixel matrix; a segmented neural network 40' (e.g., a segmented Bayesian neural network); and an uncertainty detector 60' configured to determine regions having matrix elements (corresponding to pixels of the input image pixel matrix) that exhibit an uncertainty score above a threshold; see fig. 20. The uncertainty score map is made up of elements corresponding to the pixels in the input image pixel matrix 66, each element of the uncertainty score map having an uncertainty score that is determined by the uncertainty detector 60' and reflects the amount of uncertainty involved in the class prediction for the corresponding element in the segment map 72' generated by the autonomous device 10'. The uncertainty score map may include regions having matrix elements that exhibit an uncertainty score above a threshold and thus represent edge conditions for object classification and image segmentation.
Autonomous device 10' may be used independently or as a second segmentation system.
Preferably, the segmented neural network 40 'of the autonomous device 10' is trained using the same class as the segmented neural network 40 of the primary image segmentation system.
The (main) image segmentation system of an autonomous driving system does in general not provide means for uncertainty detection, whereas the autonomous device 10' comprises an uncertainty detector 60'. Thus, when a region consisting of elements with high uncertainty scores is found in the uncertainty score map within the region of interest of the input image pixel matrix representing the street in front of the vehicle, the autonomous device 10' may act as a security companion that generates a user-perceivable warning signal. Regions in the uncertainty score map that consist of elements with high uncertainty scores typically represent unknown objects and thus edge conditions for object recognition.
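A minimal sketch, assuming a rectangular region of interest covering the street ahead and a simple pixel-count threshold, of how such a warning could be triggered:

```python
import numpy as np

def warning_needed(uncertainty_map, roi, unc_threshold=0.5, min_fraction=0.01):
    """uncertainty_map: (H, W) float array; roi: (y0, y1, x0, x1) region of interest
    covering the street in front of the vehicle.  Returns True if the fraction of
    high-uncertainty pixels inside the region of interest exceeds min_fraction."""
    y0, y1, x0, x1 = roi
    region = uncertainty_map[y0:y1, x0:x1]
    fraction = float((region > unc_threshold).mean())
    return fraction > min_fraction
```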
The autonomous device may include a new object detector 80 operatively connected to the uncertainty detector 60'. The new object detector 80 is configured to find a region consisting of elements with high uncertainty scores in the uncertainty score map and to generate a new object class for such found region.
The autonomous device 10 'may be configured to exchange tagged data sets including segments assigned to recently generated object classes with other autonomous devices 10' to increase the number of known object classes available for semantic segmentation by the autonomous device.
The autonomous device may even record the reaction of a user (e.g., the driver) to a warning signal of the autonomous device. The type of reaction (or the absence of a reaction) may be used as input data for distinguishing between relevant unknown objects and less relevant unknown objects. The type of reaction (or the absence of a reaction) may also be used to automatically generate labels for new object classes. An emergency stop or an evasion maneuver as the driver's reaction indicates a high relevance of the unknown object. No reaction of the user indicates a low relevance. The automatically generated label may reflect the degree of relevance of the new object class.
The autonomous device 10' may learn to discern unknown objects by automatically generating new object classes. Moreover, the autonomous device 10' may learn the relevance level of the respective (new) object class and may thus adapt the warning signal level for that object class. If the autonomous device 10' determines (by image segmentation) that an object is present in the region of interest of the input image pixel matrix, a warning signal may be generated according to the identified relevance of the object.
The autonomous device 10' may be configured to exchange datasets including data representing the relevance level of known object classes or observed user reactions (observed behavior) with other autonomous devices.
For exchanging data sets with other autonomous devices, the autonomous device may comprise a data interface 82, in particular a wireless data interface 82 allowing data exchange via, for example, the internet.
The autonomous device 10' is preferably a pocket-type mobile device that can be easily carried by a user and easily mounted to, for example, a windshield of a vehicle. Preferably, the autonomous device 10' is mounted in a position wherein the perspective of the autonomous device corresponds at least in part to the perspective of the sensor 12 or sensors of the autonomous driving system.
The autonomous device may comprise a segmented neural network 40' implementing a semantic segmentation model, which is preferably continuously trained with the output of the new object detector 80 as input, enabling the neural network to predict new objects that have previously been encountered.
In a preferred embodiment, the output of the new object detector 80 is saved as a label of a dataset that also includes a corresponding matrix of input image pixels and is therefore suitable as a training dataset, see fig. 21, 22 and 23.
The training data set may be transmitted to other autonomous devices over the internet using the mobile data interface 82 of the autonomous device 10' (i.e., mobile phone).
The autonomous device 10 'is preferably adapted to train using a training data set comprising new objects received from other autonomous devices and thus update the semantic segmentation model implemented by the segmentation neural network 40'. Thus, the autonomous device 10' may be enabled to predict edge conditions previously encountered by other similar autonomous devices, see fig. 24.
Preferably, the segmented neural network 40' of the autonomous device 10' implements a model that is trained using the output of the new object detector 80 (i.e., the training dataset generated by the new object detector) as input and the recorded user reactions as secondary input, in order to learn to predict the user reaction when a particular new object is encountered. The source 84 providing data representative of the action performed by the user (i.e., the driver) may be a signal from the vehicle communication bus (in case the autonomous device is connected to the vehicle communication bus) or an algorithm or model that infers the action from the motion of the vehicle as sensed by a sensor of the autonomous device 10' (e.g., the camera 12' of the autonomous device 10'), for example by means of optical flow. User reactions when encountering a particular new object may be used to generate a new object class and a label for the new object class.
The mobile data interface 82 of the autonomous device 10' may be used to send training data sets along with data representing actions performed by the user to other autonomous devices over the internet.
In a particular use case, the autonomous device 10' may be installed behind the windshield of a vehicle with the camera of the autonomous device facing the direction of travel, while executing a program comprising an autonomous driving stack that is connected to the output of the camera of the autonomous device as input and that provides the trajectory determined by the autonomous driving stack to the autonomous driving system of the vehicle through the connection between the autonomous device and the vehicle communication bus.
In another use case, the autonomous device 10' may be installed behind the windshield of the vehicle with the camera facing the direction of travel, executing a program implementing the prediction system of the autonomous travel stack, which program is connected to the output of the camera of the autonomous device as input, and providing the autonomous travel system of the vehicle with the list of objects determined by the perception system through the connection between the autonomous device and the vehicle communication bus.
At the pixel level, the output of the new object detector 80 is a label map in which the unknown object is labeled as a new object, while everything else is labeled as background and assigned to an ignore class. The sensor input together with this label map forms a training dataset. Elements assigned to the ignore class do not cause any loss when a loss function is applied during training of the semantic segmentation model with a training dataset comprising new objects.
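By way of illustration only, a minimal sketch in Python/PyTorch of this ignore-class behaviour; the class indices and the ignore label value are assumptions, not values prescribed by this disclosure:

import torch
import torch.nn as nn

# Hypothetical label convention: 0..C-1 are known classes, C-1 is the new object
# class, and 255 marks the ignore class (background of the new-object label map).
NUM_CLASSES = 11          # 10 known classes + 1 new object class (assumption)
IGNORE_INDEX = 255

criterion = nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)

# logits: N x C x H x W prediction of the segmentation network
# labels: N x H x W label map produced from the new object detector output
logits = torch.randn(1, NUM_CLASSES, 64, 64, requires_grad=True)
labels = torch.full((1, 64, 64), IGNORE_INDEX, dtype=torch.long)
labels[0, 20:40, 20:40] = NUM_CLASSES - 1   # region labelled as the new object

loss = criterion(logits, labels)   # ignore-class pixels cause no loss
loss.backward()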
Thus, the training data set generated by the autonomous device 10' may easily be shared with other devices without having to transfer a large amount of data, since the training data set only includes data relevant for the new object class.
The autonomous device 10' can easily receive as input training data sets of new object classes from other autonomous devices, see fig. 25.
Furthermore, when a detected new object is inserted into the context of another autonomous driving system, or of another autonomous device that encounters a similar situation, false-positive new objects caused by that context (i.e., objects that are not new but already known, yet may cause high uncertainty scores at the pixel level due to other circumstances) may be identified by determining the uncertainty of the detected new object.
When a new object is inserted into a known context by an autonomous driving system or another autonomous device, false alarms of the new object due to the context can be avoided by determining an uncertainty of the new object.
Another aspect is a perception system 90 in which a secondary system 92 and a primary system 94 of a vehicle are combined to form a system that can determine regions with elements exhibiting a high level of uncertainty even in the matrix provided by the primary system. This is important because the primary system may be a proprietary system and thus a black box that is not externally accessible. Thus, the primary system may not be modifiable to generate variances (e.g., by means of Monte Carlo dropout) and thereby determine uncertainty scores. However, the additional secondary system may be configured similarly to the autonomous device 10' previously disclosed herein and is thus capable of identifying regions having elements exhibiting high uncertainty scores.
The primary system 94 includes the segmented neural network 40 and an object list generator 96 that generates an object list corresponding to the segments generated by the segmented neural network 40. Objects are aligned, associated, fused, and managed in the sensor fusion system 98. The sensor fusion system 98 is implemented by a Bayesian filter, a probabilistic robotics algorithm. The algorithm indicates its uncertainty level to its successor, for example by means of a Kalman gain. Based on this uncertainty, the successor selects either the sensor input or the prediction made by the Bayesian filter. However, the primary system 94 is not Bayesian, so the Bayesian filter in the sensor fusion system 98 assumes a static uncertainty and is calibrated only once for each sensor. The use of the secondary system 92 enables the perception system 90 to provide uncertainty, see fig. 26.
The secondary system 92 preferably includes: a plurality of parallel redundant uncertainty detectors 60 (e.g., bayesian neural networks, variational automatic encoders), an object detector 80 trained with edge conditions that have been encountered, and a new object detector 80 trained with new objects that have been encountered but are to be suppressed.
Preferably, the outputs of multiple parallel redundant uncertainty detectors employing a parallel redundancy architecture are combined on a per-pixel basis, e.g., by a per-pixel sum, by a per-pixel maximum over the uncertainty, or by means of a Bayesian filter (e.g., a particle filter).
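A minimal sketch (Python/NumPy) of this per-pixel combination, assuming the detectors deliver uncertainty maps of equal size; the function name and value range are illustrative assumptions:

import numpy as np

def fuse_uncertainty_maps(maps, mode="max"):
    """Combine the per-pixel outputs of several parallel redundant
    uncertainty detectors into a single uncertainty map.

    maps: list of H x W arrays with per-pixel uncertainty scores (assumption).
    mode: "max" keeps the per-pixel maximum, "sum" the per-pixel sum.
    """
    stacked = np.stack(maps, axis=0)          # D x H x W
    if mode == "max":
        return stacked.max(axis=0)
    if mode == "sum":
        return stacked.sum(axis=0)
    raise ValueError(f"unknown mode: {mode}")

# toy example with two parallel detectors
fused = fuse_uncertainty_maps([np.random.rand(4, 4), np.random.rand(4, 4)])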
Preferably, the semantic segmentation model is trained with the new-object training dataset as input to the semantic segmentation model, and the output of the one or more parallel redundant uncertainty detectors is fed as input to a supervisor model that is trained to predict semantic segmentations with two classes, indicating whether the entire segmentation is correctly or incorrectly predicted.
Preferably, uncertainty that has not yet been confirmed is matched with segment boundaries in the semantic segmentation by means of a rule-based system.
Preferably, the pixel uncertainties are aggregated into a segment uncertainty, for example by means of a threshold for the sum or variance of the pixel uncertainties.
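A minimal sketch (Python/NumPy) of aggregating pixel uncertainties into a segment uncertainty; the thresholds on the sum and variance are illustrative assumptions:

import numpy as np

def segment_uncertainty(uncertainty_map, segment_map, sum_threshold=50.0, var_threshold=0.05):
    """Aggregate per-pixel uncertainty into per-segment uncertainty.

    uncertainty_map: H x W array of per-pixel uncertainty scores.
    segment_map:     H x W array of integer segment ids.
    A segment is flagged when the sum or the variance of its pixel
    uncertainties exceeds a threshold (threshold values are assumptions).
    """
    result = {}
    for seg_id in np.unique(segment_map):
        scores = uncertainty_map[segment_map == seg_id]
        total, var = float(scores.sum()), float(scores.var())
        result[int(seg_id)] = {
            "sum": total,
            "variance": var,
            "flagged": total > sum_threshold or var > var_threshold,
        }
    return result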
In another preferred embodiment, the semantic segmentation model is trained with the new-object training dataset as input to the semantic segmentation model, and the output of one or more parallel redundant uncertainty detectors is fed as input to a supervisor model trained to predict semantic segmentations with two classes, indicating whether the entire segmentation is correctly or incorrectly predicted.
To train the Bayesian segmented neural network 40 by means of transfer learning with a training data set that originates from a segmented neural network of unknown structure, a system as shown in fig. 27 may be provided. The system is configured to determine a loss between a segment map 54A generated by the unknown segmented neural network from the input image pixel matrix of the training data set and a segment map 54B provided with the training data set. The segment map 54B provides the labels for the input image matrix, generated by the segmented neural network that produced the training dataset. The system shown in fig. 27 is configured to determine the loss (determined by the loss function 76) between the segment map 54A generated by the unknown segmented neural network and the segment map 54B belonging to the training dataset. A high loss indicates where the labels provided with the training data set differ from the labels (i.e., segments) generated by the unknown segmented neural network. Elements or pixels exhibiting a high loss may be assigned to the ignore class, thereby preventing these image portions from influencing the training of the Bayesian segmented neural network by transfer learning using a training data set, e.g., generated by the autonomous device disclosed above. In other words, the loss function 78 used to train the Bayesian segmented neural network 40 ignores elements assigned to the ignore class by means of the loss function 76.
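The relabeling step described above can be sketched as follows in Python/PyTorch; the loss threshold and the ignore label are assumptions, and the function is only an illustration of the idea, not the claimed implementation:

import torch
import torch.nn.functional as F

IGNORE_INDEX = 255   # assumed ignore label

def relabel_high_loss_pixels(logits_unknown_net, labels_from_dataset, loss_threshold=2.0):
    """Compare the prediction of the unknown segmented neural network with the
    labels shipped with the training data set (cf. loss function 76) and assign
    pixels with a high per-pixel loss to the ignore class, so that they do not
    influence the subsequent transfer-learning step (cf. loss function 78).

    logits_unknown_net:  N x C x H x W logits of the unknown network.
    labels_from_dataset: N x H x W integer labels of the training data set.
    loss_threshold:      assumed cut-off for a "high" per-pixel loss.
    """
    per_pixel_loss = F.cross_entropy(
        logits_unknown_net, labels_from_dataset, reduction="none")  # N x H x W
    cleaned = labels_from_dataset.clone()
    cleaned[per_pixel_loss > loss_threshold] = IGNORE_INDEX
    return cleaned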
The Bayesian segmented neural network 40 may be, for example, the segmented neural network of the autonomous device 10', while the unknown trained segmented neural network may be the segmented neural network of the autonomous driving system.
Fig. 28 illustrates training of the variational automatic encoder 62' by transfer learning using the data set that serves as input to a (known or unknown) trained segmented neural network. The dataset is used as input to the variational automatic encoder. However, when training the variational automatic encoder, the dataset used as its input is modified by calculating the loss of the pixels predicted by the trained segmented neural network and assigning pixels with a high loss to the ignore class.
In other embodiments, the variational automatic encoder 62' may implement a known model of a trained segmented neural network.
Fig. 29 illustrates uncertainty detection by means of a variational automatic encoder 62' (as a generating neural network) in order to determine pixels with high losses due to uncertainty. Where an input data set (e.g., an input image pixel matrix) that includes an unknown object is used as input, for example, to a trained neural network for semantic segmentation, the pixels representing the unknown object should exhibit a high uncertainty score. The input dataset 66 is fed as input to the variational automatic encoder 62' and to the loss function 80. The prediction 82 of the variational automatic encoder 62' is also fed to the loss function 80 to determine the loss between the input image pixel matrix dataset 66, which may include data representing an unknown object, and the prediction 82 (output dataset) of the variational automatic encoder 62'. Thus, the loss of the pixels predicted by the trained neural network may be calculated, and an uncertainty score map 74' may be generated accordingly. In fig. 29, the uncertainty score map 74' includes a section of pixels with high uncertainty scores where the pixels of the input image pixel matrix 66 represent an unknown object (yellow box in fig. 29).
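A minimal, hedged sketch (Python/PyTorch) of using the per-pixel reconstruction loss of an auto-encoder as an uncertainty score map; the squared-error loss and the normalisation are assumptions:

import torch

def reconstruction_uncertainty(image, reconstruction):
    """Per-pixel reconstruction loss of a (variational) auto-encoder used as an
    uncertainty score map: pixels of objects the model has never seen tend to
    reconstruct poorly and therefore receive a high score.

    image, reconstruction: N x 3 x H x W tensors in [0, 1].
    Returns an N x H x W map normalised to [0, 1] (normalisation is an assumption).
    """
    per_pixel = ((image - reconstruction) ** 2).mean(dim=1)   # mean over channels
    per_pixel = per_pixel / (per_pixel.amax(dim=(1, 2), keepdim=True) + 1e-8)
    return per_pixel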
To train the variational automatic encoder (see fig. 28), pixels with high losses can be ignored. In a preferred embodiment, the variational automatic encoder to be used for determining uncertainty (i.e., pixels with high loss) is configured as a Bayesian neural network that applies variation, for example by Monte Carlo dropout. Thus, the reliability of the loss determined by means of the variational automatic encoder can be determined by variational reasoning.
FIG. 30 illustrates the use of single sample learning to cause an uncertainty detector to suggest the most likely class of pixels belonging to an unknown object based on similarity to known classes. The method may also be used to generate automatically tagged new object classes.
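A minimal sketch (Python/PyTorch) of such a similarity-based class proposal; the use of one prototype embedding per known class and cosine similarity is an assumption made for illustration:

import torch
import torch.nn.functional as F

def propose_similar_class(unknown_embedding, class_prototypes):
    """Single-sample style class proposal: compare the embedding of an unknown
    object with one prototype embedding per known class and return the most
    similar class together with its similarity, which can be used both as a
    proposal and as a label for an automatically generated new object class.

    unknown_embedding: 1 x D feature vector of the unknown object.
    class_prototypes:  dict {class_name: 1 x D prototype embedding}.
    """
    best_name, best_sim = None, -1.0
    for name, proto in class_prototypes.items():
        sim = F.cosine_similarity(unknown_embedding, proto).item()
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name, best_sim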
Fig. 31 illustrates that multiple parallel redundant uncertainty detectors 60 may be implemented in a parallel redundancy architecture, where all uncertainty detectors 60 share a common encoder network.
In fig. 32, an alternative device 100 for object recognition is illustrated. The device 100 may be a smartphone that may be mounted behind a windshield of a vehicle or held in the hand. The device 100 may be equipped with one or more cameras 102 for generating a video stream 104 that is fed to a neural engine 106 or similar processing unit. The neural engine 106 is configured to generate an object list 108. To feed the object list 108 to other devices, an output terminal 110 is provided. The output terminal may be, for example, a USB terminal (universal serial bus) or an Ethernet terminal. The output terminal 110 may also be a wireless terminal using a wireless local area network (WLAN, Wi-Fi) protocol or a Bluetooth interface.
Moreover, an input terminal may be provided in addition to or in lieu of one or more cameras 102 in order to receive video streams. The input terminal may be a universal serial bus terminal (USB terminal), an ethernet terminal or a wireless local area network terminal or a bluetooth terminal. Such an input terminal enables the device 100 to receive video streams from one or more external cameras, devices or smartphones connected to the device 100.
The device 100 may be configured (by means of the neural engine 106) to generate a list of objects and provide the list of objects to other devices. The object list 108 may include a location of the object with two-dimensional or three-dimensional coordinates that preferably locate the object in a camera coordinate system. Furthermore, each location is preferably annotated with a class, e.g., the class of the identified object. The location may be annotated with a time interval and the location may also be annotated with an uncertainty score. Thus, the object list and location 108 includes, for each identified object, an identifier of the object, coordinates identifying the location of the identified object in the camera coordinate system, a timestamp providing information about the time interval when the identified object was in the location identified by the coordinates, and an uncertainty score providing information about how reliable the object identification is with respect to the particular identified object.
The position of the identified object may be encoded, for example, by a polygon surrounding the object in polar coordinates. Since the object is identified in frames of the video stream, it is also possible to determine how the position of the object changes from frame to frame. This allows the generation of a trajectory combining a position with two- or three-dimensional coordinates, optional annotations, and a future time. Such a trajectory describes the position at which an object is expected at that time in the future. The annotated list of objects and positions may be forwarded to another connected device via any of the interfaces mentioned earlier (i.e., USB, Ethernet, WLAN and/or Bluetooth).
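For illustration, one possible (assumed) Python data structure for an entry of such an annotated object list; the field names are not prescribed by this disclosure:

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class DetectedObject:
    """One entry of the annotated object list 108 (field names are assumptions)."""
    object_id: int
    object_class: str
    polygon: List[Tuple[float, float]]       # polygon around the object, camera coordinates
    position: Tuple[float, float, float]     # 2-D or 3-D position in the camera coordinate system
    timestamp: float                         # time interval / time stamp of the observation
    uncertainty: float                       # uncertainty score of the classification
    trajectory: List[Tuple[float, Tuple[float, float, float]]] = field(default_factory=list)
    # (future time, expected position) pairs describing the expected trajectory

object_list: List[DetectedObject] = []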
The device 100 may also be adapted to provide a video output 112 and/or an acoustic signal or message via an audio output 116. The device may also be configured to trigger the start or stop of recording.
Fig. 32 illustrates a configuration of a neural engine suitable for object recognition. The neural engine 106 implements a segmented neural network 120 having an encoder module 122 and a decoder module 124. As described further herein and above, the encoder 122 includes a downsampling module 126 that includes an input layer for downsampling an input image pixel matrix (e.g., frames of a video stream). The layers for downsampling the input image pixel matrix may implement variational reasoning (as further described above) with partial model sampling, and/or with continuous video sampling, and/or with a Lipschitz constraint for spectral normalization of the softmax values. Furthermore, the layers for downsampling may be configured to process simultaneous inputs, i.e., inputs at a plurality of different points in time. This may be achieved by a time shift of the video frames or the input image pixel matrices, respectively. In addition to downsampling, the encoder 122 also provides a feature extraction module 128. Again, the encoder layers for feature extraction may implement a time shift, and/or variational reasoning, optionally with partial model sampling, and/or with continuous video sampling, and/or with a Lipschitz constraint for spectral normalization of the softmax values. The results of the feature extraction are provided to a feature fusion module 130 of the decoder 124 of the segmented neural network. Feature fusion is preferably achieved by means of Kalman filtering, wherein feature matching is achieved via the feature locations determined by the feature extraction module 128 of the encoder module 122. The feature tensor generated by the feature extraction module 128 of the encoder 122 captures spatial details of the input image pixel matrix. After feature fusion by the feature fusion module 130, the classifier module 132 of the decoder classifies each pixel of the input image pixel matrix by assigning it to one of the object classes for which the convolutional neural network has been trained.
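A strongly simplified sketch (Python/PyTorch) of such an encoder-decoder segmentation network with the four modules named above; layer sizes are arbitrary assumptions, and the variational and Lipschitz options are omitted:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySegmentationNet(nn.Module):
    """Minimal encoder-decoder sketch with the modules named in the text:
    downsampling (126), feature extraction (128), feature fusion (130) and
    pixel-wise classification (132). Channel sizes are arbitrary assumptions."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.downsample = nn.Sequential(                       # 126
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.features = nn.Sequential(                         # 128
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(64 + 128, 128, 1)                # 130: fuse spatial detail + context
        self.classify = nn.Conv2d(128, num_classes, 1)         # 132

    def forward(self, x):
        detail = self.downsample(x)                 # spatial detail branch
        context = self.features(detail)             # context branch
        fused = F.relu(self.fuse(torch.cat([detail, context], dim=1)))
        logits = self.classify(fused)
        # upsample back to input resolution; softmax gives per-class score maps
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear", align_corners=False)

net = TinySegmentationNet(num_classes=11)
out = net(torch.randn(1, 3, 128, 128))   # 1 x 11 x 128 x 128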
The segmented neural network as shown in fig. 32 may be used for object recognition and is capable of recognizing an object (i.e., determining a segment having pixels belonging to a certain object class for which the segmented neural network is trained and encoding the segment by using polygons surrounding the recognized object). Preferably, the polygons are annotated with the class of identified objects.
For object recognition, the segmented neural network of fig. 32 includes an encoder-decoder architecture and receives images (i.e., frames) from an input video stream. The segmented neural network uses the softmax values of the pixel-wise segment prediction to generate one feature map per object class. The encoder 122 of the semantic segmentation model of fig. 33 includes the downsampling module 126 for downsampling the input image pixel matrix and the feature extraction module 128 for feature extraction; as shown by the dashed lines, the feature extraction module 128 may be configured to output information about context and/or information about spatial details. Information about the context and/or spatial details may be fed to the feature fusion module 130 of the decoder, allowing location- and/or context-based feature fusion.
The feature fusion module 130 of the decoder is optional and is preferably provided in the case of multiple inputs from different layers of the encoder. Preferably, at least one input is provided directly from the last layer of the downsampling module 126 of the encoder 122. Further, one or more inputs from inner layers of the feature extraction module may be provided to the feature fusion module.
The classification is performed pixel by pixel.
Alternatively, the semantic segmentation module as shown in fig. 33 may receive secondary input from the depth estimation module as illustrated in fig. 37.
The semantic segmentation module (segmented neural network) may also implement polygon regression.
Further, the annotated object list may include annotations with a relative uncertainty score that indicates the reliability of the classification and, if the uncertainty score is high, potentially suggests an object class that is not yet known. Further, the object list generated by the segmented neural network as shown in fig. 33 may include annotations identifying proposed similar object classes with a high degree of uncertainty.
In a preferred embodiment of the device 100, the neural engine 106 is configured to extract features from one or more input image pixel matrices (i.e., frames of a video stream). The model implemented by the neural network of the neural engine 106 may be configured to extract one or more of the following features:
object class (fig. 32 and 33)
Gesture class (fig. 34, 35, 36, 37)
-action class
Consciousness class (fig. 38), and/or
Forecasting of action classes (FIGS. 39 and 40)
For feature extraction, the neural engine 106 implements one or more neural networks, preferably with encoder-decoder architecture.
The input data set fed to the respective input layer of the downsampling module of the encoder of the respective neural network depends on the features to be extracted.
To identify object classes, a segmentation neural network implementing a semantic segmentation module as illustrated in FIG. 33 may be used. The input data set is an input image pixel matrix (frame) of one or more video input streams provided by one or more cameras.
If the feature to be extracted is a gesture class signaled by a person or vehicle (preferably encoded by the polygon surrounding the object), a neural network implementing a video action or recognition model as illustrated in fig. 35 may be used, optionally receiving a secondary input from a depth estimation model as illustrated in fig. 37.
If the feature to be extracted is a conscious class of person or vehicle (preferably encoded by polygons surrounding the object), a neural network implementing a semantic segmentation model as illustrated in FIG. 33 and/or a video action recognition model as illustrated in FIG. 35 may be used.
In order to forecast the class of actions to be performed by a person or vehicle in the future (preferably encoded by polygons surrounding an object), a neural network implementing an action forecast model as illustrated in fig. 40 may be used. The model is preferably configured to receive a video input stream.
Regarding the output of the respective neural network, it is preferable that:
object recognition, the semantic segmentation model preferably provides a list with object classes and positions, see fig. 33;
-gesture class recognition, the video action recognition model preferably provides one anchor point (anchor) per recognized object, which encodes a polygon surrounding the object and the class of the gesture or action performed by the object;
motion class recognition, the video motion recognition model as illustrated in fig. 35 (preferably receiving secondary input from the depth estimation model as illustrated in fig. 37) provides an annotation of the motion class performed by the recognized object, preferably a person or a vehicle;
-recognition of a class of consciousness of a person or a vehicle, the neural network preferably providing an annotation of the class of consciousness of a person or vehicle recognized by means of the semantic segmentation module of fig. 33 and/or the video action recognition model of fig. 35; and
forecast, by means of the action forecast module as illustrated in fig. 40, of the class of actions to be performed in the future by a person or a vehicle (preferably encoded by a polygon surrounding the object): one anchor point is generated per object identified by the semantic segmentation module of fig. 33, which anchor point encodes the time interval around the polygon of the object, the class of the action to be performed by the object, and the class of the expected action.
When comparing the models illustrated in fig. 33, 35, 37, and 40, it is apparent that these models may share the same encoder with the same downsampling module and the same feature extraction model. Furthermore, the models may share the same feature fusion module of the respective decoder. However, the classification module of the corresponding decoder varies according to the feature to be extracted.
This allows the neural engine to be implemented with a multi-headed architecture as illustrated in fig. 31, where one encoder provides outputs to different decoders.
Implementation details of the model as illustrated in fig. 33, 35, 37, and 40 can be summarized as follows:
for object recognition, a semantic segmentation model according to fig. 33 is preferably provided.
The model preferably implements an encoder-decoder architecture and receives images from a video input stream.
The semantic segmentation model uses the softmax values of the pixel-wise segment prediction to generate a feature map for each class.
Encoder 122 includes a downsampling module 126 for downsampling a matrix of input image pixels.
Encoder 122 also includes a feature extraction module 128 for feature extraction.
The encoder preferably provides parallel instantiation of context and/or spatial details (see dashed lines). This means that information about the context or spatial details is fed to the feature fusion module 130 to facilitate feature fusion.
In the case of multiple inputs from different layers of the encoder 122, the decoder 124 preferably includes a feature fusion module 130.
The feature fusion module 130 preferably receives one input directly from the last layer of the downsampling module 126, and/or preferably one input from the last layer of the feature extraction module 128, and/or one or more inputs (instantiations of context and spatial details) from the inner layers of the feature extraction module 128.
Decoder 124 preferably includes a classification module 132 for pixel-by-pixel classification.
The semantic segmentation model may receive secondary inputs from a depth estimation model as illustrated in fig. 37, and their combinations are summarized in fig. 51 and below.
The semantic segmentation model preferably implements polygon regression.
In the semantic segmentation model as illustrated in fig. 33, all dashed connections are optional. A time shift (10) is added, which improves the semantic prediction by providing simultaneous inputs of m different points in time. Further preferred features are variational reasoning (13) with partial model sampling (12), with continuous video sampling (11), and with a Lipschitz constraint (14) for spectral normalization of the softmax values.
For gesture class recognition, a video motion recognition model according to fig. 35 is preferably provided.
The video motion recognition model preferably implements an encoder-decoder architecture and receives a video input stream.
The video action recognition model generates an anchor point for each recognized object that encodes polygons surrounding the recognized object and gesture or action classes performed by the recognized object.
Encoder 122 includes a downsampling module 126 for downsampling a matrix of input image pixels.
Encoder 122 also includes a feature extraction module 128 for feature extraction.
Encoder 122 preferably provides parallel instantiation of context and/or spatial details (see dashed lines). This means that information about the context or spatial details is fed to the feature fusion module 130 to facilitate feature fusion.
Encoder 122 is preferably configured as a polygon-by-polygon regression of polygons, and/or classification of objects, and/or classification of gestures or actions.
Encoder 122 preferably includes one or more time shifting modules, as illustrated in fig. 36. Such a time shift module 140 may be inserted into the downsampling module 126, and/or into the feature extraction module 128, and/or into the feature fusion module 130.
In the case of multiple inputs from different layers of the encoder 122, the decoder 124 preferably includes a feature fusion module 130.
The feature fusion module 130 preferably receives one input directly from the last layer of the downsampling module 126, and/or preferably one input from the last layer of the feature extraction module 128, and/or one or more inputs (instantiations of context and spatial details) from the inner layers of the feature extraction module 128.
Decoder 124 preferably includes a classification module 132 for generating an anchor point per object that encodes polygons surrounding the object and the performed gesture or action class.
The video motion recognition model may receive secondary inputs from a depth estimation model as illustrated in fig. 37, and their combinations are summarized in fig. 51 and below.
In the video action recognition model as illustrated in fig. 35, all dashed connections are optional. Optional connectors 7, 8, 9 are added for parallel instantiation of context and spatial details. The classification module 132 is based on the head of Poly-YOLO (https://arxiv.org/abs/2005.13243). An optional time shift module 140 is added to detect actions over time by providing simultaneous inputs (10) of m different points in time. Further preferred features are variational reasoning (13) with partial model sampling (12), with continuous video sampling (11), and with a Lipschitz constraint (14) for spectral normalization of the softmax values.
The classification module 132 of the video motion recognition model as illustrated in fig. 35 and the classification module 132 of the motion forecast model as illustrated in fig. 40 are trained differently, i.e. with different training data sets, and thus are different.
The depth estimation model as illustrated in fig. 37 preferably implements an encoder-decoder architecture.
Encoder 122 includes a downsampling module 126 for downsampling a matrix of input image pixels.
Encoder 122 also includes a feature extraction module 128 for feature extraction.
Encoder 122 preferably provides parallel instantiation of context and/or spatial details (see dashed lines). This means that information about the context or spatial details is fed to the feature fusion module 130 to facilitate feature fusion.
Encoder 122 preferably includes one or more time shifting modules, as illustrated in fig. 36. Such a time shift module 140 may be inserted into the downsampling module 126, and/or into the feature extraction module 128, and/or into the feature fusion module 130.
In the case of multiple inputs from different layers of the encoder 122, the decoder 124 preferably includes a feature fusion module 130.
The feature fusion module 130 preferably receives one input directly from the last layer of the downsampling module 126, and/or preferably one input from the last layer of the feature extraction module 128, and/or one or more inputs (instantiations of context and spatial details) from the inner layers of the feature extraction module 128.
The decoder 124 preferably includes a classification module 132 for depth estimation.
The depth estimation model may receive secondary inputs from a semantic segmentation model as illustrated in fig. 33, and their combinations are summarized in fig. 51 and below. The video action recognition model may also generate annotations regarding the relative values of uncertainty that indicate an unknown gesture class or action class, and/or that another object is mistaken for a person or vehicle.
In the depth estimation model as illustrated in fig. 37, all dashed connections are optional. Optional connectors 7, 8, 9 are added for parallel instantiation of context and spatial details. The classification module 132 is based on the head of Poly-YOLO (https://arxiv.org/abs/2005.13243). An optional time shift module 140 is added to detect actions over time by providing simultaneous inputs (10) of m different points in time. Further preferred features are variational reasoning (13) with partial model sampling (12), with continuous video sampling (11), and with a Lipschitz constraint (14) for spectral normalization of the softmax values.
To forecast the action class, an action forecast model according to fig. 40 is preferably provided.
The motion forecast model preferably implements an encoder-decoder architecture and receives a video input stream.
The video action forecast model generates an anchor point for each identified object that encodes time intervals around the polygons of the identified object, the class of actions to be performed by the object, and the class of expected actions.
Encoder 122 includes a downsampling module 126 for downsampling a matrix of input image pixels.
Encoder 122 also includes a feature extraction module 128 for feature extraction.
Encoder 122 preferably provides parallel instantiation of context and/or spatial details (see dashed lines). This means that information about the context or spatial details is fed to the feature fusion module 130 to facilitate feature fusion.
Encoder 122 is preferably configured as a polygon-by-polygon regression of polygons, and/or classification of objects, and/or classification of gestures or actions.
Encoder 122 preferably includes one or more time shifting modules, as illustrated in fig. 36. Such a time shift module 140 may be inserted into the downsampling module 126, and/or into the feature extraction module 128, and/or into the feature fusion module 130.
In the case of multiple inputs from different layers of the encoder 122, the decoder 124 preferably includes a feature fusion module 130.
The feature fusion module 130 preferably receives one input directly from the last layer of the downsampling module 126, and/or preferably one input from the last layer of the feature extraction module 128, and/or one or more inputs (instantiations of context and spatial details) from the inner layers of the feature extraction module 128.
Decoder 124 preferably includes a classification module 132 for generating an anchor point for each identified object that encodes a time interval surrounding the polygon of the identified object, the class of action to be performed by the object, and the class of intended action.
The action forecast model may receive secondary inputs from the semantic segmentation model as illustrated in fig. 33, and their combinations are summarized in fig. 51 and below. The action forecast model may receive secondary inputs from the depth estimation model as illustrated in fig. 37, and their combinations are summarized in fig. 51 and below. The motion forecast model may receive secondary inputs from a video motion recognition model as illustrated in fig. 35, and their combinations are summarized in fig. 51 and below.
The video action recognition model may also generate an annotation on the relative value of uncertainty indicating a situation where another object is mistaken for a person or vehicle.
In the action forecast model as illustrated in fig. 40, all dashed connections are optional. Optional connectors 7, 8, 9 are added for parallel instantiation of context and spatial details. The classification module 132 is based on the head of Poly-YOLO (https://arxiv.org/abs/2005.13243). An optional time shift module 140 is added to detect actions over time by providing simultaneous inputs (10) of m different points in time. Further preferred features are variational reasoning (13) with partial model sampling (12), with continuous video sampling (11), and with a Lipschitz constraint (14) for spectral normalization of the softmax values.
The classification module 132 of the video motion recognition model as illustrated in fig. 35 and the classification module 132 of the motion forecast model as illustrated in fig. 40 are trained differently, i.e. with different training data sets, and thus are different.
Training of the classification module 132 of the action forecast model as illustrated in fig. 40 may be performed using unsupervised learning with labels for the training data sets generated by the video action recognition model as illustrated in fig. 35. To this end, a time shift is applied in order to learn, from a sequence of actions previously identified by the video action recognition model as illustrated in fig. 35, the expected action for the action forecast.
As shown in fig. 33, 35, 37 and 40, the time shift for detecting actions over time is preferably achieved by means of simultaneous inputs providing different points in time. To generate such a time shift, a time shift module 140 as illustrated in fig. 36 may be provided. The time shift module 140 is based on https://arxiv.org/abs/1811.08383. Optional support for m features in the time dimension, rather than just one feature, is added.
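A minimal sketch (Python/PyTorch) of the temporal shift idea referenced above; the fraction of shifted channels is an assumption:

import torch

def temporal_shift(x, shift_div=8):
    """Sketch of a temporal shift: a fraction of the feature channels is shifted
    one frame forward and one frame backward in time, so that a 2-D network can
    mix information from m different points in time at little extra cost.

    x: N x T x C x H x W feature tensor (T frames).
    """
    n, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # remaining channels unchanged
    return out

shifted = temporal_shift(torch.randn(1, 4, 64, 16, 16))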
Fig. 34 illustrates an example of gesture recognition. The gesture to be recognized is "head rotation" of the recognized object "rider", or "signaling a turn" of the recognized object "rider". Fig. 38 illustrates awareness recognition. Depending on the orientation of the rider's head, the identified object "rider" is provided with an annotation representing an awareness class, i.e., "unaware" or, in all other cases, "aware". As exemplified herein, for each class annotation (e.g., "unaware" or "aware"), an uncertainty score may be determined and assigned to the class annotation.
FIG. 39 illustrates a motion forecast for an identified object "rider". From this situation (identified lane and identified non-moving object "car"), a forecast of the position of the object "rider" is generated.
The above model may be used to implement, for example, the following functions:
a. Obtaining the controllability of a situation from a person's or vehicle's awareness of the own vehicle
b. Obtaining risk estimates for individuals or vehicles
i. In the case of severity
1. Based on their class
2. And/or based on gestures or actions they perform
3. And/or based on actions they expect to perform
4. And/or based on their protectiveness
5. And/or based on their distance to the own vehicle
6. And/or based on their movement toward the own vehicle
7. And/or based on their movement away from the own vehicle
8. And/or based on their acceleration
9. And/or based on their deceleration
in the case of controllability
1. Based on their awareness of the own vehicle
2. And/or based on gestures or actions they perform
3. And/or based on the time they have available to react to the behavior of the own vehicle
a. Based on their distance to the own vehicle
b. And/or based on their movement toward the own vehicle
c. And/or based on their movement away from the own vehicle
d. And/or based on their acceleration
e. And/or based on their deceleration
integration by means of probability filters, e.g. Kalman filters
c. Obtaining the velocity of each object
i. From the depth variation (first derivative)
d. Obtaining the acceleration and/or deceleration of each object
i. From the depth variation (second derivative); see the sketch after this list
e. Obtaining possible trajectories for persons or vehicles based on possible actions that are likely to occur in the future
f. Triggering the start or stop of image or video recording
i. With respect to the appearance or disappearance of object classes
Or about the appearance or disappearance of gesture classes
Or regarding a person or vehicle entering or leaving the line of sight of the own vehicle
Or about the appearance or disappearance of action classes
v. or regarding the possibility of predicting the occurrence of an action within a time interval
Or about the appearance or disappearance of uncertainty-based edge conditions (i.e., unknown object classes or action classes)
Or whether the predicted and actual actions are not the same
Learning triggers
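For items c., d. and g. of the list above, a minimal sketch (Python/NumPy) of deriving velocity and acceleration from the depth variation and applying a configurable risk threshold; the numeric values are assumptions:

import numpy as np

def depth_derivatives(depths, dt):
    """Estimate relative velocity (first derivative) and acceleration (second
    derivative) of an object from a sequence of per-frame depth estimates.

    depths: 1-D array of depth values of one object, one value per frame.
    dt:     frame interval in seconds.
    """
    velocity = np.gradient(depths, dt)        # m/s, negative = approaching
    acceleration = np.gradient(velocity, dt)  # m/s^2
    return velocity, acceleration

depths = np.array([12.0, 11.2, 10.3, 9.2, 8.0])   # toy depth track (assumption)
v, a = depth_derivatives(depths, dt=0.1)
risk_flag = v[-1] < -5.0                          # assumed configurable threshold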
The above model may also be used to implement, for example, the following use cases:
a. for any action performed by a person or vehicle (the model of which has been trained to recognize)
i. Generating an alert
And/or video recording, with license plate annotation in the case of vehicles
Providing this information to the autonomous driving system
When a person, rider or motorcyclist is about to cross a street
1. Generating an alert
2. And/or providing the autonomous driving system with the information
When a rider, motorcyclist or vehicle signals a turn or a lane change
1. Generating an alert
2. And/or providing the autonomous driving system with the information
When a person, rider or motorcyclist is unaware of the self-vehicle
1. Generating an alert
2. And/or providing the autonomous driving system with the information
When a person, rider or motorcyclist becomes aware of the self-vehicle
1. Generating an alert
2. And/or providing the autonomous driving system with the information
viii. When children are playing on a sidewalk or at the roadside
1. Generating an alert
2. And/or providing the autonomous driving system with the information
When an emergency or police vehicle displays flashing lights
1. For example blue light (e.g. Germany)
2. And/or red or orange light, for example (e.g., U.S.)
3. Generating an alert
4. And/or providing the autonomous driving system with the information
b. When a person or vehicle performs unusual actions (the model of which has not been trained to recognize)
i. Generating an alert
And/or video recording, with license plate annotation in the case of vehicles
Providing this information to the autonomous driving system
c. For any situation where a person or vehicle appears and the model has been trained to recognize
i. Generating an alert
And/or video recording, with license plate annotation in the case of vehicles
Providing this information to the autonomous driving system
When a construction site exists
1. Generating an alert
2. And/or providing the autonomous driving system with the information
Identifying street signs and displaying the latest identified street signs on a screen
1. Identifying speed limits from street signs
2. Generating a warning when the actual speed deviates from the speed limit
a. According to configurable amounts or fractions
3. Identifying a street sign with an attached tag
a. And/or display this fact on the screen
4. And/or providing the autonomous driving system with the information
When the clearance to an obstacle becomes too small, and/or if this happens too fast
1. Generating an alert
2. And/or providing the autonomous driving system with the information
d. When a person or vehicle is present in an unusual situation (the model of which has not been trained to identify)
i. Generating an alert
And/or video recording, with license plate annotation in the case of vehicles
Providing this information to the autonomous driving system
e. Learning a particular gesture or action
i. Triggering video recording
Dismissing a warning
Starting an app
Calling a contact or releasing a call
Starting or stopping playing media
f. In response to a particular gesture or action
i. Triggering video recording
1. For example, for a person performing a particular exercise during a physical workout
a. Triggering the start of recording at the start of a particular workout
b. And/or triggering the recording to stop at the end of the exercise
Dismissing a warning
Triggering the launch of an app on a handset
Triggering or releasing a call to a contact
Triggering the start or stop of playing media
g. For any risk relating to any person or vehicle
i. Crossing a configurable threshold
Generating a warning
And/or displaying this fact on the screen
Providing this information to the autonomous driving system
FIG. 41 illustrates a use case of uncertainty identification.
For uncertainty identification, a cascade monitoring concept as illustrated in fig. 42 may be used.
The cascade monitoring concept 150 may include two parallel situation monitors 152 for identifying edge conditions with respect to the data set, in order to obtain, from the pixel-by-pixel uncertainty, segments encoded by polygons surrounding each segment.
The master situation monitor 152 implements a model that matches pixel-by-pixel reconstruction losses of the automatic encoder model to the segments by means of a model that has the same topology as the classification module of the decoder of the motion recognition model. The model may be configured to match pixel-by-pixel cognitive uncertainty with a segment boundary having pixel-by-pixel cognitive uncertainty provided by the segmentation model and/or by the depth estimation model and/or optionally by the auto-encoder model.
Alternatively, the model may be configured to match pixel-by-pixel uncertainty that has not yet been confirmed with the segment boundaries, in order to identify cases where multiple overlapping segments have the same class and there is insufficient knowledge about the uncertainty.
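A minimal sketch (Python, using NumPy and SciPy) of turning a pixel-by-pixel uncertainty map into candidate edge-case segments encoded by polygons; the thresholds and the box-shaped polygon are simplifying assumptions:

import numpy as np
from scipy import ndimage

def extract_edge_case_segments(uncertainty_map, threshold=0.7, min_pixels=50):
    """Threshold the per-pixel uncertainty map, label connected regions and
    return a bounding polygon (here simply the box corners) per region.
    Threshold and minimum region size are assumptions."""
    mask = uncertainty_map > threshold
    labelled, num = ndimage.label(mask)
    segments = []
    for region_id in range(1, num + 1):
        ys, xs = np.where(labelled == region_id)
        if ys.size < min_pixels:
            continue
        y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
        polygon = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
        segments.append({"polygon": polygon, "pixels": int(ys.size)})
    return segments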
Fig. 43 illustrates a situation monitor 154 with a Poly-YOLO based head (https://arxiv.org/abs/2005.13243).
The situation monitor 154 implements an automatic encoder model 160 to provide reconstruction losses.
The model preferably implements an encoder-decoder architecture and receives images from a video input stream.
The semantic segmentation model uses the softmax values of the pixel-wise prediction to generate a feature map for each class.
Encoder 122 includes a downsampling module 126 for downsampling a matrix of input image pixels.
Encoder 122 also includes a feature extraction module 128 for feature extraction.
The encoder preferably provides parallel instantiation of context and/or spatial details (see dashed lines). This means that information about the context or spatial details is fed to the feature fusion module 130 to facilitate feature fusion.
In the case of multiple inputs from different layers of the encoder 122, the decoder 124 preferably includes a feature fusion module 130.
The feature fusion module 130 preferably receives one input directly from the last layer of the downsampling module 126, and/or preferably one input from the last layer of the feature extraction module 128, and/or one or more inputs (instantiations of context and spatial details) from the inner layers of the feature extraction module 128.
The decoder 124 preferably includes a classification module 132 for reconstruction of the input image and calculation of pixel-by-pixel reconstruction loss.
Fig. 44 illustrates an automatic encoder model 160. All dashed connections are optional. Optional connectors 7, 8, 9 are added for parallel instantiation of context and spatial details. The depth estimation module 132 is designed independently, symmetrically to the classification module. An optional time shift module 140 is added to detect actions over time by providing simultaneous inputs (10) of m different points in time. Further preferred features are variational reasoning (13) with partial model sampling (12), with continuous video sampling (11), and with a Lipschitz constraint (14) for spectral normalization of the softmax values.
Another situation monitor 152 implements a model that matches pixel-by-pixel reconstruction loss of the automatic encoder model with segments.
The other situation monitor 152 includes two parallel validity monitors 156 and 158. One validity monitor is a Bayesian validity monitor 158 (see the sketch after this list) that includes:
-a dropout layer inserted in the convolution sub-module, the dropout layer being used for variational reasoning to obtain uncertainty by means of sampling with dropout enabled; and
-a sampler for sampling the model with partial model sampling and/or with continuous video sampling.
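A minimal sketch (Python/PyTorch) of this Bayesian validity monitor idea: a convolution sub-module with a dropout layer and Monte Carlo sampling with dropout kept active; layer sizes, dropout rate and sample count are assumptions:

import torch
import torch.nn as nn

class DropoutConvBlock(nn.Module):
    """Convolution sub-module with a dropout layer (cf. fig. 46); dropout is kept
    active at inference time so that repeated forward passes sample the model."""
    def __init__(self, in_ch, out_ch, p=0.2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(), nn.Dropout2d(p))

    def forward(self, x):
        return self.block(x)

def mc_dropout_uncertainty(model, image, num_samples=8):
    """Monte Carlo dropout: sample the model several times with dropout enabled
    and use the per-pixel variance of the softmax output as uncertainty."""
    model.train()                        # keeps the dropout layers active
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(image), dim=1) for _ in range(num_samples)])
    return probs.var(dim=0).mean(dim=1)  # N x H x W per-pixel uncertainty

model = nn.Sequential(DropoutConvBlock(3, 16), nn.Conv2d(16, 11, 1))
uncertainty = mc_dropout_uncertainty(model, torch.randn(1, 3, 64, 64))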
Matching of the found edge conditions to the segments can be achieved by means of a probability filter, e.g., a Kalman filter.
Fig. 45 illustrates a convolution sub-module.
Fig. 46 illustrates a convolution sub-module with dropout. The module may be based on https://arxiv.org/abs/1703.04977 and https://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf.
Fig. 47 illustrates a Bayesian sampling module.
The other validity monitor 156 may be a Lipschitz validity monitor with a Lipschitz constraint inserted into the convolution sub-module, the Lipschitz validity monitor normalizing the softmax so that it correlates with uncertainty; optionally, the softmax value and/or the softmax variance is provided to the situation monitor.
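As an illustration of the Lipschitz constraint mentioned above, a minimal sketch (Python/PyTorch) using spectral normalisation of a convolution layer; treating spectral normalisation as an approximate Lipschitz/bi-Lipschitz constraint is an assumption made for this sketch:

import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class LipschitzConvBlock(nn.Module):
    """Sketch of a convolution sub-module with an (approximate) Lipschitz
    constraint: spectral normalisation bounds the layer's Lipschitz constant,
    which makes distances in feature space, and thus softmax values and their
    variance, more meaningful as uncertainty signals."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = spectral_norm(nn.Conv2d(in_ch, out_ch, 3, padding=1))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv(x))

block = LipschitzConvBlock(3, 16)
features = block(torch.randn(1, 3, 32, 32))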
Fig. 48 illustrates a Lipschitz sub-module. The module may be based on https://arxiv.org/abs/2102.11582 and https://arxiv.org/abs/2106.02469.
The uncertainty may be provided to the probability filter.
Fig. 49 illustrates the integration of a situation monitor with a Kalman filter.
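A minimal sketch (Python) of one scalar Kalman update in which the measurement variance is derived from an uncertainty score, so that the Kalman gain reflects the detector's confidence; the scaling of the score to a variance is an assumption:

def kalman_update(estimate, estimate_var, measurement, measurement_var):
    """One scalar Kalman update step: the measurement variance is taken from the
    uncertainty score delivered by the situation monitor, so uncertain
    measurements pull the estimate less strongly (smaller Kalman gain)."""
    gain = estimate_var / (estimate_var + measurement_var)
    new_estimate = estimate + gain * (measurement - estimate)
    new_var = (1.0 - gain) * estimate_var
    return new_estimate, new_var

# toy example: fuse a distance estimate with a measurement whose variance is
# derived from a per-segment uncertainty score (scaling factor is an assumption)
uncertainty_score = 0.8
estimate, var = kalman_update(10.0, 1.0, 8.5, measurement_var=0.1 + 5.0 * uncertainty_score)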
Preferably, in the case of uncertainty, a similar class is predicted.
Fig. 50 illustrates an example of similarity prediction.
Preferably, all models used for extracting features, or a subset thereof, are integrated in a single (multi-head) model consisting of one or more shared encoders. The multi-head model may include a segmentation head and/or a depth estimation head and/or an automatic encoder head and/or a video action recognition head and/or an action forecast head.
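A strongly simplified sketch (Python/PyTorch) of this multi-head idea with one shared encoder; head names, channel sizes and the omission of the decoders' upsampling stages are assumptions:

import torch
import torch.nn as nn

class MultiHeadPerceptionModel(nn.Module):
    """Sketch of the multi-head idea: one shared encoder feeds several decoder
    heads (segmentation, depth, auto-encoder, ...). Head names and channel
    sizes are illustrative assumptions; full decoders are omitted."""
    def __init__(self, num_classes=11):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.heads = nn.ModuleDict({
            "segmentation": nn.Conv2d(64, num_classes, 1),
            "depth": nn.Conv2d(64, 1, 1),
            "autoencoder": nn.Conv2d(64, 3, 1),
        })

    def forward(self, x):
        features = self.encoder(x)                        # shared encoder
        return {name: head(features) for name, head in self.heads.items()}

outputs = MultiHeadPerceptionModel()(torch.randn(1, 3, 128, 128))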
FIG. 51 is an overview of a multi-functional multi-head model.
Fig. 52 illustrates an encoder of the multi-functional model. All dashed connections are optional. A time shift (10), and variational reasoning (13) with partial model sampling (12), with continuous video sampling (11), and with a Lipschitz constraint (14) have been added, each of which is optional.
Fig. 53 illustrates a segmentation head (i.e., classification module) of the multi-functional model. All dashed connections are optional. Time shift, Lipschitz constraints and dropout for variational reasoning have been added to all layers of the decoder, and a Lipschitz constraint to all layers of the situation monitor (each optional). The situation monitor processes as input the Bayesian variance from variational reasoning (optional), and/or the softmax value normalized by the bi-Lipschitz constraint (optional), and/or the softmax variance normalized by the bi-Lipschitz constraint (optional).
Fig. 54 illustrates a video action recognition head (i.e., classification module) of the multi-functional model. All dashed connections are optional. Time shift, Lipschitz constraints and dropout for variational reasoning have been added to all layers (each optional).
Fig. 55 illustrates a depth estimation head (i.e., classification module) of the multi-functional model. All dashed connections are optional. Time shift, Lipschitz constraints and dropout for variational reasoning have been added to all layers of the decoder, and a Lipschitz constraint to all layers of the situation monitor (each optional). The situation monitor processes as input the Bayesian variance from variational reasoning (optional), and/or the softmax value normalized by the bi-Lipschitz constraint (optional), and/or the softmax variance normalized by the bi-Lipschitz constraint (optional).
Fig. 56 illustrates an automatic encoder head (i.e., classification module) of the multi-functional model. All dashed connections are optional. Lipschitz constraints are added to all layers of the decoder and all layers of the situation monitor (each optional). The situation monitor processes as input the pixel-by-pixel reconstruction loss from the auto-encoder head (optional), and/or the softmax value normalized by the bi-Lipschitz constraint (optional), and/or the softmax variance normalized by the bi-Lipschitz constraint (optional).
Fig. 57 illustrates an action forecast head (i.e., classification module) of the multi-functional model. All dashed connections are optional. Time shift, Lipschitz constraints and dropout for variational reasoning have been added to all layers (each optional).
Fig. 58 illustrates how the system may be implemented following the perception-planning-action concept. The sensing, planning and control subsystems should each have their own hardware. The sensing systems should in particular be integrated with their sensor hardware. The graph should be instantiated on the hardware of the planning subsystem.
These sensors preferably include six vision sensors for close range, each sensor having a 120 degree field of view with a 40 degree overlap and each sensor instantiating a separate perception subsystem; see fig. 59. The sensors preferably further comprise a vision sensor with a 60 degree field of view for medium range that instantiates a separate perception subsystem.
List of reference numerals
5. Optional connector
6. Optional connector
7. Optional connector
8. Optional connector
9. Optional connector
(10) Time shift
(11) Continuous video sampling
(12) Partial model sampling
(13) Variational reasoning
(14) Lipschitz constraint
10. Visual system
10' autonomous device
12. 12' video camera
14. Data processing unit
16. Memory device
18. Communication interface
20. Display unit
22. Output data interface
24. Sensor fusion system
40. 40' segmented neural network
42. Encoder section
44. Decoder section
46. Convolutional layer
48. Pooling layer
50. Feature map
52. Feature score map
54. Section map
54A, 54B section diagrams
56. Segment(s)
58. Segment(s)
60. 60' uncertainty detector
62. Generating neural network
62' Variational automatic encoder
64. Tensor library
66. Image pixel matrix
68. 68' tag
70. Loss function
72. 72' section map
74. 74' uncertainty score map
76. Loss function
78. Loss function
80. Loss function
82. Prediction of deep neural networks
90. Unknown object (case)
92. Known object (street sign)
100. Apparatus and method for controlling the operation of a device
102. Video camera
104. Video streaming
106. Neural engine
108. Object list
110. Output terminal
112. Video output
116. Audio output
120. Segmented neural network
122. Encoder with a plurality of sensors
124. Decoder
126. Downsampling module
128. Feature extraction module
130. Feature fusion module
140. Time shift module
150. Cascade monitoring concept
152. Condition monitor
154. Condition monitor
156. Effectiveness monitor
158. Effectiveness monitor
Bibliographic list
Smith, Lewis et al.: Can Convolutional ResNets Approximately Preserve Input Distances? A Frequency Analysis Perspective; 17 June 2021 (https://arxiv.org/abs/2106.02469)
Mukhoti, Jishnu et al.: Deterministic Neural Networks with Inductive Biases Capture Epistemic and Aleatoric Uncertainty; 8 June 2021 (https://arxiv.org/abs/2102.11582)
An, Shan et al.: Real-Time Monocular Human Depth Estimation and Segmentation on Embedded Systems; 24 August 2021
Lin, Ji et al.: TSM: Temporal Shift Module for Efficient Video Understanding; 22 August 2019 (https://arxiv.org/abs/1811.08383)
Hurtik, Petr et al.: Poly-YOLO: Higher Speed, More Precise Detection and Instance Segmentation for YOLOv3; 29 May 2020 (https://arxiv.org/abs/2005.13243)
Poudel, Rudra PK et al.: Fast-SCNN: Fast Semantic Segmentation Network; 12 February 2019 (https://arxiv.org/abs/1902.04502)
Gal, Yarin: Uncertainty in Deep Learning; September 2016 (http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf)
Kendall, Alex et al.: What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?; 5 October 2017 (https://arxiv.org/pdf/1511.02680v2.pdf)
Kim, Wonjik et al.: Unsupervised Learning of Image Segmentation Based on Differentiable Feature Clustering; 20 July 2020 (https://arxiv.org/pdf/2007.09990.pdf)
Ghahramani, Zoubin: Probabilistic Machine Learning and Artificial Intelligence; 28 May 2015 (https://www.repository.cam.ac.uk/bitstream/handle/1810/248538/Ghahramani%202015%20Nature.pdf)
Goodfellow, Ian et al.: Deep Learning, Chapter 6; 18 November 2016 (https://www.deeplearningbook.org/contents/mlp.html)
Goodfellow, Ian et al.: Deep Learning, Chapter 5; 18 November 2016 (https://www.deeplearningbook.org/contents/ml.html)
Goodfellow, Ian et al.: Deep Learning, Chapter 9; 18 November 2016 (https://www.deeplearningbook.org/contents/convnets.html)
Jadon, Shruti: An Overview of Deep Learning Architectures in Few-Shot Learning Domain; 19 August 2020 (https://arxiv.org/pdf/2008.06365.pdf)
Li, Shen et al.: PyTorch Distributed: Experiences on Accelerating Data Parallel Training; 28 June 2020 (https://arxiv.org/pdf/2006.15704.pdf)
Schnieders, Benjamin et al.: Fully Convolutional One-Shot Object Segmentation for Industrial Robotics; 2 March 2019 (https://arxiv.org/pdf/1903.00683.pdf)

Claims (17)

1. A method for generating a new object class, the method comprising the steps of:
-detecting a region in the uncertainty score map consisting of elements with high uncertainty scores, the region representing a first unknown object;
-generating a tentative new object class of said unknown object and automatically generating a tag; and
-detecting a further region in a further uncertainty score map consisting of elements with high uncertainty scores, the further region representing a further unknown object;
-determining a similarity between the first unknown object and the further unknown object;
-if the similarity exceeds a predetermined threshold, generating a new object class that is non-tentative.
2. The method of claim 1, wherein generating a new object class that is non-tentative comprises: single sample learning or small sample learning is performed using samples representing the first unknown object and the other unknown object.
3. A system comprising a main segmentation system and a separate autonomous device,
wherein the primary segmentation system comprises a primary perception system comprising a segmentation neural network (40) configured and trained for segmentation of an input image pixel matrix (66) to generate a segment map (54; 72) composed of elements corresponding to pixels in the input image pixel matrix (66), each element in the segment map (54, 72) being assigned to one of a plurality of object classes, the segmentation neural network (40) being trained for the plurality of object classes by class prediction, elements assigned to the same object class forming segments in the segment map (54, 72), and
Wherein the autonomous device (10 ') comprises a sensor for generating an input image pixel matrix of matrix elements, a segmented neural network, and an uncertainty detector (60 ') configured to generate an uncertainty score map (74) of matrix elements corresponding to pixels in the input image pixel matrix (66), each matrix element in the uncertainty map having an uncertainty score determined by the uncertainty detector (60 ') and reflecting the amount of uncertainty involved in class prediction for the corresponding element in the segment map (72), and
wherein the autonomous device (10 ') further comprises signaling means configured to generate a user perceivable signal in case the uncertainty map generated by the uncertainty detector (60 ') of the autonomous device (10 ') comprises an area consisting of matrix elements exhibiting an uncertainty score above a threshold value and thereby representing edge conditions of the object classification and image segmentation.
4. An autonomous device (10') comprising a sensor for generating an input image pixel matrix of matrix elements, a segmented neural network, and an uncertainty detector (60') configured to generate an uncertainty score map (74) of matrix elements corresponding to pixels in the input image pixel matrix (66), each matrix element in the uncertainty score map having an uncertainty score determined by the uncertainty detector (60') and reflecting the amount of uncertainty involved in the class prediction for the corresponding element in the segment map (72), and
the autonomous device (10 ') further comprises signaling means configured to generate a user perceivable signal in case the uncertainty map generated by the uncertainty detector (60 ') of the autonomous device (10 ') comprises an area consisting of matrix elements exhibiting an uncertainty score above a threshold value and thereby representing edge conditions of the object classification and image segmentation.
5. The autonomous device of claim 4, further comprising a new object detector operatively connected to the uncertainty detector of the autonomous device and configured to find regions in the uncertainty score map consisting of elements having high uncertainty scores and to generate new object classes for such found regions.
6. The autonomous device of claim 4 or 5, wherein the autonomous device is configured to exchange tagged data sets comprising segments assigned to recently generated object classes with other autonomous devices, thereby increasing the number of known object classes available for semantic segmentation by the autonomous device.
7. The autonomous device of at least one of claims 4 to 6, wherein the autonomous device is configured to record a response to a warning signal transmitted by the autonomous device.
8. The autonomous device of at least one of claims 4 to 7, wherein said autonomous device is configured to learn to discern unknown objects by automatically generating new object classes.
9. The autonomous device of claim 8, wherein the autonomous device is configured to automatically generate tags for new object classes based on a user's reaction in the context of encountering an object that is not yet known.
10. The autonomous device of at least one of claims 4 to 9, wherein said autonomous device is a pocket-sized mobile device that can easily be carried by a user and easily mounted to, for example, a windshield of a vehicle.
11. The autonomous device of at least one of claims 4 to 10, further comprising a segmentation neural network implementing a semantic segmentation model, the segmentation neural network preferably being continuously trained with the output of the new object detector as input, thereby enabling the neural network to predict new objects that have previously been encountered.
12. A system, the system comprising: an input for a sequence of input image pixel matrices (66) obtained from an input video stream; and a perception system comprising a segmentation neural network (40) configured and trained for segmentation of an input image pixel matrix (66) to generate a segment map (54, 72) composed of elements corresponding to pixels in the input image pixel matrix (66), each element in the segment map (54, 72) being assigned, by class prediction, to one of a plurality of object classes for which the segmentation neural network (40) is trained, elements assigned to the same object class forming segments in the segment map (54, 72),
the system being configured to generate a list of objects from the segments in the segment map, the list of objects being encoded by polygons around the respective objects and annotated with the object class, the polygons characterizing the position of the corresponding segments relative to the input image pixel matrix.
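One possible way to derive the polygon-encoded object list of claim 12 from a segment map is sketched below using OpenCV contour tracing; the class-name mapping and the polygon simplification tolerance are assumptions made for this example only.

```python
# Illustrative sketch only; class_names and epsilon are assumptions.
import cv2
import numpy as np


def segment_map_to_object_list(segment_map, class_names, epsilon=2.0):
    """Return a list of {"class", "polygon"} entries; polygon points are pixel
    coordinates relative to the input image pixel matrix."""
    objects = []
    for class_id in np.unique(segment_map):
        mask = (segment_map == class_id).astype(np.uint8)
        # OpenCV 4.x signature: returns (contours, hierarchy)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        for contour in contours:
            polygon = cv2.approxPolyDP(contour, epsilon, True)
            objects.append({
                "class": class_names.get(int(class_id), f"class_{class_id}"),
                "polygon": polygon.reshape(-1, 2).tolist(),  # [[x, y], ...]
            })
    return objects
```

Such a list can be serialized (for example as JSON) and forwarded over the interface mentioned in claim 13.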
13. The system of claim 12, wherein the system comprises an interface for forwarding the list of objects to another device.
14. The system of claim 12 or 13, wherein the perception system implements a video motion recognition model that combines a multi-path segmentation encoder (122) with a Poly-Yolo head (132), the video motion recognition model comprising a time shift module (140).
15. The system of claim 14, wherein the perception system implements a video motion recognition model that includes a feature fusion module (130) whose inputs come from a layer of a downsampling module (126) of the encoder (122) of the video motion recognition model and/or a feature extraction module (128).
16. The system of at least one of claims 12 to 15, configured to forecast actions performed by a person or a vehicle, to receive an input video stream from a camera, and to send a list of objects to a smartphone or other device over USB, Wi-Fi, Bluetooth or Ethernet, the list of objects being encoded with polygons around the objects and annotated with classes of expected actions.
17. The system of claim 16, implementing an action prediction model that combines a multi-path segmentation encoder with a Poly-Yolo classification module of the decoder, and also implementing time shifting and/or variational inference with partial model sampling and/or with continuous video sampling and/or Lipschitz constraints.
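Claims 14 to 17 refer to a time shift module inside the video model. The sketch below shows a generic temporal-shift operation of the kind used in shift-based video models, written in PyTorch; the tensor layout and the fraction of shifted channels are assumptions, and the patent's multi-path encoder and Poly-Yolo head are not reproduced here.

```python
# Illustrative sketch of a generic temporal shift; fold_div and layout are assumptions.
import torch


def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """x has shape (batch, time, channels, height, width). A small slice of the
    channels is shifted one step towards earlier frames, another slice towards
    later frames, so per-frame 2D convolutions can mix neighbouring frames."""
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # shift backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels unchanged
    return out


# Example: a batch of 2 clips, 4 frames each, 32 channels at 16x16 resolution.
shifted = temporal_shift(torch.randn(2, 4, 32, 16, 16))
```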
CN202280050415.2A 2021-05-17 2022-05-17 System for detecting and managing uncertainty, new object detection and situation expectations in a perception system Pending CN117716395A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
EP21174146.7 2021-05-17
EP21204038.0 2021-10-21
EP21204038 2021-10-21
PCT/EP2022/063359 WO2022243337A2 (en) 2021-05-17 2022-05-17 System for detection and management of uncertainty in perception systems, for new object detection and for situation anticipation

Publications (1)

Publication Number Publication Date
CN117716395A true CN117716395A (en) 2024-03-15

Family

ID=78371875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280050415.2A Pending CN117716395A (en) 2021-05-17 2022-05-17 System for detecting and managing uncertainty, new object detection and situation expectations in a perception system

Country Status (1)

Country Link
CN (1) CN117716395A (en)

Similar Documents

Publication Publication Date Title
Bachute et al. Autonomous driving architectures: insights of machine learning and deep learning algorithms
Bila et al. Vehicles of the future: A survey of research on safety issues
US10657391B2 (en) Systems and methods for image-based free space detection
US10740658B2 (en) Object recognition and classification using multiple sensor modalities
JP7199545B2 (en) A Multi-view System and Method for Action Policy Selection by Autonomous Agents
US20230386167A1 (en) System for detection and management of uncertainty in perception systems
US10964033B2 (en) Decoupled motion models for object tracking
Itkina et al. Dynamic environment prediction in urban scenes using recurrent representation learning
Dequaire et al. Deep tracking on the move: Learning to track the world from a moving vehicle using recurrent neural networks
Kolekar et al. Behavior prediction of traffic actors for intelligent vehicle using artificial intelligence techniques: A review
US20230230484A1 (en) Methods for spatio-temporal scene-graph embedding for autonomous vehicle applications
CN111532225A (en) Vehicle capsule network
Iqbal et al. Autonomous Parking-Lots Detection with Multi-Sensor Data Fusion Using Machine Deep Learning Techniques.
Kuhn et al. Introspective failure prediction for autonomous driving using late fusion of state and camera information
EP4064127A1 (en) Methods and electronic devices for detecting objects in surroundings of a self-driving car
Liu et al. Deep transfer learning for intelligent vehicle perception: A survey
EP4145398A1 (en) Systems and methods for vehicle camera obstruction detection
US20230260259A1 (en) Method and device for training a neural network
CN117716395A (en) System for detecting and managing uncertainty, new object detection and situation expectations in a perception system
Ding et al. EgoSpeed-net: Forecasting speed-control in driver behavior from egocentric video data
EP4341913A2 (en) System for detection and management of uncertainty in perception systems, for new object detection and for situation anticipation
EP4050510A1 (en) Object information calculation method and system
Ke Real-time video analytics empowered by machine learning and edge computing for smart transportation applications
US20230227073A1 (en) Vehicular autonomous control system based on learned and predicted vehicle motion
US20230267749A1 (en) System and method of segmenting free space based on electromagnetic waves

Legal Events

Date Code Title Description
PB01 Publication