WO2019197021A1

WO2019197021A1 - Device and method for instance-level segmentation of an image

Info

Publication number: WO2019197021A1
Application number: PCT/EP2018/059130
Authority: WO
Inventors: Ibrahim HALFAOUI; Onay URFALIOGLU; Fahd BOUZARAA
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2018-04-10
Filing date: 2018-04-10
Publication date: 2019-10-17
Also published as: CN111886600A

Abstract

A device and method for performing instance-level semantic segmentation of an image are proposed. Thereby, a class-level semantic segmentation is combined with an instance-level boundary detection, and a modified SLIC algorithm calculates a plurality of superpixels as instance-level segments. The device performs the class-level semantic segmentation of the image to obtain one or more class-level segments, each class-level segment having an object class associated with it, and it performs an instance-level semantic boundary detection on the image to obtain one or more instance-level boundaries and for each instance-level boundary an instance-level center point. The device estimates, for each class-level segment, a number of object instances in the class-level segment based on the number of instance-level center points located in the class-level segment. For each class- level segment having an estimated number of object instances greater than one, the device performs the modified SLIC algorithm based on the one or more instance-level boundaries to obtain a plurality of superpixels as instance-level segments.

Description

DEVICE AND METHOD FOR INSTANCE-LEVEL SEGMENTATION OF AN

IMAGE

TECHNICAL FIELD

The present invention relates to a device for instance-level segmentation of an image, and to a corresponding instance-level semantic segmentation method. Instance-level semantic segmentation can be used to segment and select every semantically relevant object inside an image of a scene. This implies that objects belonging to the same class are segmented and identified separately.

BACKGROUND

The recent rise of interest around artificial intelligence, such as autonomous driving or robot navigation, resulted in new research topics which are crucial to these technologies. For instance, understanding and analyzing the scene surrounding a robot or an autonomous vehicle is a key component for the related application. This task involves the ability to detect and extract semantically relevant objects inside a scene, e.g., in an image of the scene. This procedure is known as“Image Segmentation”.

A notable approach for image segmentation is the Simple Linear Iterative Clustering (SLIC) algorithm. In this algorithm an image of the scene is clustered into superpixels, wherein a superpixel is a group of semantically coherent pixels. This means that pixels belonging to a specific superpixel might belong to the same scene object. The smallest possible superpixel consists of a single pixel, and the largest possible superpixel is composed of all pixels of the image. The SLIC approach is based on a modified version of the k-means clustering algorithm, so that superpixel boundaries overlap with boundaries/edges of semantically relevant objects. Few modified versions of the SLIC algorithm have been proposed. For instance, in one proposal the SLIC algorithm is modified, in order to be utilized with a Nystrom based spectral clustering method for segmenting image regions instead of pixels. In another proposal, two different modified SLIC versions are introduced: the first one is a SLIC version suitable to high dimensional cases, and the second one is a version with an enhanced distance measure that is more complex and robust, called the fractional distance.

The SNIC algorithm is a further modified version of the SLIC algorithm, and was introduced as a non-iterative alternative that enforces connectivity from the start, requires less memory and is faster and simpler than the original version. Recently, an extended version of the SLIC algorithm, called BSLIC algorithm, was proposed and aims at incorporating boundary terms coming from usual edge detection into the distance measure.

In addition to image segmentation, which involves the detection and extraction of semantically relevant objects in a scene, there is also a need for“identifying” these objects, for example, by assigning them to a correct class (e.g., cars, sky, vegetation...). This procedure is called“Semantic Segmentation”.

Semantic segmentation is a well-researched topic, which falls under the scope of classification tasks. This explains why semantic segmentation strongly benefited from the recent surge of deep learning approaches such as Deep Neural Networks (DNN), Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). The majority of semantic segmentation algorithms aim at extracting semantically relevant image segments, and assigning each of these segments to the corresponding class. This means that the semantic segmentation algorithm disposes of a pre-defmed number of known classes. Its task is therefore to classify each object in the scene.

For example, in the context of a driving scenario, the object classes represent the scene objects and parts relevant to the application. These classes may include classes such as “cars”,“vegetation”,“pedestrians”,“road-signs”, etc. Accordingly, all cars in the scene will be assigned to the corresponding class (“car”).

The next level of scene understanding involves the ability of an autonomous vehicle or robot to perform the semantic segmentation of the current scene on an instance-level, i.e. to perform“Instance-Level Semantic Segmentation”. This means that in addition to assigning objects to the correct class, the autonomous system should be able to identify and label every instance of an object of a given class. For example, in addition to segmenting areas in the image belonging to the class“cars”, the autonomous system in this case is also able to identify every car in the image, and to assign a unique label to each car.

There are several approaches for instance-level semantic segmentation, mostly based on deep learning (CNN). These approaches typically propose candidate objects, usually as bounding boxes, and directly predict a binary mask within each such candidate object proposal. As a consequence, these approaches cannot recover from errors in the object candidate generation process, such as too small or shifted boxes. Further, the approaches still lack a proper integration of semantic information, and suffer of high complexity.

An additional application related to instance-level semantic segmentation is the detection of the boundaries (i.e. edges) of semantically relevant objects in the image of a scene. This task is also known as semantic edge detection, boundary detection or contour detection. Accordingly, the outcome of a semantic boundaries detection system will differ from a traditional edge detection algorithm, as the goal in this case is to highlight and detect the boundaries of semantically relevant objects and not all edges in the scene.

Similar to semantic segmentation and instance-level semantic segmentation, the detection of boundaries of semantically relevant objects can be performed either on a class-level (non-instance-level boundaries detection) or on an instance-level (instance-level boundaries detection).

Boundaries detection is a challenging task, especially in the context of systems operating on an instance level. In this case, the enabling algorithm is required to be able to identify boundaries separating objects from the same class in a specific (multi-instance) segment and edges corresponding to the texture in that segment.

In summary, up to date there is no instance-level semantic segmentation approach which can recover well from errors in the object candidate generation process, and which further allows a proper integration of semantic information and is additionally of low complexity. SUMMARY

It was realized by the inventors that instance-level semantic segmentation can be done, at least in principle, by combining semantic segmentation and instance-level boundaries detection in a single framework. This involves separating instances of the same object using the semantic boundaries. It also requires a highly accurate semantic segmentation, and at the same time an accurate boundaries detection algorithm. Consequently, inaccuracy in terms of semantic segmentation and/or boundaries detection could result in a significant decrease of the quality of the targeted instance-level semantic segmentation.

Furthermore, instance-level boundaries detection is a topic which is less researched than semantic segmentation (class- and instance level). Existing approaches mostly rely on deep learning frameworks. This implies that most of these approaches suffer from performance instability, especially when presented with images of scenes that are different from the ones used to train the corresponding model.

In view of the above-mentioned challenges and disadvantages, the invention aims to improve the existing approaches for instance-level semantic segmentation. The invention has particularly the objective to provide a device and method for performing an improved instance-level semantic segmentation of an image. The invention aims in particular for a solution allowing a better recovery from errors. The invention also desires a solution enabling a better integration of semantic information. Finally, the invention strives for low complexity. The objective of the invention is achieved by the solution provided in the enclosed independent claims. Advantageous implementations of the invention are further defined in the dependent claims.

For describing the solution of the invention, the following terms are used in this document, and are to be understood as explained below.

“Image” - A visual representation of a real world or synthetic scene by a digital camera. Also referred to as a picture. “Pixel” - The smallest addressable picture/image element.

“Scene” - The surrounding environment with respect to a reference. For example, the scene of a camera is the part of the environment, which is visible from the camera.

“Texture” - An area within an image, which depicts content having a significant variation in the color intensities.

“Edges” - Areas in an image where the gradient (brightness level) changes abruptly.

“Superpixel” - A set of image pixels (composed of one or more pixels), which are ideally semantically similar (e.g., have similar colors).

“Class” - A defined semantic group, which incorporates several objects having similar semantic features. For example, trees, flowers and other types of plants all belong to the class‘vegetation’.

“Segment” - A set of image pixels (composed of one or more pixels), which ideally belong to a common semantic class.

“Instance” - An object which belongs to a known class. For example, a tree in a scene is an instance, which belongs to the‘vegetation’ class.

“Multi-Instance Segment” - A segment containing multiple instances (objects) of the same class it was assigned to.

“Label” - An identifier (e.g., an integer) to determine the class type of an item/entity.

“Image Segmentation” - A method to partition the image into semantically coherent segments, without any prior knowledge on the class to which these segments belong to. “Semantic Segmentation” - A method to segment an image into different regions according to a semantic belonging. For example, pixels depicting a car are all in red color, pixels depicting the road are all in blue color, etc.

“Instance-level Semantic Segmentation” - A method to segment an image into different regions and object instances according to a semantic belonging. Single objects are identified and are separable from each other.

“Semantic Boundaries” - Edges separating semantically relevant classes. This implies that image texture and details, which are not part of the boundaries between classes, are not considered semantic boundaries.

“Instance-level Semantic Boundaries” - Semantic boundaries applied on an instance level (edges separating classes from each other, as well as instances inside a multi-instance segment).

“Machine Learning” - Field of research which focuses on analyzing and learning from input data for the purpose of building a model capable of making predictions.

“Artificial Neural Network” - A machine learning subfield in the context of deep learning, motivated by biological neural networks. Artificial networks aim at estimating functions with large number of inputs, by adaptively learning a set of connection weights.

“Convolution” - Mathematical operation which computes the amount of overlap of two functions, one of them being reversed and shifted, using integral computation.

“Convolutional Neural Network” - An artificial neural network which contains at least one convolutional layer in its architecture.

A first aspect of the invention provides a device for instance-level semantic segmentation of an image, the device being configured to perform a class-level semantic segmentation of the image to obtain one or more class-level segments, each class-level segment having an object class associated with it, perform an instance-level semantic boundary detection on the image to obtain one or more instance-level boundaries and, for each instance-level boundary, an instance-level center point, estimate, for each class-level segment, a number of object instances in the class-level segment based on the number of instance-level center points located in the class-level segment, and perform, for each class-level segment having an estimated number of object instances greater than one, a modified SLIC algorithm based on the one or more instance-level boundaries to obtain a plurality of superpixels as instance-level segments.

Each center point may, for example, be the centroid of the instance-level segment that is delimited by the respective instance-level boundary.

The device allows performing instance-level semantic segmentation of an image that is both accurate and of low computational complexity. In particular, the combination of class- level semantic segmentation and instance-level boundary detection leads to results with low computational complexity. The subsequently performed modified SLIC algorithm allows refining these results into precise instance-level semantic segments with relatively low additional complexity. Further, the instance-level segmentation procedure implemented by the device can recover well from errors in the class-level semantic segmentation and/or in the instance-level boundary detection, due to the modified SLIC algorithm. Further, good integration of semantic information is achieved.

In an implementation form of the first aspect, the device is configured to perform, for a given class-level segment, the modified SLIC algorithm based on the estimated number of object instances in the segment, so as to initialize a number of search regions, each around a center pixel, corresponding to the estimated number of object instances in the segment.

This makes the instance-level semantic segmentation performed by the device even more efficient, and leads to more precise results.

In a further implementation form of the first aspect, the modified SLIC algorithm is based on the SLIC algorithm modified to consider a number of boundary pixels, calculated from the one or more instance-level boundaries, in each search region. By taking into account the boundaries by means of the boundary pixels, the modified SLIC algorithm yields superpixels more precisely defining instance-level segments.

In a further implementation form of the first aspect, the device is configured to assign, by performing the modified SLIC algorithm, a search pixel to a superpixel whose search region includes the smallest number of boundary pixels, calculated from the one or more instance-level boundaries separating the search pixel from the center pixel of the search region.

The fewer boundary pixels separate the search pixel from the center pixel, the higher the probability that the search pixel belongs to the same object instance as the center pixel. Accordingly, more precise instance-level semantic segments are obtained.

In a further implementation form of the first aspect, the device is configured to compute, by performing the modified SLIC algorithm, a distance measure for a search pixel in a search region with respect to a center pixel of the search region, wherein the distance measure is penalized according to the number of boundary pixels, calculated from the one or more instance-level boundaries, in the search region.

In a further implementation form of the first aspect, the device is configured to compute the distance measure D according to

wherein dc represents a color distance in the CIELAB domain of the search pixel to the center pixel, ds represents a distance measure in the spatial domain of the search pixel to the center pixel, m is a first weighting parameter, m is the number of boundary pixels in the search region, 4 S² is the total number of pixels in the search region, and oc is a second weighting parameter.

The distance measure computed in the modified SLIC algorithm yields accurate results, while adding little computational complexity. In a further implementation form of the first aspect, the device comprises a CNN for performing the instance-level semantic boundary detection.

This provides a particular efficient implementation of the instance-level semantic boundary detection.

In a further implementation form of the first aspect, the device comprises a cascade of CNN subnetworks and configured to operate a first subnetwork of the cascade to obtain the one or more class-level boundaries, and operate a second subnetwork of the cascade to obtain, for each of the one or more class-level boundaries, one or more instance-level boundaries based on the class-level boundary obtained by the first subnetwork.

In a further implementation form of the first aspect, the device is further configured to operate the second subnetwork of the cascade to obtain the one or more instance-level center points based on the one or more instance-level boundaries obtained by the second subnetwork.

The cascade of CNN subnetworks delivers accurate results with high efficiency, and benefits from deep learning.

In a further implementation form of the first aspect, the device is configured to, for estimating the number of object instances in each class-level segment, determine whether the class-level segment contains a single object instance or multiple object instances, and estimate, for a class-level segment containing multiple object instances, the number of object instances based on the one or more instance-level center points.

This allows the device to perform the modified SLIC algorithm even more efficiently.

In a further implementation form of the first aspect, the device comprises a CNN for performing the class-level semantic segmentation of the image.

A second aspect of the invention provides a method for instance-level semantic segmentation of an image, the method comprising performing a class-level semantic segmentation of the image to obtain one or more class-level segments, each class-level segment having an object class associated with it, performing an instance-level semantic boundary detection on the image to obtain one or more instance-level boundaries and, for each instance-level boundary, an instance-level center point, estimating, for each class- level segment, a number of object instances in the class-level segment, based on the number of instance-level center points located in the class-level segment, and performing, for each class-level segment having an estimated number of object instances greater than one, a modified Simple Linear Iterative Clustering, SLIC, algorithm based on the one or more instance-level boundaries to obtain a plurality of superpixels as instance-level segments.

In an implementation form of the second aspect, the method comprises performing, for a given class-level segment, the modified SLIC algorithm based on the estimated number of object instances in the segment, so as to initialize a number of search regions, each around a center pixel, corresponding to the estimated number of object instances in the segment.

In a further implementation form of the second aspect, the modified SLIC algorithm is based on the SLIC algorithm modified to consider a number of boundary pixels, calculated from the one or more instance-level boundaries, in each search region.

In a further implementation form of the second aspect, the method comprises assigning, by performing the modified SLIC algorithm, a search pixel to a superpixel whose search region includes the smallest number of boundary pixels, calculated from the one or more instance-level boundaries separating the search pixel from the center pixel of the search region.

In a further implementation form of the second aspect, the method comprises computing, by performing the modified SLIC algorithm, a distance measure for a search pixel in a search region with respect to a center pixel of the search region, wherein the distance measure is penalized according to the number of boundary pixels, calculated from the one or more instance-level boundaries, in the search region.

In a further implementation form of the second aspect, the method comprises computing the distance measure D according to

In a further implementation form of the second aspect, the method comprises performing the instance-level semantic boundary detection by means of a CNN.

In a further implementation form of the second aspect, the method comprises operating a first subnetwork of a cascade of CNN subnetworks to obtain the one or more class-level boundaries, and operating a second subnetwork of the cascade to obtain, for each of the one or more class-level boundaries, one or more instance-level boundaries based on the class-level boundary obtained by the first subnetwork.

In a further implementation form of the second aspect, the method comprises operating the second subnetwork of the cascade to obtain the one or more instance-level center points based on the one or more instance-level boundaries obtained by the second subnetwork.

In a further implementation form of the second aspect, the method comprises, for estimating the number of object instances in each class-level segment, determining whether the class-level segment contains a single object instance or multiple object instances, and estimating, for a class-level segment containing multiple object instances, the number of object instances based on the one or more instance-level center points.

In a further implementation form of the second aspect, the method comprises performing the class-level semantic segmentation of the image with a CNN. The method of the second aspect and its implementation forms achieve the same advantages and effects as the device of the first aspect and its respective implementation forms.

A third aspect of the invention provides a computer program product comprising a program code for controlling the device according to the first aspect or any of its implementation forms, or for carrying out, when implemented on a processor, the method according to the second aspect or its implementation forms.

Accordingly, the computer program product of the third aspect is able to deliver the same advantages and effects as described above for the device of the first aspect and the method of the second aspect, respectively.

It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF THE DRAWINGS

The above described aspects and implementation forms of the invention will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which

FIG. 1 shows a device according to an embodiment of the invention. FIG. 2 shows a device according to an embodiment of the invention.

FIG. 3 shows a CNN architecture for a device according to an embodiment of the invention.

FIG. 4 shows an exemplary image, in which two superpixels and search windows are illustrated. One search window corresponding to a superpixel includes more boundary pixels in comparison to the other search window.

FIG. 5 shows a method according to an embodiment of the invention.

FIG. 6 and FIG. 7 illustrate an example of class-level semantic boundaries (see FIG. 6) and instance-level semantic boundaries (see FIG. 7) in an image. DETAILED DESCRIPTION OF THE EMBODIMENTS

FIG. 1 shows a device 100 according to an embodiment of the invention. The device 100 is configured to perform instance-level semantic segmentation of an image 101. To this end, the device 100 may comprise processing circuitry, like at least one processor, and/or may comprise one or more CNNs and/or subnetworks, in order to perform at least one of the following functions.

The device 100 is configured to perform a class-level semantic segmentation 103 (schematically indicated by a box in FIG. 1) of the image 101, in order to obtain one or more class-level segments 106. Each class-level segment 106 has an object class associated with it.

The device 100 is also configured to perform an instance-level semantic boundary detection 102 (schematically indicated by a box in FIG. 1) on the image 101, in order to obtain one or more instance-level boundaries 108, and to obtain for each instance-level boundary 108 an instance-level center point 107, e.g., a centroid. The instance-level semantic boundary detection 102 thus yields all estimated of instance-level center points 107 in the image 101. The device 100 may perform the class-level segmentation 103 and the instance-level boundary detection 102 either in parallel, or partly in parallel, or one after the other.

The device 100 is further configured to estimate 104 (schematically indicated by a box in FIG. 1), for each class-level segment 106 provided by the class-level semantic segmentation 103, a number of object instances in the class-level segment 106 based on the number of instance-level center points 107 located in the class-level segment 106. This yields, in particular, all class-level segments 109 that have an estimated number of object instances greater than one.

Further, the device 100 is configured to perform, for each class-level segment 109 having an estimated number of object instances greater than one, a modified SLIC algorithm 105 (schematically indicated by a box in FIG. 1) based on the one or more instance-level boundaries 108, in order to obtain a plurality of superpixels as instance-level segments 110. By obtaining these instance-level segments 110, the device 100 has successfully performed the instance-level semantic segmentation of the image 101.

In other words, the proposed technique may start with performing the class-level semantic segmentation 103 of the image 101 under consideration, and with estimating the instance level-semantic boundaries 108. The instance-level semantic boundary detection 102 provides an estimation of boundaries 108 of semantically relevant objects, such as cars, pedestrians and various other classes, on an instance level, as well as an estimate of the center points 107 of all object instances in the image 101.

The estimated semantic boundaries 108 may not be accurate enough to be directly used for estimating the desired instance-level semantic segmentation of the image 101. To deal with this limitation, the technique makes advantage of the modified SLIC algorithm 105. Firstly, the device 100 is to this end configured to use the estimated center points 107 (e.g., 2D centroids) of all instances, in order to decide for each class-level segment 106 in the image 101 whether the respective segment 106 contains multiple instances of the same class or not. To this end, the device 100 may be configured to estimate the number of instances inside each class-level segment 106. The modified SLIC algorithm 105 is then applied inside each class-level segment 109 containing several instances. The modification of the SLIC algorithm may specifically comprise integrating the information about the semantic boundaries 108 into a distance measure between each search pixel and the center of a superpixel under consideration (explained further below in detail). The modified SLIC algorithm 105 may also use the number 201 of instances (see FIG. 2) estimated previously as additional input. These steps may be done separately for each multi-instance class-level segment 109 inside the image 101.

FIG. 2 shows a device 100 according to an embodiment of the invention, which builds on the device 100 shown in FIG. 1. Same functions and computed/estimated elements are labelled with the same reference signs. The device 100 shown in FIG. 2 may notably be implemented in, or may be used for, an autonomous vehicle or robot.

The device 100 of FIG. 2 may be configured to capture (or receive from an external capturing device) one or more images 101 of the scene surrounding the autonomous vehicle or robot. The device 100 may perform its instance-level semantic segmentation on RGB and/or grayscale images 101. It is also possible that the device 100 benefits from different capturing setups, such as stereo or camera-arrays, but in the following the device 100 is described regarding a single texture image 101 as input.

In the following, different functional blocks of the device 100 are described. These functional blocks may be carried out in either a single hardware element, like a processing circuitry or a processor, or may be carried out in different hardware elements, like multiple processors and/or CN s.

In a functional block of the device 100, an instance-level estimation 102 of the object boundaries 108 is performed on the previously captured texture image 101. For this, a conventional approach may be used. For instance, an approach as described in“Kokkinos, I, Pushing the boundaries of boundary detection using deep learning, ICLR (2016)”, Xie, S., Tu, Z.,“ Holistically-nested edge detection”, ICCV (2015)”, Bertasius, G., Shi, L, Torresani, L., Deepedge: A multi-scale bifurcated deep network for top-down contour detection, CVPR (2015)” or“Maninis, K. K., Pont-Tuset, J., Arbelaez, P. and Van Gool, L. “ Convolutional Oriented Boundaries”, ECCV 2016, may be used.

As an advantageous alternative to a conventional approach, however, the device 100 may comprise and use a CNN 300 as shown in FIG. 3. In such a CNN-based device 100, the instance-level boundaries detection 102 may be performed using the CNN 300. The CNN 300 may include a cascade of CNN subnetworks 301, 302. As shown in FIG. 3, an end-to- end mapping may be split into two sub-networks 301, 302. A first subnetwork 301 may be trained and operated to generate class-level semantic boundaries 303 (according to the classes defined in the training set). A second subnetwork 302 may be trained and operated and thereby leam a mapping between a concatenation of the input image 101 and the output of the first subnetwork 301, and thus the instance-level semantic boundaries 108 of the image 101. In addition, the second subnetwork 302 may be trained and operated to provide the center points 107 (e.g., 2D centroids) of the object instances in the image 101. The information may then be used for applying the subsequent modified STIC algorithm 105.

In another functional block of the device 100, a semantic segmentation 103 of the input image 101 of the scene under consideration is performed. At this level, no instance-level semantic segmentation is required. A conventional class-level semantic segmentation algorithm may be used, for example, a conventional STIC, SNIC or BSFIC algorithm.

In another functional block of the device 100, a number 201 of object instances in each segment 106 may be estimated. For example, by using the input image 101, the output (estimated center points 107 in the image 101) of the instance-level semantic boundaries detection 102 and the output (estimated class-level segments 106 in the image 101) of the class-level semantic segmentation 103, two pieces of information are extracted for each class-level segment 106 in the image 101 : First, it is checked, whether the class-level segment 106 under consideration contains multiple instances of the same object (i.e. is a multi-instance segment 109) or only a single object of the corresponding class (i.e. is a single-instance segment). This may be done by checking whether the class-level segment 106 under consideration contains more than one center point 107 or not. Next, in case of a multi-instance segment 109, the number 201 of objects (object instances) inside this specific multi-instance segment 109 can be estimated. This estimation may be done by counting a number of the center points 107 (e.g., 2D centroid estimates) inside the multi instance segment 109. This may be done for each class-level segment 106 in the image 101. The number 201 of instances inside the segment 109 may then be used for the subsequent performing of the modified SLIC algorithm 105.

In another functional block of the device 100, the modified SLIC algorithm 105 is applied. In particular, a superpixel segmentation for each multi-instance segment 109 is carried out by this modified SLIC algorithm 105. This allows separating each object instance. The modification of the conventional SLIC algorithm evolves around the integration of the instance-level semantic boundaries 108, in order to improve the performance of the conventional SLIC superpixels segmentation. The modified SLIC algorithm 105 will preferably use the previously estimated number 201 of instances inside the multi-instance segment 109 as an input.

Notably, during a conventional SLIC superpixel segmentation (conventional SLIC algorithm), each pixel in an image or image segment is compared against the centers of superpixels, whose search regions include the pixel under investigation. This comparison is based on a distance measure D for a search pixel in a search region with respect to a center pixel of the search region, and is usually expressed as

D = Jd + ( )²m² dc represents the color distance (in the CIELAB domain) of the search pixel to the center pixel, ds represents a distance measure in the spatial domain of the search pixel. 5 represents the initial sampling interval, and m is a weighing parameter, which allows weighing the contribution of the spatial distance measure in comparison to the color distance. The search window for probable new pixels for each superpixel is 25x25', i.e. 45² is the total number of pixels in the search region.

Accordingly, the pixel under consideration will be assigned to the superpixel where the distance D between the search pixel and the cluster center pixel is the smallest. The modified SLIC algorithm 105 is similar to this SLIC algorithm. However, in the modified SLIC algorithm 105 performed by the device 100, in addition to the above-described color distance dc and spatial distance measure ds, also the information about the determined instance-level semantic boundaries 108 is integrated into the decision making. This is explained with respect to FIG. 4, which shows an exemplary image 101 under consideration, in which exemplarily two superpixels (particularly their centers 402) and two corresponding search regions 401 (search windows of the superpixels) around the center pixels 402 are illustrated. Of note, one search region 401 corresponding to a first superpixel 1 includes more boundary pixels 403 than another search region 401 corresponding to a second superpixel 2.

The device 100 may be configured to compute the number of boundary pixels 403 inside a search region 401. The number of boundary pixels 403 may then be integrated into the distance measure (referred to as Dnew for the modified SLIC algorithm 105). The integration is specifically such that Dnew increases when more boundary pixels 403 are detected. In this way, a pixel 404 under consideration, as shown in FIG. 4, will be assigned to the superpixel, where less boundary pixels 403 separate the search pixel 404 and the superpixel center 402.

Therefore, the device 100 is configured to calculate, for each search region 401, the number of boundary pixels according to the information provided by the instance-level semantic boundary detection 102. Then, it may normalize the number of boundary pixels 403 by the total number of pixels inside the search region 401 (4 S²). Finally, it may include this into the equation, in order to obtain the distance measure Dnew according to

In the above formula, a second weighing parameter a weights the contribution of the introduced boundary distance to the overall distance measure. Similar as described above, dc represents a color distance in the CIELAB domain of the search pixel 404 to the center pixel 402, ds represents a distance measure in the spatial domain of the search pixel 404 to the center pixel 402, m is a first weighting parameter, m is the number of boundary pixels 403 in the search region 401, and 4 S² is the total number of pixels in the search region 401. After applying the modified version of the SLIC algorithm 105 on each multi-instance segment 109, the device 100 is configured to output instance-level semantic segments 110 defining the desired instance-level semantic segmentation of the input image 101. FIG. 5 shows a method 500 according to an embodiment of the invention. The method 500 is particularly for instance-level semantic segmentation of an image 101. The method 500 comprises a step 501 of performing 501 a class-level semantic segmentation 103 of the image 101 to obtain one or more class-level segments 106, each class-level segment 106 having an object class associated with it. The method 500 also comprises a step 502 of performing an instance-level semantic boundary detection 103 on the image 101 to obtain one or more instance-level boundaries 108 and, for each instance-level boundary 108, an instance-level center point 107. Further, the method 500 comprises estimating 503, for each class-level segment 106, a number of object instances in the class-level segment 106, based on the number of instance-level center points 107 located in the class-level segment 106. In addition, the method 500 comprises a step 504 of performing, for each class-level segment having an estimated number of object instances greater than one, a modified SLIC, algorithm 105 based on the one or more instance-level boundaries 108 to obtain a plurality of superpixels as instance-level segments 110. The invention has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed invention, from the studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word“comprising” does not exclude other elements or steps and the indefinite article“a” or“an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in the mutual different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. Device (100) for instance-level semantic segmentation of an image (101), the device (100) being configured to

perform a class-level semantic segmentation (103) of the image (101) to obtain one or more class-level segments (106), each class-level segment (106) having an object class associated with it,

perform an instance-level semantic boundary detection (102) on the image (101) to obtain one or more instance-level boundaries (108) and for each instance-level boundary (108) an instance-level center point (107),

estimate (104), for each class-level segment (106), a number of object instances in the class-level segment (106) based on the number of instance-level center points (107) located in the class-level segment (106), and

perform, for each class-level segment (109) having an estimated number of object instances greater than one, a modified Simple Linear Iterative Clustering, SLIC, algorithm (105) based on the one or more instance-level boundaries (108) to obtain a plurality of superpixels as instance-level segments (110).

2. Device (100) according to claim 1, configured to

perform, for a given class-level segment (106), the modified SLIC algorithm (105) based on the estimated number (201) of object instances in the segment (106), so as to initialize a number of search regions (401), each around a center pixel (402), corresponding to the estimated number (201) of object instances in the segment (106).

3. Device (100) according to claim 1 or 2, wherein

the modified SLIC algorithm (105) is based on the SLIC algorithm modified to consider a number of boundary pixels (403), calculated from the one or more instance-level boundaries (108), in each search region (401).

4. Device (100) according to any one of claims 1 to 3, configured to

assign, by performing the modified SLIC algorithm (105), a search pixel (404) to a superpixel whose search region (401) includes the smallest number of boundary pixels (403), calculated from the one or more instance-level boundaries (108) separating the search pixel (404) from the center pixel (402) of the search region (401).

5. Device (100) according to any one of claims 1 to 4, configured to

compute, by performing the modified SLIC algorithm (105), a distance measure for a search pixel (404) in a search region (401) with respect to a center pixel (402) of the search region (401), wherein the distance measure is penalized according to the number of boundary pixels (403), calculated from the one or more instance-level boundaries (108), in the search region (401).

6. Device (100) according to claim 5, configured to

compute the distance measure D according to

wherein dc represents a color distance in the CIELAB domain of the search pixel (404) to the center pixel (402), ds represents a distance measure in the spatial domain of the search pixel (404) to the center pixel (402), m is a first weighting parameter, m is the number of boundary pixels (403) in the search region (401), 4 S² is the total number of pixels in the search region (401), and oc is a second weighting parameter.

7. Device (100) according to any one of claims 1 to 6, comprising

a Convolutional Neural Network, CNN, (300) for performing the instance-level semantic boundary detection (102).

8. Device according to claim 7, comprising a cascade of CNN subnetworks (301, 302) and configured to

operate a first subnetwork (301) of the cascade to obtain one or more class-level boundaries (303), and

operate a second subnetwork (302) of the cascade to obtain, for each of the one or more class-level boundaries (303), one or more instance-level boundaries (108) based on the class-level boundary (303) obtained by the first subnetwork (301).

9. Device (100) according to claim 8, further configured to

operate the second subnetwork (302) of the cascade to obtain the one or more instance-level center points (107) based on the one or more instance-level boundaries (108) obtained by the second subnetwork (302).

10. Device (100) according to any one of claims 1 to 9, configured to, for estimating the number (201) of object instances in each class-level segment (106),

determine whether the class-level segment (106) contains a single object instance or multiple object instances, and

estimate, for a class-level segment (109) containing multiple object instances, the number of object instances based on the one or more instance-level center points (107).

11. Device (100) according to any one of claims 1 to 10, comprising

a Convolutional Neural Network, CNN, (300) for performing the class-level semantic segmentation (103) of the image.

12. Method (500) for instance-level semantic segmentation of an image (101), the method (500) comprising

performing (501) a class-level semantic segmentation (103) of the image (101) to obtain one or more class-level segments (106), each class-level segment (106) having an object class associated with it,

performing (502) an instance-level semantic boundary detection (103) on the image (101) to obtain one or more instance-level boundaries (108) and, for each instance-level boundary (108), an instance-level center point (107),

estimating (503), for each class-level segment (106), a number of object instances in the class-level segment (106), based on the number of instance-level center points (107) located in the class-level segment (106), and

performing (504), for each class-level segment (109) having an estimated number of object instances greater than one, a modified Simple Linear Iterative Clustering, SLIC, algorithm (105) based on the one or more instance-level boundaries (108) to obtain a plurality of superpixels as instance-level segments (110).

13. Computer program product comprising a program code for controlling the device (100) according to any one of claims 1 to 11, or for carrying out, when implemented on a processor, the method (500) according to claim 12.