WO2019015785A1 - Method and system for training a neural network to be used for semantic instance segmentation - Google Patents

Method and system for training a neural network to be used for semantic instance segmentation

Info

Publication number
WO2019015785A1
Authority
WO
WIPO (PCT)
Prior art keywords
vectors
neural network
vector
loss function
template image
Application number
PCT/EP2017/068550
Other languages
French (fr)
Inventor
Hiroaki Shimizu
Davy NEVEN
Bert DE BRABANDERE
Luc Van Gool
Marc PROESMANS
Nico CORNELIS
Original Assignee
Toyota Motor Europe
Katholieke Universiteit Leuven
Application filed by Toyota Motor Europe and Katholieke Universiteit Leuven
Priority to PCT/EP2017/068550 (WO2019015785A1)
Priority to JP2020502990A (JP6989688B2)
Publication of WO2019015785A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Definitions

  • The center of the vectors of an element may be determined as the mean vector of all the vectors belonging to a same element of the template image.
  • Because the loss function treats all the vectors in a computationally efficient manner, it is possible to obtain a loss which reaches the target value quickly and which actually means that all the vectors meet the expected requirements. A fast convergence of the training is obtained.
  • The loss function decreases until reaching the target value at least when, for each vector belonging to an element, the distance between the vector and a center of the vectors of this element decreases until this distance is less than or equal to a first predefined distance threshold.
  • The loss function decreases until reaching the target value at least when the distances between all the centers of the vectors of each element increase until each of these distances is greater than or equal to a second predefined distance threshold.
  • The loss function is:
  • L = α·Lvar + β·Ldist, with
  • Lvar = (1/C) Σc (1/Nc) Σi [max(0, ‖μc − xi‖ − δv)]²
  • Ldist = (1/(C(C−1))) Σ(cA≠cB) [max(0, 2δd − ‖μcA − μcB‖)]²
  • where C is the number of elements, Nc the number of vectors in element c, xi the i-th vector of element c, μc the center of the vectors of element c, ‖·‖ the distance, δv the first predefined distance threshold and 2δd the second predefined distance threshold.
  • This loss function can be computed efficiently.
  • The loss function is further defined so as to decrease until reaching the target value at least when the distance between each center of the vectors of each element and the origin of the space of the vectors decreases.
  • In this case, the loss function comprises an additional term and is:
  • L = α·Lvar + β·Ldist + γ·Lreg
  • Lreg is a term which pulls the vectors towards the origin of the space of the vectors, for example Lreg = (1/C) Σc ‖μc‖.
  • γ is preferably much less than α or β as it plays a less preponderant role in the loss function.
  • For example, α or β can have a value equal to 1 and γ can be 0.001.
  • Preferably, the coordinates of each pixel of an image inputted to the neural network are also inputted to the neural network.
  • Without this information, elements which have a similar appearance and are arranged in a specific manner may not be considered as two separate instances or elements.
  • With the coordinates, the neural network receives enough information to differentiate the two elements.
  • The invention also provides a method for semantic instance segmentation comprising using the neural network trained using the above-defined method.
  • The method further comprises a post-processing step in which the mean-shift algorithm or the k-means algorithm is applied to the vectors outputted by the neural network.
  • Since the vectors are likely to be placed in distinct and separate hyperspheres, the implementation of the mean-shift algorithm or of the k-means algorithm is facilitated. These algorithms facilitate the identification of pixels belonging to an object.
  • The invention also provides a system for training iteratively a neural network to be used for semantic instance segmentation, wherein, for each iteration, the neural network is configured to output a vector for each pixel of a template image,
  • wherein the template image comprises predefined elements each associated with pixels of the template image and the corresponding vectors.
  • The system comprises a module for calculating a loss using a loss function for each iteration, the loss function being defined so as to decrease until reaching a target value at least when:
  - for each vector belonging to an element, the distance between the vector and a center of the vectors of this element decreases, and
  - the distances between all the centers of the vectors of each element increase.
  • This system may be configured to perform all the embodiments of the method for training a neural network as defined above.
  • The invention also provides a system for image semantic instance segmentation comprising the neural network trained using the method for training a network as defined above.
  • The steps of the method for training a neural network and/or the steps of the method for semantic instance segmentation are determined by computer program instructions.
  • The invention is also directed to a computer program for executing the steps of a method as described above when this program is executed by a computer.
  • This program can use any programming language and take the form of source code, object code or a code intermediate between source code and object code, such as a partially compiled form, or any other desirable form.
  • The invention is also directed to a computer-readable information medium containing instructions of a computer program as described above.
  • The information medium can be any entity or device capable of storing the program.
  • The medium can include storage means such as a ROM, for example a CD-ROM or a microelectronic circuit ROM, or magnetic storage means, for example a diskette (floppy disk) or a hard disk.
  • The information medium can be an integrated circuit in which the program is incorporated, the circuit being adapted to execute the method in question or to be used in its execution.
  • FIG. 1 is a block diagram of an exemplary method for training a neural network
  • FIG. 2 is a block diagram of an exemplary semantic instance segmentation method
  • FIG. 3 is a schematic diagram of a system for training a neural network and a system for semantic instance segmentation
  • FIG. 4 is a representation of the vectors outputted by a neural network
  • FIG. 5 illustrates the training of a neural network
  • FIG. 6 illustrates the effect of inputting the coordinates of pixels to the neural network.
  • This training is performed using a template image 1 comprising various elements, for example elements of the same type which may or may not overlap (for example two overlapping cars).
  • Each element in the template image is previously known, and each pixel of this template image has a previously known association with an element (for example car number 1, car number 2, background, etc.).
  • The neural network to be trained transforms the template image into a plurality of vectors, each vector corresponding to a pixel of the template image.
  • This plurality of vectors is sometimes called a tensor by the person skilled in the art, and this tensor has the same height and width as the template image, but a different depth, equal to the length of the vectors.
  • The length of the vectors can be chosen depending on the neural network to be trained, or depending on the application. All the vectors have the same length and they all belong to the same vector space.
  • Vectors outputted by a neural network are sometimes called pixel embeddings by the person skilled in the art.
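  • As a purely illustrative sketch (shapes only; the array names are not from the application), the relationship between a template image and the outputted tensor of pixel embeddings can be written as:

```python
import numpy as np

H, W, D = 4, 6, 8            # image height, width, and vector (embedding) length
image = np.zeros((H, W, 3))  # an RGB template image

# The network outputs one D-dimensional vector per pixel: a tensor with
# the same height and width as the template image, but a depth of D.
embeddings = np.zeros((H, W, D))

assert embeddings.shape[:2] == image.shape[:2]  # same height and width
vectors = embeddings.reshape(-1, D)             # one pixel embedding per pixel
print(vectors.shape)  # (24, 8)
```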
  • Preferably, the neural network is initially, before the training, a neural network already able to perform semantic segmentation.
  • The skilled person will know which neural networks can already perform semantic segmentation.
  • For example, the neural network may be a neural network known to the skilled person under the name "SegNet" and described in the document "A deep convolutional encoder-decoder architecture for image segmentation" (V. Badrinarayanan et al., arXiv preprint arXiv:1511.00561, 2015), or a neural network described in the document "Fully convolutional networks for semantic segmentation" (J. Long et al., CVPR, 2015).
  • The neural network outputs vectors referenced 2 on figure 1.
  • In a step E02, the loss is calculated using a loss function which delivers a scalar value, positive or zero, often called a loss.
  • The loss function L can be a linear combination of two terms: L = α·Lvar + β·Ldist, with:
  • α: a predefined constant, preferably positive, for example equal to 1
  • β: a predefined constant, preferably positive, for example equal to 1
  • Lvar: a term which decreases until reaching zero at least when, for each vector belonging to an element, the distance between the vector and a center of the vectors of this element decreases
  • Ldist: a term which decreases until reaching zero at least when the distances between all the centers of the vectors of each element increase.
  • α and β may be chosen through a grid search or a hyperparameter search with an evaluation performed on a validation set. This can be performed by trying different settings in a structured way so as to choose the best values for α and β.
  • The loss function can decrease until reaching zero at least when, for each vector belonging to an element, the distance between the vector and a center of the vectors of this element decreases until this distance is less than or equal to a first predefined distance threshold δv.
  • For example, Lvar = (1/C) Σc (1/Nc) Σi [max(0, ‖μc − xi‖ − δv)]², with C the number of elements, Nc the number of vectors in element c, xi the i-th vector of element c, and μc the center of the vectors of element c.
  • The distance can be the L1 or L2 distance well known to the person skilled in the art.
  • The loss function can decrease until reaching zero at least when the distances between all the centers of the vectors of each element increase until each of these distances is greater than or equal to a second predefined distance threshold.
  • Ldist and Lvar are defined so as to ensure that when the loss is equal to zero all the vectors associated with an object are located inside a hypersphere having a radius equal to δv and the centers of all the hyperspheres are separated by at least 2δd.
  • Preferably, δd is superior to 2δv.
  • The loss function can be further defined so as to decrease until reaching zero at least when the distance between each center of the vectors of each element and the origin of the space of the vectors decreases.
  • In this case, the loss function comprises an additional term Lreg and is: L = α·Lvar + β·Ldist + γ·Lreg.
  • γ is preferably much less than α or β as it plays a less preponderant role in the loss function.
  • For example, α or β can have a value equal to 1 and γ can be 0.001.
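  • As an illustrative sketch of the above terms (the function name, the exact normalization and the default values of δv and δd are assumptions, not taken from the application), the loss can be written as:

```python
import numpy as np

def instance_loss(vectors, labels, delta_v=0.5, delta_d=1.5,
                  alpha=1.0, beta=1.0, gamma=0.001):
    """Sketch of a loss with the stated properties.

    vectors: (N, D) array, one embedding per pixel.
    labels:  (N,) array, the known element of each pixel.
    """
    elements = np.unique(labels)
    # Center of each element: the mean of the vectors belonging to it.
    centers = np.array([vectors[labels == c].mean(axis=0) for c in elements])

    # Lvar: pulls each vector to within delta_v of its element's center.
    l_var = 0.0
    for c, mu in zip(elements, centers):
        dists = np.linalg.norm(vectors[labels == c] - mu, axis=1)
        l_var += np.mean(np.maximum(0.0, dists - delta_v) ** 2)
    l_var /= len(elements)

    # Ldist: pushes every pair of centers at least 2 * delta_d apart.
    l_dist, n_pairs = 0.0, 0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            gap = np.linalg.norm(centers[i] - centers[j])
            l_dist += max(0.0, 2 * delta_d - gap) ** 2
            n_pairs += 1
    if n_pairs:
        l_dist /= n_pairs

    # Lreg: a small pull of the centers towards the origin.
    l_reg = np.mean(np.linalg.norm(centers, axis=1))

    return alpha * l_var + beta * l_dist + gamma * l_reg
```

  • With tight clusters whose centers are far apart, the hinged terms Lvar and Ldist vanish and only the small γ-weighted term remains, so the loss approaches its target value.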
  • Once the loss of step E02 has been calculated, if this loss is equal to zero then it is considered that the neural network is trained. Alternatively, it can be considered that the neural network is trained when the loss is below a predefined threshold.
  • Step E03 is then performed and, in this step, the parameters or weights of the neural network are adjusted using the loss calculated in step E02.
  • Step E03 can be performed for example using the method known to the skilled person as stochastic gradient descent.
  • The method then comprises a step E04 which consists in performing at least steps E01 and E02 again with the adjusted neural network.
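  • The iteration of steps E01 to E04 can be sketched as follows; for simplicity, a toy stand-in is used in which the "network output" is adjusted directly with a hand-derived gradient of a simple pull-to-center term, instead of backpropagating through a real network:

```python
import numpy as np

def pull_loss(vectors, labels):
    """Toy Lvar-style loss: mean squared distance of each vector to the
    center of the vectors of its element."""
    loss = 0.0
    elements = np.unique(labels)
    for c in elements:
        mu = vectors[labels == c].mean(axis=0)
        loss += np.mean(np.sum((vectors[labels == c] - mu) ** 2, axis=1))
    return loss / len(elements)

rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 2))      # E01: the "network output"
labels = np.array([0] * 10 + [1] * 10)  # known pixel-to-element association

lr, history = 0.1, []
for iteration in range(50):                     # E04: iterate
    history.append(pull_loss(vectors, labels))  # E02: calculate the loss
    for c in np.unique(labels):                 # E03: adjust by gradient descent
        mask = labels == c
        mu = vectors[mask].mean(axis=0)
        # gradient of the squared distance to the (fixed) center
        vectors[mask] -= lr * 2 * (vectors[mask] - mu)
print(history[0], history[-1])  # the loss decreases towards zero
```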
  • Training the neural network can be performed on a plurality of different template images. Once the neural network has been trained using the method disclosed on figure 1, it can be used for semantic instance segmentation, as represented on figure 2.
  • The method of figure 2 is performed on an image referenced 3, for example an image which has been acquired by a camera.
  • This image can comprise, for example, two partially overlapping cars and two partially overlapping pedestrians.
  • In a step E11, the image 3 is inputted to the trained neural network so as to perform semantic instance segmentation.
  • Vectors 4 are obtained as the output of the trained neural network.
  • Then, a post-processing step E12 is carried out.
  • Because the neural network has been trained using the above-defined loss function, the vectors are close to being in separate hyperspheres. It should be noted that in most cases, and when used on a real image (and not a template image), the loss is typically slightly above zero.
  • A sub-step E120 is performed in which the k-means or mean-shift algorithm is used on the vectors so as to group together in clusters the pixels which should belong to the same object. This increases the robustness of the method.
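  • A minimal sketch of such a grouping (a greedy threshold-based variant in the spirit of mean-shift; the function name and radius value are illustrative assumptions):

```python
import numpy as np

def cluster_embeddings(vectors, radius=1.5):
    """Greedy grouping: each still-unassigned vector seeds a new cluster
    that absorbs every unassigned vector within `radius` of the seed."""
    labels = np.full(len(vectors), -1)
    next_label = 0
    for i in range(len(vectors)):
        if labels[i] != -1:
            continue
        close = np.linalg.norm(vectors - vectors[i], axis=1) <= radius
        labels[close & (labels == -1)] = next_label
        next_label += 1
    return labels
```

  • If the training has placed the embeddings in well-separated hyperspheres, a radius of the order of δd recovers one cluster, and hence one instance, per element.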
  • A final image 5 with semantic instance segmentation is outputted.
  • The system S1, which can be a computer, comprises a processor PR1 and a non-volatile memory MEM1.
  • A set of instructions INST1 is stored in the non-volatile memory MEM1.
  • The set of instructions INST1 comprises instructions to perform a method for training a neural network to perform semantic instance segmentation, for example the method described in reference to figure 1.
  • The non-volatile memory MEM1 further comprises a neural network NN and at least one template image TIMG.
  • The neural network NN can be used in a separate system S2 configured to perform semantic instance segmentation.
  • The neural network NN can be communicated to the system S2 using a communication network INT, for example the Internet.
  • The system S2 comprises a processor PR2 and a non-volatile memory MEM2 in which a set of instructions INST2 is stored to perform semantic instance segmentation using an image IMG stored in the non-volatile memory MEM2 and the trained neural network TNN also stored in the non-volatile memory MEM2.
  • Figure 4 is a schematic representation of the vectors outputted by a neural network.
  • In this example, the neural network which has been used is not fully trained and the loss is not equal to zero or below a predefined threshold. Also, in this example and for the sake of simplicity, the neural network outputs vectors having a length of 2, which allows using a two-dimensional representation.
  • The various vectors outputted by the neural network are represented as dots 10, 20, and 30, each associated with a pixel of the template image inputted to the neural network.
  • Each pixel of the template image has a known association with an element visible on the template image.
  • The same is also true for the vectors outputted by the neural network:
  • the vectors referenced 10 are all associated with a first object,
  • the vectors referenced 20 are all associated with a second object, and
  • the vectors referenced 30 are all associated with a third object. Even if the training of the neural network is still being performed, the vectors 10, 20 and 30 already substantially form clusters of vectors respectively referenced C1, C2, and C3. It is then possible to determine the centers 11, 12, and 13, respectively of cluster C1, cluster C2, and cluster C3.
  • The loss function is defined so that the vectors 10 get closer (after each iteration of the training) to the center 11, and more precisely so that the vectors 10 are within a distance from the center which is less than a first predefined distance threshold δv.
  • The vectors 10 are thus expected to all be inside the circle having a radius δv and a center 11 represented on the figure,
  • the vectors 20 are expected to all be inside the circle having a radius δv and a center 12 represented on the figure, and
  • the vectors 30 are expected to all be inside the circle having a radius δv and a center 13 represented on the figure.
  • The loss function is further defined so that the centers 11, 12 and 13 get further away from each other (after each iteration of the training), and more precisely so that the centers are each separated by at least a second predefined distance threshold equal to 2δd.
  • The circles corresponding to this threshold and having centers 11, 12 and 13 are also represented on the figure.
  • The movements of the vectors carried out to move the clusters away from each other are represented using thick arrows on the figure.
  • Figure 5 illustrates the training of a neural network through various representations.
  • A template image 100 is represented on the figure, and this image 100 is a photograph of a plant having a variety of leaves and a background to be segmented.
  • Each pixel in the template image 100 has a known association with a specific leaf and it is possible to represent the image 100 as the final segmented image 200 shown below the template image 100.
  • The skilled person may refer to the segmented image 200 as the "ground truth".
  • The row referenced 300 on figure 5 represents the positions of the vectors outputted by the neural network (in this example, the output of the network is in two dimensions) at seven different stages of the training of the neural network, in consecutive order from left to right. This training is done by using the stochastic gradient descent method to adjust the neural network after each training iteration.
  • The seven different stages represented in the row 300 correspond respectively to 0, 2, 4, 8, 16, 32, and 64 adjustments to the neural network using the stochastic gradient descent method.
  • The row referenced 400 represents the output of the neural network without a post-processing step.
  • The images of this row are obtained by taking the output of the neural network, which delivers vectors having two dimensions, and using each component of each vector respectively as a red value and as a green value, the blue value being set at zero (the figure is in greyscale).
  • The row referenced 500 represents the result of a post-processing step in which a thresholding is performed with a radius equal to δd.
  • Figure 6 illustrates the effect of inputting the coordinates of pixels to the neural network.
  • The wording "location awareness" refers to the inputting of the coordinates of each pixel to the neural network.
  • The output of the neural network (vectors and corresponding image) is then shown for the two cases, which are with and without location awareness.
  • Without location awareness, the neural network has difficulty differentiating the two squares when they are respectively close to the upper left corner and the lower right corner.
  • With location awareness, the neural network is always able to differentiate the two squares.
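  • Location awareness can be sketched as appending two channels holding each pixel's normalized coordinates to the image before it is inputted to the neural network (an assumed implementation; the application does not specify the exact encoding):

```python
import numpy as np

def add_coordinate_channels(image):
    """Append normalized x and y coordinate channels to an H x W x C image,
    so that two identical-looking elements at different positions produce
    different inputs to the network."""
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.linspace(0.0, 1.0, h), np.linspace(0.0, 1.0, w),
                         indexing="ij")
    return np.concatenate([image, xs[..., None], ys[..., None]], axis=2)
```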
  • The above-described embodiments allow obtaining a neural network which can be used for semantic instance segmentation with good results.


Abstract

A method for training iteratively a neural network to be used for semantic instance segmentation, wherein, for each iteration, the neural network outputs a vector (10, 20, 30) for each pixel of a template image, wherein the template image comprises predefined elements each associated with pixels of the template image and the corresponding vectors, characterized in that training the neural network is performed using a loss function defined so that the loss function decreases until reaching a target value at least when: - for each vector belonging to an element, the distance between the vector and a center of the vectors of this element decreases, and - the distances between all the centers of the vectors of each element increase. The invention also concerns a method for semantic instance segmentation and corresponding systems.

Description

Method and system for training a neural network to be used for semantic instance segmentation
Field of the invention
The present invention relates to the field of semantic instance segmentation, and more precisely to the training of neural networks used for semantic instance segmentation.
Description of the Related Art
Semantic instance segmentation is a method for determining the types of objects in an image, for example acquired by a camera, while being able to differentiate objects of a same type.
In the prior art, both instance segmentation methods and semantic segmentation methods have been proposed and these methods often use neural networks or deep neural networks. Semantic segmentation methods have been used to differentiate objects having different types on an image. A semantic segmentation method cannot differentiate two objects having the same type. For example, if an image to be analyzed comprises two overlapping cars and two overlapping pedestrians, a semantic segmentation method will detect an area in the image corresponding to cars and an area in the image corresponding to pedestrians. Various methods for semantic segmentation have been proposed which generally use (deep) convolutional networks.
Instance segmentation methods only aim at identifying separate objects regardless of their type. If the above-mentioned image is analyzed using an instance segmentation method then what will be detected is four separate objects. Various methods have been proposed to achieve this, and most notably methods using (deep) convolutional networks. Some known methods require specific network architectures or rely on object proposals. For example, some methods use a multistage (or cascaded) pipeline in which the object proposal (or bounding box generation) is followed by a separate segmentation and/or classification step. These methods are not satisfactory in terms of speed (because of the multistage computations) and segmentation quality (in particular when faced with occlusions).
It follows from the above that instance segmentation methods and semantic segmentation methods do not provide a complete answer on what actually appears on an image, and semantic instance segmentation methods are thus needed. A desired output of a semantic instance segmentation method for the above image could be a mask highlighting each car and each pedestrian with different colors and labels indicating, for example, car1, car2, pedestrian1, pedestrian2.
Various approaches have been proposed and most notably, some approaches suggest training a neural network to transform an image into a representation that is clustered, and in which each cluster of points corresponds to an instance (sometimes referred to as an element) in the image. This clustered representation may then be post-processed to obtain a representation of the image which highlights the different elements.
It should be noted that training a neural network is an iterative task which can be performed using template images in which each element has already been identified, and a loss-function.
A loss function typically consists in a calculation performed on the output of a neural network to determine if this output is valid, i.e. this output leads to a good detection of each element and its type. The loss function is generally a score which represents how far the output of a neural network is from an expected output.
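As a purely generic illustration (not the loss function of the invention), a loss can be as simple as the mean squared distance between the network output and the expected output:

```python
# A generic loss: a score of how far an output is from the expected output.
def mse_loss(output, expected):
    assert len(output) == len(expected)
    return sum((o - e) ** 2 for o, e in zip(output, expected)) / len(output)

print(mse_loss([1.0, 2.0], [1.0, 2.0]))  # a perfect output scores 0.0
print(mse_loss([1.0, 2.0], [0.0, 2.0]))  # a worse output scores higher: 0.5
```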
Defining a loss function for a semantic instance segmentation method is highly critical, and known loss functions are not satisfactory.
The document "Semantic Instance Segmentation via Deep Metric Learning" by Fathi et al. (hereinafter "Fathi", available for download on the arXiv pre-print server at the URL https://arxiv.org/pdf/1703.10277.pdf) discloses a known method for semantic instance segmentation.
The method of this document uses a loss function which will ensure that pixels that correspond to a same instance object are close in the space of the output of the neural network (typically, for each pixel in an image, a neural network outputs a vector). This loss function also ensures that pixels that correspond to different objects remain far from each other in the network's output representation. The loss function of this document therefore has a lower value when in the output of a neural network the vectors of pixels of a same object are close and when the vectors of pixels of different objects are far, and a higher value otherwise.
The neural network is then modified by taking into account the result of the loss function so as to obtain, in the next iteration, a lower score for the loss function.
The loss function of this document is unsatisfactory. More precisely, the loss function of this document relies on the random selection of a limited number of vectors for each object in the image, and uses extensive calculations.
The selection of a limited number of vectors also leads to a loss function which can be equal to zero while not all pixels verify the above conditions. This leads to a slow convergence of the training.
It is a primary object of the invention to provide methods and system that overcome the deficiencies of the currently available systems and methods.
Summary of the invention
The present invention overcomes one or more deficiencies of the prior art by proposing a method for training iteratively a neural network to be used for semantic instance segmentation, wherein, for each iteration, the neural network outputs a vector for each pixel of a template image, wherein the template image comprises predefined elements each associated with pixels of the template image and the corresponding vectors.
Training the neural network is performed using a loss function in which:
for each vector belonging to an element, the distance between the vector and a center of the vectors of this element is calculated,
the distances between all the centers of the vectors of each element are calculated,
so that the loss function decreases until reaching a target value (for example zero) at least when:
- for each vector belonging to an element, the distance between the vector and a center of the vectors of this element decreases, and
- the distances between all the centers of the vectors of each element increase.
The target value is a value which is desired to obtain for the loss function. When the loss reaches or goes below the target value it can be considered that the training is complete. Optionally, the target value can be predetermined. In some iterations, or when used on a real image, the target value may not be reached. An element may be an object appearing on the template image. In the template image, there is a plurality of elements which may be of the same type or of different types. Each pixel in the template image has a known association with an element.
It has been observed that in the space of the vectors outputted by the neural network, a good semantic instance segmentation is obtained when:
- all the vectors of an element remain close together in a cluster of vectors, and
- clusters of vectors associated with different elements are spaced apart.
Preferably, the neural network is a neural network already able to perform semantic segmentation, and the above-defined loss function is defined so as to train the network to also perform instance segmentation. The invention advantageously applies on any already available neural network which may have been trained for semantic segmentation, without requiring modifications of the architecture of the neural network.
The inventors of the present invention have observed that using a neural network which has already been trained for semantic segmentation allows obtaining better results for semantic instance segmentation.
The above loss function allows obtaining this result. Additionally, and contrary to the loss function of document Fathi mentioned above, the loss function of the invention is defined so as to take all the vectors into account (the distance between each vector and the corresponding center is calculated, and the distances between all the centers are calculated).
The use of centers of the vectors of an element allows taking into account all the vectors while limiting the computational requirements. As a matter of fact, if all the vectors of an element are close to their corresponding center, then the vectors of this element are all close together.
Also, if all the different centers remain far away from each other, then the vectors of an element are far away from the vectors of another element.
By way of example, the center of the vectors of an element may be determined as the mean vector of all the vectors belonging to a same element of the template image.
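By way of illustration, the computation of each element's center as the mean of its vectors can be sketched as follows in plain Python. The function name and the flat list layout of `embeddings` and `labels` are assumptions made for this example, not taken from the patent text.

```python
# Sketch: compute the center of the vectors of each element as their mean
# vector. "embeddings" is a flat list of output vectors; "labels" gives
# the known element of each corresponding pixel.

def element_centers(embeddings, labels):
    """Return a dict mapping each element label to its mean vector."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for d, value in enumerate(vec):
            acc[d] += value
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc]
            for lab, acc in sums.items()}
```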
Because the loss function treats all the vectors in a computationally efficient manner, it is possible to obtain a loss which reaches the target value quickly and which actually means that all the vectors meet the expected requirements. A fast convergence of the training is obtained.
According to a particular embodiment, the loss function decreases until reaching the target value at least when, for each vector belonging to an element, the distance between the vector and a center of the vectors of this element decreases until this distance is less than or equal to a first predefined distance threshold.
Thus, when the training is finished or has converged (i.e. when the loss function has reached the target value): all the vectors of an element will be inside a hypersphere centered on the center of the vectors of this element, this hypersphere having a radius equal to the first predefined distance threshold.
According to a particular embodiment, the loss function decreases until reaching the target value at least when the distances between all the centers of the vectors of each element increase until each of the distances is greater than or equal to a second predefined distance threshold.
Thus, when the training is finished or has converged, (i.e. when the loss function has reached the target value): all the vectors of an element will be spaced apart from the vectors of another element by at least the second predefined distance threshold.
If the first and second predefined distance thresholds are used, then all the vectors from an element are within a hypersphere which is spaced apart from other hyperspheres of other elements.
It should be noted that this result cannot be obtained using the loss function disclosed in document Fathi.

According to a particular embodiment, the loss function is:

L = α · Lvar + β · Ldist

with:

$$L_{var} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c}\left[\lVert \mu_c - x_i \rVert - \delta_v\right]_+^2$$

$$L_{dist} = \frac{1}{C(C-1)}\sum_{c_A=1}^{C}\;\sum_{\substack{c_B=1\\ c_B \neq c_A}}^{C}\left[2\delta_d - \lVert \mu_{c_A} - \mu_{c_B} \rVert\right]_+^2$$

α a predefined constant,
β a predefined constant,
δv the first predetermined distance threshold,
δd being half the second predetermined distance threshold,
xi a vector of index i,
[x]+ the positive part of x,
||x1 - x2|| the distance between vector x1 and vector x2,
C the number of elements in the template image,
Nc the number of vectors in element c,
μc the center of element c, and
cA and cB elements A and B.
This loss function can be computed efficiently.
According to a particular embodiment, the loss function is further defined so as to decrease until reaching the target value at least when the distances between each center of the vectors of each element and the origin of the space of the vectors decrease.
This prevents the vectors from being too far away from the origin of the space of the vectors. This feature prevents the emergence of mathematical errors (for example infinity errors known to the skilled person).
According to a particular embodiment, the loss function comprises an additional term and is:

L = α · Lvar + β · Ldist + γ · Lreg

with:

$$L_{reg} = \frac{1}{C}\sum_{c=1}^{C}\lVert \mu_c \rVert$$

γ being a predefined constant. Thus, Lreg is a term which pulls the vectors towards the origin of the space of the vectors. It should be noted that γ is preferably much less than α or β as it plays a less preponderant role in the loss function. For example, α or β can have a value equal to 1 and γ can be 0.001.
According to a particular embodiment, for each pixel of an image inputted to the neural network, the coordinates of this pixel are also inputted to the neural network.
It has been observed by the inventors that elements which have a similar appearance arranged in a specific manner (for example an element in an upper right corner and a similar element in a lower left corner) may not be considered as two separate instances or elements. By inputting the coordinates of the pixels of the template image to the neural network, the neural network receives enough information to differentiate the two elements.
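By way of illustration, appending normalized pixel coordinates as extra input channels can be sketched as follows. The function name and the row-major list-of-pixels layout are assumptions made for this example.

```python
# Sketch: append normalized (x, y) coordinates, in [0, 1], to each
# pixel's channel list, so the network can tell spatially separated but
# similar-looking elements apart.

def add_coordinate_channels(pixels, height, width):
    """pixels: row-major list of per-pixel channel lists."""
    out = []
    for row in range(height):
        for col in range(width):
            x = col / (width - 1) if width > 1 else 0.0
            y = row / (height - 1) if height > 1 else 0.0
            out.append(pixels[row * width + col] + [x, y])
    return out
```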
The invention also provides a method for semantic instance segmentation comprising using the neural network trained using the above defined method.
According to a particular embodiment, the method further comprises a post-processing step in which the mean-shift algorithm or the k-means algorithm is applied to the vectors outputted by the neural network.
In the output of the trained network, the vectors are likely to be placed in distinct and separate hyperspheres, which facilitates the implementation of the mean-shift algorithm or of the k-means algorithm. These algorithms facilitate the identification of pixels belonging to an object.
The invention also provides a system for training iteratively a neural network to be used for semantic instance segmentation, wherein, for each iteration, the neural network is configured to output a vector for each pixel of a template image,
wherein the template image comprises predefined elements each associated with pixels of the template image and the corresponding vectors.
The system comprises a module for calculating a loss using a loss function for each iteration, the loss function being defined so as to decrease until reaching a target value at least when:
- for each vector belonging to an element, the distance between the vector and a center of the vectors of this element decreases, and
- the distances between all the centers of the vectors of each element increase.
This system may be configured to perform all the embodiments of the method for training a neural network as defined above.
The invention also provides a system for image semantic instance segmentation comprising the neural network trained using the method for training a network as defined above.
In one particular embodiment, the steps of the method for training a neural network and/or the steps of the method for semantic instance segmentation are determined by computer program instructions.
Consequently, the invention is also directed to a computer program for executing the steps of a method as described above when this program is executed by a computer.
This program can use any programming language and take the form of source code, object code or a code intermediate between source code and object code, such as a partially compiled form, or any other desirable form.
The invention is also directed to a computer-readable information medium containing instructions of a computer program as described above.
The information medium can be any entity or device capable of storing the program. For example, the medium can include storage means such as a ROM, for example a CD ROM or a microelectronic circuit ROM, or magnetic storage means, for example a diskette (floppy disk) or a hard disk.
Alternatively, the information medium can be an integrated circuit in which the program is incorporated, the circuit being adapted to execute the method in question or to be used in its execution.
Brief description of the drawings
How the present invention may be put into effect will now be described by way of example with reference to the appended drawings, in which:
- figure 1 is a block diagram of an exemplary method for training a neural network,
- figure 2 is a block diagram of an exemplary semantic instance segmentation method,
- figure 3 is a schematic diagram of a system for training a neural network and a system for semantic instance segmentation,
- figure 4 is a representation of the vectors outputted by a neural network,
- figure 5 illustrates the training of a neural network, and
- figure 6 illustrates the effect of inputting the coordinates of pixels to the neural network.
Description of the embodiments

A method for training iteratively a neural network is represented on figure 1.
This training is performed using a template image 1 comprising various elements, for example elements of the same type which may or may not overlap (for example two overlapping cars).
The position of each element in the template image is previously known, and each pixel of this template image has a previously known association with an element (for example car number 1, car number 2, background, etc.).
In a first step E01, the neural network to be trained transforms the template image into a plurality of vectors, each vector corresponding to a pixel of the template image. This plurality of vectors is sometimes called a tensor by the person skilled in the art, and this tensor has the same height and width as the template image, but a different depth equal to the length of the vectors.
The length of the vectors can be chosen depending on the neural network to be trained, or depending on the application. All the vectors have the same length and they all belong to the same vector space.
It should be noted that vectors outputted by a neural network are sometimes called pixel embedding by the person skilled in the art.
Preferably, the neural network is initially, before the training, a neural network already able to perform semantic segmentation. The skilled person will know which neural network can already perform semantic segmentation.
By way of example, the neural network may be a neural network known to the skilled person under the name "SegNet" and described in document "A deep convolutional encoder-decoder architecture for image segmentation" (V. Badrinarayanan et al., arXiv preprint arXiv:1511.00561, 2015), or a neural network described in document "Fully convolutional networks for semantic segmentation" (J. Long et al., CVPR, 2015).
The neural network outputs vectors referenced 2 on figure 1.
These vectors are then used to calculate a loss in step E02. The loss is calculated using a loss function which delivers a scalar value, positive or zero, called the loss.
More precisely, in the loss function, for each vector belonging to an element, the distance between the vector and a center of the vectors of this element is calculated,
the distances between all the centers of the vectors of each element are calculated.
These calculations are used to define a loss function which decreases (between two consecutive iterations) until reaching the target value, which is zero in the present example, at least when:
- for each vector belonging to an element, the distance between the vector and a center of the vectors of this element decreases, and
- the distances between all the centers of the vectors of each element increase.
For example, the loss function L can be a linear combination of two terms:
L = a · Lvar + β · Ldist
With:
α a predefined constant (preferably positive, for example equal to 1),
β a predefined constant (preferably positive, for example equal to 1),
Lvar a term which decreases until reaching zero at least when, for each vector belonging to an element, the distance between the vector and a center of the vectors of this element decreases,
Ldist a term which decreases until reaching zero at least when the distances between all the centers of the vectors of each element increase.

α and β may be chosen through a grid search or a hyperparameter search with an evaluation performed on a validation set. This can be performed by trying different settings in a structured way so as to choose the best values for α and β.
It should be noted that these values may both be set at 1.
For example, the loss function can decrease until reaching zero at least when, for each vector belonging to an element, the distance between the vector and a center of the vectors of this element decreases until this distance is less than or equal to a first predefined distance threshold.
Thus, this example can be implemented by using a term Lvar written as:

$$L_{var} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c}\left[\lVert \mu_c - x_i \rVert - \delta_v\right]_+^2$$

With:
δv the first predetermined distance threshold,
xi a vector of index i,
[x]+ the positive part of x,
||x1 - x2|| the distance between vector x1 and vector x2,
C the number of elements in the template image,
Nc the number of vectors in element c, and
μc the center of element c.
It should be noted that the distance can be the L1 or L2 distance well known to the person skilled in the art.
Also for example, the loss function can decrease until reaching zero at least when the distances between all the centers of the vectors of each element increase until each of the distances is greater than or equal to a second predefined distance threshold.
Thus, this example can be implemented by using a term Ldist written as:

$$L_{dist} = \frac{1}{C(C-1)}\sum_{c_A=1}^{C}\;\sum_{\substack{c_B=1\\ c_B \neq c_A}}^{C}\left[2\delta_d - \lVert \mu_{c_A} - \mu_{c_B} \rVert\right]_+^2$$

With (the notations used above for Lvar are also used for Ldist):
δd being half the second predetermined distance threshold, and
cA and cB elements A and B.

The two above-defined terms Ldist and Lvar are defined so as to ensure that, when the loss is equal to zero, all the vectors associated with an object are located inside a hypersphere having a radius equal to δv and the centers of all the hyperspheres are separated by at least 2δd.
Preferably, δd is greater than 2δv.
By way of example, it should be noted that the loss function can be further defined so as to decrease until reaching zero at least when the distances between each center of the vectors of each element and the origin of the space of the vectors decrease.
In this example, the loss function comprises an additional term Lreg and is:

L = α · Lvar + β · Ldist + γ · Lreg

With:

$$L_{reg} = \frac{1}{C}\sum_{c=1}^{C}\lVert \mu_c \rVert$$

γ being a predefined constant.

It should be noted that γ is preferably much less than α or β as it plays a less preponderant role in the loss function. For example, α or β can have a value equal to 1 and γ can be 0.001.
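The complete loss L = α · Lvar + β · Ldist + γ · Lreg can be sketched in plain Python with the L2 distance, as below. The names `embeddings` and `labels` are illustrative; a real implementation would use a deep-learning framework so that gradients can be backpropagated through the loss.

```python
# Minimal sketch of the loss described above (not the patent's code):
# L = alpha * L_var + beta * L_dist + gamma * L_reg, with the L2 distance.
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def discriminative_loss(embeddings, labels, delta_v, delta_d,
                        alpha=1.0, beta=1.0, gamma=0.001):
    # Group vectors by element and compute each center (mean vector).
    grouped = {}
    for vec, lab in zip(embeddings, labels):
        grouped.setdefault(lab, []).append(vec)
    centers = {lab: [sum(col) / len(vecs) for col in zip(*vecs)]
               for lab, vecs in grouped.items()}

    # L_var: hinged distance of every vector to its own center.
    l_var = 0.0
    for lab, vecs in grouped.items():
        mu = centers[lab]
        l_var += sum(max(0.0, l2(mu, v) - delta_v) ** 2 for v in vecs) / len(vecs)
    l_var /= len(centers)

    # L_dist: hinged distance between every pair of distinct centers.
    labs = list(centers)
    l_dist = 0.0
    if len(labs) > 1:
        for a in labs:
            for b in labs:
                if a != b:
                    l_dist += max(0.0, 2 * delta_d - l2(centers[a], centers[b])) ** 2
        l_dist /= len(labs) * (len(labs) - 1)

    # L_reg: mean norm of the centers, pulling them towards the origin.
    l_reg = sum(math.sqrt(sum(x * x for x in mu))
                for mu in centers.values()) / len(centers)

    return alpha * l_var + beta * l_dist + gamma * l_reg
```

With two tight, well-separated clusters, L_var and L_dist are zero and only the small regularization term remains, matching the behaviour described above.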
Using for example the above defined functions, it is possible to calculate the loss of step E02. If this loss is equal to zero then it is considered that the neural network is trained. Alternatively, it can be considered that the neural network is trained when the loss is below a predefined threshold.
If the loss is above zero (or above a predefined threshold), then the training is not completed. Step E03 is then performed; in this step, the parameters (or weights) of the neural network are adjusted using the loss calculated in step E02.
Step E03 can be performed for example using the method known to the skilled person as stochastic gradient descent.
Then, the next iteration is carried out (step E04) which consists in at least performing steps E01 and E02 with the adjusted neural network.
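The iterate-until-target loop of steps E01 to E04 can be sketched generically as below. A finite-difference gradient stands in for backpropagation, and all names are illustrative assumptions for this example.

```python
# Generic sketch of the training loop: compute the loss, stop if it has
# reached the target value, otherwise adjust the parameters by gradient
# descent and iterate. Finite differences replace backpropagation here.

def train_until_target(loss_fn, params, target=0.0, lr=0.1,
                       eps=1e-6, max_iters=1000):
    for _ in range(max_iters):
        loss = loss_fn(params)
        if loss <= target:          # training is considered complete
            break
        grad = []
        for k in range(len(params)):
            bumped = list(params)
            bumped[k] += eps        # numerical gradient of the loss
            grad.append((loss_fn(bumped) - loss) / eps)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params, loss_fn(params)
```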
It should be noted that training the neural network can be performed on a plurality of different template images.

Once the neural network has been trained using the method disclosed on figure 1, it can be used for semantic instance segmentation, as represented on figure 2.
The method of figure 2 is performed on an image referenced 3, for example an image which has been acquired by a camera. This image can comprise, for example, two partially overlapping cars and two partially overlapping pedestrians.
In step Ell, the image 3 is inputted to the trained neural network so as to perform semantic instance segmentation.
Vectors 4 are obtained as the output of the trained neural network.
In order to represent the image 3 under a representation in which the different elements of the image are segmented and labelled (for example car number one in a first color, car number two in a second color, pedestrian one in a third color, and pedestrian two in a fourth color), a post-processing step E12 is carried out.
Because the neural network has been trained using the above- defined loss function, the vectors are close to being in separate hyperspheres. It should be noted that in most cases and when used on a real image (and not a template image), the loss is typically slightly above zero.
In order to facilitate the post-processing, a sub-step E120 is performed in which the k-means or mean-shift algorithm is used on the vectors so as to group together in clusters the pixels which should belong to the same object. This increases the robustness of the method.
Once the vectors (or pixels) have been grouped in separate clusters, it is possible to output an image with different colors for each cluster, and the post-processing is finished. This can be performed by thresholding around said centers with a radius which can be of δd.
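The grouping by thresholding around seeds can be sketched as below. Since the trained network pushes the vectors of a same object into a small hypersphere well separated from the others, a greedy pass that absorbs every unassigned vector within a given radius of a seed is enough on such outputs; the function name and inputs are assumptions for this example.

```python
# Illustrative post-processing sketch: greedy thresholding. Each
# unassigned vector seeds a cluster that absorbs all remaining vectors
# within `radius` of it (e.g. radius = delta_d on well-separated output).
import math

def cluster_by_threshold(vectors, radius):
    labels = [None] * len(vectors)
    next_label = 0
    for i, seed in enumerate(vectors):
        if labels[i] is not None:
            continue
        labels[i] = next_label
        for j in range(i + 1, len(vectors)):
            if labels[j] is None and math.dist(seed, vectors[j]) <= radius:
                labels[j] = next_label
        next_label += 1
    return labels
```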
A final image 5 with semantic instance segmentation is outputted.
The steps of the methods described in reference to figures 1 and 2 can be determined by computer instructions. These instructions can be executed on a processor of a computer, as represented on figure 3.
On this figure, a system for training a neural network SI has been represented. The system SI, which can be a computer, comprises a processor PR1 and a non-volatile memory MEM1. In the non-volatile memory MEM1, a set of instructions INST1 is stored. The set of instructions INST1 comprises instructions to perform a method for training a neural network to perform semantic instance segmentation, for example the method described in reference to figure 1.
The non-volatile memory MEM1 further comprises a neural network NN and at least one template image TIMG.
Once trained, the neural network NN can be used in a separate system S2 configured to perform semantic instance segmentation.
By way of example, the neural network NN can be communicated to the system S2 using a communication network INT, for example the Internet.
The system S2 comprises a processor PR2 and a non-volatile memory MEM2 in which a set of instructions INST2 is stored to perform semantic instance segmentation using an image IMG stored in the non- volatile memory MEM2 and the trained neural network TNN also stored in the non-volatile memory MEM2.
Figure 4 is a schematic representation of the vectors outputted by a neural network.
In this example, the neural network which has been used is not fully trained and the loss is not equal to zero or below a predefined threshold. Also, in this example and for the sake of simplicity, the neural network outputs vectors having a length of 2, which allows using a two-dimensional representation.
The various vectors outputted by the neural network are represented as dots 10, 20, and 30 each associated with a pixel of the template image inputted to the neural network. Each pixel of the template image has a known association with an element visible on the template image. Thus, the same is also true for the vectors outputted by the neural network.
On figure 4, the vectors referenced 10 are all associated with a first object, the vectors referenced 20 are all associated with a second object, and the vectors referenced 30 are all associated with a third object. Even if the training of the neural network is still being performed, the vectors 10, 20 and 30 already substantially form clusters of vectors respectively referenced C1, C2, and C3. It is then possible to determine the centers 11, 12, and 13, respectively of cluster C1, cluster C2, and cluster C3.
These centers are used to calculate the distances between each vector 10 and the center 11, the distances between each vector 20 and the center 12, and the distances between each vector 30 and the center 13.
Additionally, the distances between all the centers 11, 12 and 13 are computed.
The loss function is defined so that the vectors 10 get closer (after each iteration of the training) to the center 11, and more precisely so that the vectors 10 are within a distance from the center which is less than a first predefined distance threshold δν. The vectors 10 are expected to all be inside the circle having a radius δν and a center 11 represented on the figure.
Similarly, the vectors 20 are expected to all be inside the circle having a radius δν and a center 12 represented on the figure, and the vectors 30 are expected to all be inside the circle having a radius δν and a center 13 represented on the figure.
The movements of the vectors towards their center have been represented using thin arrows on the figure.
The loss function is further defined so that the centers 11, 12 and 13 get further away from each other (after each iteration of the training), and more precisely so that the centers are each separated by a second predefined distance threshold equal to 2δd.
The circles of radius δd and centers 11, 12 and 13 are also represented on the figure. The movements of the vectors carried out to move the clusters away from each other are represented using thick arrows on the figure.
Figure 5 illustrates the training of a neural network through various representations.
A template image 100 is represented on the figure, and this image 100 is a photograph of a plant having a variety of leaves and a background to be segmented.
Each pixel in the template image 100 has a known association with a specific leaf and it is possible to represent the image 100 as the final segmented image 200 shown below the template image 100. The skilled person may refer to the segmented image 200 as the "ground truth".
The row referenced 300 on figure 5 represents the positions of the vectors outputted by the neural network (in this example, the output of the network is in two dimensions) at seven different stages of the training of the neural network in consecutive order from left to right. This training is done by using the stochastic gradient descent method to adjust the neural network after each training iteration.
The seven different stages represented in the row 300 correspond respectively to 0, 2, 4, 8, 16, 32, and 64 adjustments to the neural network using the stochastic gradient descent method.
As can be seen in the last stage, the vectors are all placed in non-overlapping circles. These circles have a radius equal to δd described in reference to figure 4.
The row referenced 400 represents the output of the neural network without a post-processing step. The images of this row are obtained by taking the output of the neural network, which delivers vectors having two dimensions, and using each component of each vector respectively as a red value and as a green value, the blue value being set at zero (the figure is in greyscale).

The row referenced 500 represents the result of a post-processing step in which a thresholding is performed with a radius equal to δd.
Figure 6 illustrates the effect of inputting the coordinates of pixels to the neural network.
On this figure, three different input images are used in which two similar elements (a square) are placed at an upper left position and at a lower right position. The two squares are spaced differently in the three input images.
On the figure, the wording "location awareness" refers to the inputting of the coordinates of each pixel to the neural network.
The output of the neural network (vectors and corresponding image) is then shown for the two cases which are with and without location awareness.
It can be seen that, without location awareness, the neural network has difficulties differentiating the two squares when they are respectively close to the upper left corner and the lower right corner. However, by inputting the coordinates of the pixels of the image to the neural network, the neural network is always able to differentiate the two squares.
The above described embodiments allow obtaining a neural network which can be used for semantic instance segmentation with good results.
Very few mistakes are seen on the template images at the end of the training, because the loss function can reach a value of zero which truly indicates that the training on a template image is complete.
Using at least the metric known to the skilled person as Symmetric Best Dice (SBD, disclosed in document "Leaf segmentation in plant phenotyping: a collation study", H. Scharr et al., Machine Vision and Applications, 27(4):585-606), it is possible to obtain an SBD score of 84.2.
The neural networks obtained using the above embodiments therefore provide good results.

Claims

1. A method for training iteratively a neural network to be used for semantic instance segmentation, wherein, for each iteration, the neural network outputs a vector (10, 20, 30) for each pixel of a template image, wherein the template image comprises predefined elements each associated with pixels of the template image and the corresponding vectors,
characterized in that training the neural network is performed using a loss function (L) in which:
for each vector belonging to an element, the distance between the vector and a center (11, 12, 13) of the vectors of this element is calculated,
the distances between all the centers of the vectors of each element are calculated,
so that the loss function decreases until reaching a target value at least when:
- for each vector belonging to an element, the distance between the vector and a center of the vectors of this element decreases, and
- the distances between all the centers of the vectors of each element increase.
2. The method according to claim 1, wherein the loss function decreases until reaching the target value at least when, for each vector belonging to an element, the distance between the vector and a center of the vectors of this element decreases until this distance is less than or equal to a first predefined distance threshold (δv).
3. The method according to claim 1 or 2, wherein the loss function decreases until reaching the target value at least when the distances between all the centers of the vectors of each element increase until each of the distances is greater than or equal to a second predefined distance threshold (δd).
4. The method according to any one of claims 1 to 3, wherein the loss function is:

L = α · Lvar + β · Ldist

with:

$$L_{var} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c}\left[\lVert \mu_c - x_i \rVert - \delta_v\right]_+^2$$

$$L_{dist} = \frac{1}{C(C-1)}\sum_{c_A=1}^{C}\;\sum_{\substack{c_B=1\\ c_B \neq c_A}}^{C}\left[2\delta_d - \lVert \mu_{c_A} - \mu_{c_B} \rVert\right]_+^2$$

α a predefined constant,
β a predefined constant,
δv the first predetermined distance threshold,
δd being half the second predetermined distance threshold,
xi a vector of index i,
[x]+ the positive part of x,
||x1 - x2|| the distance between vector x1 and vector x2,
C the number of elements in the template image,
Nc the number of vectors in element c,
μc the center of element c, and
cA and cB elements A and B.
5. The method according to any one of claims 1 to 4, wherein the loss function is further defined so as to decrease until reaching the target value at least when the distances between each center of the vectors of each element and the origin of the space of the vectors decrease.
6. The method according to the combination of claims 4 and 5, wherein the loss function comprises an additional term and is:

L = α · Lvar + β · Ldist + γ · Lreg

With:

$$L_{reg} = \frac{1}{C}\sum_{c=1}^{C}\lVert \mu_c \rVert$$

γ being a predefined constant.
7. The method according to any one of claims 1 to 6, wherein, for each pixel of an image inputted to the neural network, the coordinates of this pixel are inputted to the neural network.
8. A method for semantic instance segmentation comprising using the neural network trained using the method pursuant to any one of claims 1 to 7 on an image.
9. The method of claim 8, further comprising a post-processing step in which the mean-shift algorithm or the k-means algorithm is applied to the vectors outputted by the neural network.
10. A system for training iteratively a neural network to be used for semantic instance segmentation, wherein, for each iteration, the neural network (NN) is configured to output a vector for each pixel of a template image (TIMG),
wherein the template image comprises predefined elements each associated with pixels of the template image and the corresponding vectors,
characterized in that the system comprises a module (PR1, INST1) for calculating a loss using a loss function in which:
for each vector belonging to an element, the distance between the vector and a center of the vectors of this element is calculated,
the distances between all the centers of the vectors of each element are calculated,
so that the loss function decreases until reaching a target value at least when:
- for each vector belonging to an element, the distance between the vector and a center of the vectors of this element decreases, and
- the distances between all the centers of the vectors of each element increase.
11. A system for image semantic instance segmentation comprising the neural network trained using the method pursuant to any one of claims 1 to 7.
12. A computer program including instructions for executing the steps of a method according to any one of claims 1 to 9 when said program is executed by a computer.
13. A recording medium readable by a computer and having recorded thereon a computer program including instructions for executing the steps of a method according to any one of claims 1 to 9.
PCT/EP2017/068550 2017-07-21 2017-07-21 Method and system for training a neural network to be used for semantic instance segmentation WO2019015785A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2017/068550 WO2019015785A1 (en) 2017-07-21 2017-07-21 Method and system for training a neural network to be used for semantic instance segmentation
JP2020502990A JP6989688B2 (en) 2017-07-21 2017-07-21 Methods and systems for training neural networks used for semantic instance segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2017/068550 WO2019015785A1 (en) 2017-07-21 2017-07-21 Method and system for training a neural network to be used for semantic instance segmentation

Publications (1)

Publication Number Publication Date
WO2019015785A1 true WO2019015785A1 (en) 2019-01-24

Family

ID=59581854

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2017/068550 WO2019015785A1 (en) 2017-07-21 2017-07-21 Method and system for training a neural network to be used for semantic instance segmentation

Country Status (2)

Country Link
JP (1) JP6989688B2 (en)
WO (1) WO2019015785A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7430815B2 (en) * 2020-09-29 2024-02-13 富士フイルム株式会社 Information processing device, information processing method, and information processing program
CN112560496B (en) * 2020-12-09 2024-02-02 北京百度网讯科技有限公司 Training method and device of semantic analysis model, electronic equipment and storage medium

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
JP3492991B2 (en) * 2000-09-13 2004-02-03 株式会社東芝 Image processing apparatus, image processing method, and recording medium
JP4799105B2 (en) * 2005-09-26 2011-10-26 キヤノン株式会社 Information processing apparatus and control method therefor, computer program, and storage medium
US10043112B2 (en) * 2014-03-07 2018-08-07 Qualcomm Incorporated Photo management

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
US20080292194A1 (en) * 2005-04-27 2008-11-27 Mark Schmidt Method and System for Automatic Detection and Segmentation of Tumors and Associated Edema (Swelling) in Magnetic Resonance (Mri) Images
US9704257B1 (en) * 2016-03-25 2017-07-11 Mitsubishi Electric Research Laboratories, Inc. System and method for semantic segmentation using Gaussian random field network
CN106897390A (en) * 2017-01-24 2017-06-27 北京大学 Target precise search method based on depth measure study

Non-Patent Citations (6)

Title
Bert De Brabandere et al.: "Semantic Instance Segmentation for Autonomous Driving", 19 May 2017 (2017-05-19), XP055462033, retrieved from the Internet <URL:http://juxi.net/workshop/deep-learning-robotic-vision-cvpr-2017/papers/16.pdf> [retrieved on 20180322] *
Fathi et al.: "Semantic Instance Segmentation via Deep Metric Learning", retrieved from the Internet <URL:https://arxiv.org/pdf/1703.10277.pdf>
H. Scharr et al.: "Leaf segmentation in plant phenotyping: a collation study", Machine Vision and Applications, vol. 27, no. 4, pages 585-606, XP035857192, DOI: 10.1007/s00138-015-0737-3
J. Long et al.: "Fully convolutional networks for semantic segmentation", CVPR, 2015
Kilian Q. Weinberger et al.: "Distance Metric Learning for Large Margin Nearest Neighbor Classification", Journal of Machine Learning Research, MIT Press, Cambridge, MA, US, vol. 10, 2 June 2009 (2009-06-02), pages 207-244, XP058264216, ISSN: 1532-4435 *
V. Badrinarayanan et al.: "A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation", vol. 2, 2015

Cited By (22)

Publication number Priority date Publication date Assignee Title
US20210080590A1 (en) * 2018-08-03 2021-03-18 GM Global Technology Operations LLC Conflict resolver for a lidar data segmentation system of an autonomous vehicle
US11915427B2 (en) * 2018-08-03 2024-02-27 GM Global Technology Operations LLC Conflict resolver for a lidar data segmentation system of an autonomous vehicle
US11562171B2 (en) 2018-12-21 2023-01-24 Osaro Instance segmentation by instance label factorization
CN111507343B (en) * 2019-01-30 2021-05-18 广州市百果园信息技术有限公司 Training of semantic segmentation network and image processing method and device thereof
CN111507343A (en) * 2019-01-30 2020-08-07 广州市百果园信息技术有限公司 Training of semantic segmentation network and image processing method and device thereof
KR102510745B1 (en) 2019-02-25 2023-03-15 텐센트 테크놀로지(센젠) 컴퍼니 리미티드 Point cloud segmentation method, computer readable storage medium and computer device
KR20210074353A (en) * 2019-02-25 2021-06-21 텐센트 테크놀로지(센젠) 컴퍼니 리미티드 Point cloud segmentation method, computer readable storage medium and computer device
CN110766281B (en) * 2019-09-20 2022-04-26 国网宁夏电力有限公司电力科学研究院 Transmission conductor wind damage early warning method and terminal based on deep learning
CN110766281A (en) * 2019-09-20 2020-02-07 国网宁夏电力有限公司电力科学研究院 Transmission conductor wind damage early warning method and terminal based on deep learning
CN110751659A (en) * 2019-09-27 2020-02-04 北京小米移动软件有限公司 Image segmentation method and device, terminal and storage medium
CN110751659B (en) * 2019-09-27 2022-06-10 北京小米移动软件有限公司 Image segmentation method and device, terminal and storage medium
CN110765916A (en) * 2019-10-17 2020-02-07 北京中科原动力科技有限公司 Farmland seedling ridge identification method and system based on semantics and example segmentation
CN110765916B (en) * 2019-10-17 2022-08-30 北京中科原动力科技有限公司 Farmland seedling ridge identification method and system based on semantics and example segmentation
CN111028195A (en) * 2019-10-24 2020-04-17 西安电子科技大学 Example segmentation based redirected image quality information processing method and system
CN111210452A (en) * 2019-12-30 2020-05-29 西南交通大学 Certificate photo portrait segmentation method based on graph segmentation and mean shift
CN111210452B (en) * 2019-12-30 2023-04-07 西南交通大学 Certificate photo portrait segmentation method based on graph segmentation and mean shift
CN111709293B (en) * 2020-05-18 2023-10-03 杭州电子科技大学 Chemical structural formula segmentation method based on Resunet neural network
CN111709293A (en) * 2020-05-18 2020-09-25 杭州电子科技大学 Chemical structural formula segmentation method based on Resunet neural network
CN111967373B (en) * 2020-08-14 2021-03-30 东南大学 Self-adaptive enhanced fusion real-time instance segmentation method based on camera and laser radar
CN111967373A (en) * 2020-08-14 2020-11-20 东南大学 Self-adaptive enhanced fusion real-time instance segmentation method based on camera and laser radar
CN113673505A (en) * 2021-06-29 2021-11-19 北京旷视科技有限公司 Example segmentation model training method, device and system and storage medium
CN114529191A (en) * 2022-02-16 2022-05-24 支付宝(杭州)信息技术有限公司 Method and apparatus for risk identification

Also Published As

Publication number Publication date
JP6989688B2 (en) 2022-01-05
JP2020527812A (en) 2020-09-10

Similar Documents

Publication Publication Date Title
WO2019015785A1 (en) Method and system for training a neural network to be used for semantic instance segmentation
Neamah et al. Discriminative features mining for offline handwritten signature verification
CN108197644A (en) A kind of image-recognizing method and device
CN109740606B (en) Image identification method and device
US8285056B2 (en) Method and apparatus for computing degree of matching
CN104866868A (en) Metal coin identification method based on deep neural network and apparatus thereof
JP2014041476A (en) Image processing apparatus, image processing method, and program
CN106408037A (en) Image recognition method and apparatus
JP6107531B2 (en) Feature extraction program and information processing apparatus
US11417129B2 (en) Object identification image device, method, and computer program product
JP2010067252A (en) Object region extraction device and object region extraction program
KR102166117B1 (en) Semantic matchaing apparatus and method
JP5430243B2 (en) Image search apparatus, control method therefor, and program
CN111160142B (en) Certificate bill positioning detection method based on numerical prediction regression model
JPWO2015068417A1 (en) Image collation system, image collation method and program
Omarov et al. Machine learning based pattern recognition and classification framework development
Perwej et al. The Kingdom of Saudi Arabia Vehicle License Plate Recognition using Learning Vector Quantization Artificial Neural Network
KR20190134380A (en) A Method of Association Learning for Domain Invariant Human Classifier with Convolutional Neural Networks and the method thereof
Kurlin et al. A persistence-based approach to automatic detection of line segments in images
CN114170465A (en) Attention mechanism-based 3D point cloud classification method, terminal device and storage medium
Patil et al. Deep learning-based approach for indian license plate recognition using optical character recognition
Mitra et al. Machine learning approach for signature recognition by harris and surf features detector
CN105868789B (en) A kind of target detection method estimated based on image-region cohesion
Khan et al. A new feedback-based method for parameter adaptation in image processing routines
Sun et al. An edge detection method based on adjacent dispersion

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 17751279
    Country of ref document: EP
    Kind code of ref document: A1
ENP Entry into the national phase
    Ref document number: 2020502990
    Country of ref document: JP
    Kind code of ref document: A
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 17751279
    Country of ref document: EP
    Kind code of ref document: A1