EP4241237A1 - Device and method for improving the determining of a depth map, a relative pose, or a semantic segmentation - Google Patents

Device and method for improving the determining of a depth map, a relative pose, or a semantic segmentation

Info

Publication number
EP4241237A1
Authority
EP
European Patent Office
Prior art keywords
image
synthesized image
discriminator
neural network
mask
Prior art date
Legal status
Pending
Application number
EP20807341.1A
Other languages
German (de)
French (fr)
Inventor
Onay URFALIOGLU
Akhil GURRAM
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of EP4241237A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/12Edge-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06T5/60
    • G06T5/77
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle

Abstract

The present disclosure relates to the field of advanced driver assistance systems (ADAS), computer vision and machine learning (ML). The present disclosure provides an ML based way to train a neural network based on a synthesized image (which is generated based on a real image or a virtual image) to improve determining of a depth map, a relative pose or a semantic segmentation. The present disclosure therefore provides a device (100) for determining a depth map (101), a relative pose (102), or a semantic segmentation (103). The device (100) comprises a neural network (104) configured to, in an inference phase, determine the depth map (101), the relative pose (102), or the semantic segmentation (103) based on an input image (105); and a generator (106) configured to, in a training phase, generate a synthesized image (107) based on a real image (108) or a virtual image (109), and based on a loss function (110), and train the neural network (104) based on the synthesized image (107); wherein the loss function (110) comprises a semantic edge function (111).

Description

DEVICE AND METHOD FOR IMPROVING THE DETERMINING OF A DEPTH
MAP, A RELATIVE POSE, OR A SEMANTIC SEGMENTATION
TECHNICAL FIELD
The present disclosure relates to the field of advanced driver assistance systems (ADAS), computer vision and machine learning (ML). The present disclosure provides an ML based way to train a neural network based on a synthesized image (which is generated based on a real image or a virtual image) to improve determining of a depth map, a relative pose or a semantic segmentation (which can e.g. be used for training an autopilot of a self-driving vehicle). Moreover, the present disclosure relates to domain adaptation, i.e. to solving computer vision problems by training a neural network on virtual images and testing on real images.
BACKGROUND
Estimation of a relative pose, a depth map or a semantic segmentation based on sensor input is an important task for a robot, an ADAS or a self-driving system. Sensor input used for said estimation e.g. comes from a visual sensor (e.g. an input image taken by a digital camera). The information (depth map, semantic segmentation or relative pose), which is estimated based on the input image, can be used (together with the input image itself) for further training of the robot, the ADAS or the self-driving system. For example, an auto-pilot of a vehicle can be trained based on the input image and based on the corresponding relative pose, depth map or semantic segmentation.
With the fast development of neural networks to solve computer vision problems, estimation techniques such as stereo matching or self-supervised deep learning methods have made progress, but they require a large quantity of real high-quality stereo images or real sequential images as input images for training the estimation. Even though real stereo images or real sequential images are easy to produce, it is difficult to create an accurate depth map, relative pose or semantic segmentation (which could be used as ground truths for further training) corresponding to said real images.
Instead, creating virtual images (which are not taken by a camera, but e.g. are generated by a computer) with corresponding accurate information regarding depth map, relative pose or semantic segmentation is feasible. In this way, a large quantity of training data (e.g. for training the robot, the ADAS or the self-driving system) can be created. However, a domain gap between virtual images and real images can be noticed, e.g. because an image texture or a color intensity of a virtual image is not as good as in a real image. This domain gap also decreases the quality of the estimation of a depth map, a relative pose or a semantic segmentation based on an input image, if the neural network which performs the estimation is only trained based on the virtual images.
Conventional approaches to solve this problem are learning domain-invariant features or a domain-invariant representation using deep neural networks, or pushing two domain distributions to be close to each other. However, these approaches deliver ineffective results. That is, a domain gap between a virtual image and a real image is not effectively reduced by the conventional approaches.
SUMMARY
In view of the above-mentioned problem, an objective of embodiments of the present disclosure is to improve domain adaptation between different domains of images, such as virtual images and real images.
This or other objectives may be achieved by embodiments of the present disclosure as described in the enclosed independent claims. Advantageous implementations of embodiments of the present disclosure are further defined in the dependent claims.
A first aspect of the present disclosure provides a device for determining a depth map, a relative pose, or a semantic segmentation, wherein the device comprises a neural network configured to, in an inference phase, determine the depth map, the relative pose, or the semantic segmentation based on an input image; and a generator configured to, in a training phase, generate a synthesized image based on a real image or a virtual image, and based on a loss function, and train the neural network based on the synthesized image; wherein the loss function comprises a semantic edge function.
This ensures that a domain gap between virtual images and real images can be effectively reduced. Thus, a dependency on creating real images for training the neural network can be avoided and virtual images can be used, for which labels (e.g. indicating a segmentation, a pose or a depth) can be created automatically. The semantic edge function in particular ensures that the domain gap is effectively reduced.
In particular, the input image can be a real image or a virtual image, which is input to the device. In particular, the real image or the virtual image is input to the device for generating training data for an autopilot of a vehicle.
In particular, the real image is a 2-dimensional picture (e.g. an RGB or a chrominance-luminance picture) acquired with a stereo camera in a real environment. The real image is e.g. acquired from at least one of the KITTI, CITYSCAPES or AEV datasets.
In particular, the virtual image is a 2-dimensional picture (e.g. an RGB picture or a chrominance-luminance picture) acquired with a stereo camera in a virtual environment such as Carla, or a photorealistic dataset.
In particular, the synthesized image is a 2-dimensional picture (e.g. an RGB picture or a chrominance-luminance picture) generated by a generative adversarial network (GAN) based on a real image or a virtual image.
In particular, a depth map is a 2D image or a matrix, in which each pixel or element depicts a depth of a corresponding 3D point in a scene with respect to a camera. In particular, a scene is a predefined region of interest of a real world, captured by a camera.
In particular, the disparity d is the difference in the x-coordinates of the projections of a 3D point in a scene into the left and right images of a stereo camera pair, such that corresponding pixels satisfy I_left(x + d) = I_right(x).
In particular, a disparity map is a 2D image or matrix where each pixel or element depicts the disparity of that pixel or element.
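For illustration only, the following minimal NumPy sketch converts such a disparity map into a depth map using the standard rectified-stereo relation depth = f · B / d (focal length f in pixels, baseline B in metres). This relation and the example camera parameters (roughly KITTI-like) are standard stereo geometry and not values taken from the present disclosure.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m, eps=1e-6):
    """Convert a disparity map (in pixels) to a depth map (in metres).

    Uses the standard rectified-stereo relation depth = f * B / d.
    Pixels with (near-)zero disparity are marked invalid (depth = 0).
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    valid = disparity > eps
    depth = np.zeros_like(disparity)
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Example: a 2x2 disparity map with approximately KITTI-like camera parameters.
d = np.array([[30.0, 15.0], [0.0, 60.0]])
print(disparity_to_depth(d, focal_length_px=721.0, baseline_m=0.54))
```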
In particular, a relative pose is a 6D vector comprising 3D location coordinates (e.g. x, y, z) and 3 angles for an orientation of a vehicle (e.g. yaw, pitch, roll). In particular, a semantic segmentation comprises a classification, for each pixel of the input image, of what kind of object it depicts (e.g. at least one of a car, vegetation, a building, the sky, a road).
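As an illustration of the 6D pose representation, the sketch below assembles a homogeneous transformation matrix from such a vector. The Z-Y-X (yaw-pitch-roll) rotation order used here is an assumption for illustration; the disclosure does not fix a particular angle convention.

```python
import numpy as np

def pose_to_matrix(x, y, z, yaw, pitch, roll):
    """Turn a 6D relative pose (translation plus yaw/pitch/roll in radians) into a
    4x4 homogeneous transformation matrix. The Z-Y-X rotation order is an
    illustrative assumption, not fixed by the disclosure."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])   # yaw about z
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])   # pitch about y
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])   # roll about x
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = [x, y, z]
    return T
```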
In particular, a semantic edge function considers at least one edge in the real image or the virtual image. In particular, an edge comprises a border of an object in an image, wherein the border comprises a significant contrast change.
In an implementation form of the first aspect, the semantic edge function is configured to maintain semantic gradient information and/or edge information in the synthesized image.
This ensures that the domain gap between real images and synthesized images is reduced based on the semantic gradient information and/or edge information in the synthesized image.
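The disclosure does not spell out the exact form of the semantic edge function 111. The following PyTorch sketch shows one plausible realisation, which compares Sobel gradient fields of the original and the synthesized image and optionally re-weights the error on semantic edge pixels; the function names and the weighting scheme (in particular the edge_mask argument) are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

# Fixed Sobel kernels for horizontal / vertical image gradients.
_SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_SOBEL_Y = _SOBEL_X.transpose(2, 3)

def image_gradients(img):
    """Per-channel Sobel gradients of a (B, C, H, W) image tensor."""
    b, c, h, w = img.shape
    flat = img.reshape(b * c, 1, h, w)
    gx = F.conv2d(flat, _SOBEL_X.to(img), padding=1)
    gy = F.conv2d(flat, _SOBEL_Y.to(img), padding=1)
    return gx.reshape(b, c, h, w), gy.reshape(b, c, h, w)

def semantic_edge_loss(synthesized, original, edge_mask=None):
    """L1 difference between the gradient fields of the synthesized and the
    original image; if an edge mask (1 on semantic edges, 0 elsewhere) is given,
    the error is emphasised on those edge pixels."""
    gx_s, gy_s = image_gradients(synthesized)
    gx_o, gy_o = image_gradients(original)
    diff = (gx_s - gx_o).abs() + (gy_s - gy_o).abs()
    if edge_mask is not None:          # (B, 1, H, W), broadcast over channels
        diff = diff * (1.0 + edge_mask)
    return diff.mean()
```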
In a further implementation form of the first aspect, the device further comprises a first discriminator and a second discriminator, wherein the generator is further configured to, in the training phase, provide the synthesized image to the first discriminator or to the second discriminator, to train the neural network.
This ensures that a texture or color intensity in the synthesized image can be improved and the domain gap between the synthesized image and the real image can be reduced.
In particular, a texture is an area of an image which depicts content having significant variation in color intensities.
In a further implementation form of the first aspect, the generator is further configured to, in the training phase, train the neural network based on a determination result of the discriminator to which the synthesized image was provided.
This ensures that a texture or color intensity in the synthesized image can be further improved and the domain gap between the synthesized image and the real image can be reduced.

In a further implementation form of the first aspect, the generator is further configured to, in the training phase, randomly provide the synthesized image to the first discriminator or to the second discriminator.
This ensures that a texture or color intensity in the synthesized image can be further improved and the domain gap between the synthesized image and the real image can be reduced.
In a further implementation form of the first aspect, the first discriminator is further configured to, in the training phase, determine that a synthesized image that is generated by the generator based on a virtual image, is a fake, and determine that the virtual image is an original; and the second discriminator is further configured to, in the training phase, determine that a synthesized image that is generated by the generator based on a real image, is a fake, and determine that the real image is an original.
This ensures that a texture or color intensity in the synthesized image can be further improved and the domain gap between the synthesized image and the real image can be reduced.
In a further implementation form of the first aspect, the device is further configured to train the neural network, based on the synthesized image, for determining at least one of a depth map, a relative pose, a semantic segmentation.
This ensures that determining at least one of a depth map, a relative pose, a semantic segmentation can be improved based on the synthesized image.
In a further implementation form of the first aspect, the device is further configured to, in the training phase, generate a learnable mask based on the synthesized image; and train the neural network based on the learnable mask.
This ensures that the domain gap can also be reduced based on the learnable mask.
In particular, the learnable mask is a region of interest in the synthesized image. In particular, the learnable mask allows to determine if a pixel of the synthesized image can be used for view reconstruction or not.

In a further implementation form of the first aspect, the learnable mask is a semantic inlier mask.
This ensures that the domain gap can be reduced also based on the semantic inlier mask.
In particular, the semantic inlier mask is a region of interest in the synthesized image. In particular, the semantic inlier mask allows to determine if a pixel of the synthesized image can be used for view reconstruction or not. In particular, the semantic inlier mask comprises semantic segmentation information. The semantic segmentation information can be used for determining if a pixel of the synthesized image can be used for view reconstruction or not. In particular, the semantic inlier mask is generated based on semantic segmentation information by a neural network.
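A minimal sketch of how such a semantic inlier mask could be derived from predicted segmentation logits is given below. The concrete set of excluded classes (sky and potentially moving objects) and the CITYSCAPES-like class indices are assumptions for illustration, not taken from the disclosure.

```python
import torch

# Hypothetical class indices of a CITYSCAPES-like label set; which classes are
# excluded from view reconstruction is an assumption, not taken from the disclosure.
EXCLUDED_CLASSES = {10: "sky", 11: "person", 12: "rider", 13: "car",
                    17: "motorcycle", 18: "bicycle"}

def semantic_inlier_mask(segmentation_logits):
    """Build a binary (B, 1, H, W) mask that is 1 where a pixel may be used for
    view reconstruction and 0 where it belongs to an excluded class."""
    labels = segmentation_logits.argmax(dim=1, keepdim=True)   # (B, 1, H, W)
    mask = torch.ones_like(labels, dtype=torch.float32)
    for cls in EXCLUDED_CLASSES:
        mask[labels == cls] = 0.0
    return mask
```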
In a further implementation form of the first aspect, the device is further configured to train the neural network, based on the learnable mask, for determining at least one of a depth map, a relative pose, a semantic segmentation.
This ensures that determining at least one of a depth map, a relative pose, a semantic segmentation can be improved based on the learnable mask.
In particular, the neural network is trained, based on the semantic inlier mask, for determining a depth map.
In a further implementation form of the first aspect, the device is further configured to, in the training phase, determine segmentation information based on the synthesized image and generate the learnable mask based on the segmentation information.
This ensures that segmentation information can be considered by the learnable mask.
In a further implementation form of the first aspect, the device is further configured to, in the training phase, determine pose information based on the synthesized image and generate the learnable mask based on the pose information.
This ensures that pose information can be considered by the learnable mask.

In a further implementation form of the first aspect, the device is further configured to, in the training phase, determine an inlier mask based on the synthesized image and generate the learnable mask based on the inlier mask.
This ensures that an inlier mask can be considered by the learnable mask.
In particular, the inlier mask is a region of interest in the synthesized image. In particular, the inlier mask allows to determine if a pixel of the synthesized image can be used for view reconstruction or not.
In particular, the device is further configured to, in the training phase, if the synthesized image is generated based on the real image, apply self-supervised training to the neural network based on the synthesized image.
In particular, the device is further configured to, in the training phase, if the synthesized image is generated based on the virtual image, apply supervised training to the neural network based on the synthesized image and/or ground truth labels corresponding to the synthesized image.
In particular, the ground truth labels comprise a depth map, a relative pose, or a semantic segmentation.
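For illustration only, the following sketch shows how a training step might dispatch between the two regimes depending on the origin of the synthesized image. The loss callables are placeholders (e.g. a cross-entropy/L1 loss against ground-truth labels and a view reconstruction loss), and the interface is an assumption.

```python
def training_step(synthesized_image, origin, ground_truth=None,
                  supervised_loss=None, self_supervised_loss=None):
    """Dispatch between supervised and self-supervised training, depending on
    whether the synthesized image stems from a virtual or a real image.
    `supervised_loss` and `self_supervised_loss` are placeholder callables."""
    if origin == "virtual":
        # Virtual images come with accurate labels (depth map, pose, segmentation).
        return supervised_loss(synthesized_image, ground_truth)
    elif origin == "real":
        # No labels available: fall back to self-supervision (view reconstruction).
        return self_supervised_loss(synthesized_image)
    raise ValueError(f"unknown origin: {origin}")
```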
In particular, the device further comprises a third discriminator and a fourth discriminator, wherein the device is further configured to train the neural network based on a determination result of the third discriminator, and/or based on a determination result of the fourth discriminator.
In particular, the device is configured to train the neural network for determining a depth map based on the third discriminator.
In particular, the device is configured to train the neural network for determining a semantic segmentation based on the fourth discriminator.

A second aspect of the present disclosure provides a method for determining a depth map, a relative pose, or a semantic segmentation, the method comprising the steps of: in an inference phase, determining, by a neural network of a device, the depth map, the relative pose, or the semantic segmentation based on an input image; and in a training phase, generating, by a generator of the device, a synthesized image based on a real image or a virtual image, and based on a loss function, and training, by the generator, the neural network based on the synthesized image; wherein the loss function comprises a semantic edge function.
In an implementation form of the second aspect, the semantic edge function maintains semantic gradient information and/or edge information in the synthesized image.
In a further implementation form of the second aspect, the method further comprises, in the training phase, providing, by the generator, the synthesized image to a first discriminator of the device or to a second discriminator of the device, to train the neural network.
In a further implementation form of the second aspect, the method further comprises, in the training phase, training, by the generator, the neural network based on a determination result of the discriminator to which the synthesized image was provided.
In a further implementation form of the second aspect, the method further comprises, in the training phase, randomly providing, by the generator, the synthesized image to the first discriminator or to the second discriminator.
In a further implementation form of the second aspect, the method further comprises, in the training phase, determining, by the first discriminator, that a synthesized image that is generated by the generator based on a virtual image, is a fake, and determining, by the first discriminator, that the virtual image is an original; and, in the training phase, determining, by the second discriminator, that a synthesized image that is generated by the generator based on a real image, is a fake, and determining, by the second discriminator, that the real image is an original.
In a further implementation form of the second aspect, the method further includes training, by the device, the neural network, based on the synthesized image, for determining at least one of: a depth map, a relative pose, a semantic segmentation.

In a further implementation form of the second aspect, the method further includes, in the training phase, generating, by the device, a learnable mask based on the synthesized image; and training, by the device, the neural network based on the learnable mask.
In a further implementation form of the second aspect, the learnable mask is a semantic inlier mask.
In a further implementation form of the second aspect, the method further includes training, by the device, the neural network, based on the learnable mask, for determining at least one of: a depth map, a relative pose, a semantic segmentation.
In a further implementation form of the second aspect, the method further includes, in the training phase, determining, by the device, segmentation information based on the synthesized image and generating, by the device, the learnable mask based on the segmentation information.
In a further implementation form of the second aspect, the method further includes, in the training phase, determining, by the device, pose information based on the synthesized image and generating, by the device, the learnable mask based on the pose information.
In a further implementation form of the second aspect, the method further includes, in the training phase, determining, by the device, an inlier mask based on the synthesized image and generating, by the device, the learnable mask based on the inlier mask.
The second aspect and its implementation forms include the same advantages as the first aspect and its respective implementation forms.
A third aspect of the present disclosure provides a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method of the second aspect or any of its implementation forms.
The third aspect and its implementation forms include the same advantages as the second aspect and its respective implementation forms.

It has to be noted that all devices, elements, units and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application, as well as the functionalities described to be performed by the various entities, are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
BRIEF DESCRIPTION OF DRAWINGS
The above-described aspects and implementation forms of the present disclosure will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which
FIG. 1 shows a schematic view of a device according to an embodiment of the present disclosure;
FIG. 2 shows a schematic view of a device according to an embodiment of the present disclosure in more detail;
FIG. 3 shows a schematic view of an operating scenario according to the present disclosure;
FIG. 4 shows a schematic view of a depth map according to the present disclosure;
FIG. 5 shows a schematic view of an operating scenario according to the present disclosure;
FIG. 6 shows a schematic view of an operating scenario according to the present disclosure;
FIG. 7 shows a schematic view of an operating scenario according to the present disclosure;
FIG. 8 shows a schematic view of an operating scenario according to the present disclosure;
FIG. 9 shows a schematic view of a method according to an embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
FIG. 1 shows a schematic view of a device 100 according to an embodiment of the present disclosure. The device 100 is for determining a depth map 101, a relative pose 102, or a semantic segmentation 103 based on an input image 105. Moreover, said determining is improved by an ML based approach. To this end, the device 100 comprises a neural network 104 and a generator 106. The determining is in particular improved based on the neural network 104, which is trained based on a synthesized image 107.
Therefore, the device 100 distinguishes between an inference phase and a training phase. In the inference phase, the trained neural network 104 is applied to an input image 105 to determine the depth map 101, the relative pose 102, or the semantic segmentation 103 based on the input image 105. In the training phase, training data (i.e. a real image 108 or a virtual image 109) is provided to the neural network 104. Therefore, the generator 106 is configured to generate a synthesized image 107 based on a real image 108 or a virtual image 109. For the generation of the synthesized image 107, the generator also uses a loss function 110. The loss function 110 in particular comprises a semantic edge function 111. Once the synthesized image 107 is generated, the generator 106 trains the neural network 104 based on the synthesized image 107.
Optionally, the semantic edge function 111 can maintain semantic gradient information and/or edge information in the synthesized image 107.
In other words, the device 100 can use virtual images 109 along with ground truth labels (e.g. a depth map, a semantic segmentation or a relative pose) which are specifically generated for the virtual images 109 to train the neural network 104 and e.g. test it on real images 108.
Based on the real image 108 and the virtual image 109, which correspond to two different domains, a synthesized image 107 corresponding to an intermediate domain can be created. Training the neural network 104 based on a synthesized image 107 from the intermediate domain allows for producing a robust and accurate depth map 101, relative pose 102 or semantic segmentation 103, independent of domain-specific texture features or distribution of images.

The device 100 may comprise a processor or processing circuitry (not shown) configured to perform, conduct or initiate the various operations of the device 100 described herein. The processing circuitry may comprise hardware and/or the processing circuitry may be controlled by software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors.
The device 100 may further comprise memory circuitry, which stores one or more instruction(s) that can be executed by the processor or by the processing circuitry, in particular under control of the software. For instance, the memory circuitry may comprise a non-transitory storage medium storing executable software code which, when executed by the processor or the processing circuitry, causes the various operations of the device 100 to be performed.
In one embodiment, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the device 100 to perform, conduct or initiate the operations or methods described herein.
FIG. 2 shows a schematic view of a device 100 according to an embodiment of the present disclosure in more detail. The device 100 shown in FIG. 2 comprises all features and functionality of the device 100 of FIG. 1, as well as the following optional features:
As it is illustrated in FIG. 2, the device 100 optionally comprises a first discriminator 201 and a second discriminator 202. In the training phase, the generator 106 optionally can provide the synthesized image 107 to the first discriminator 201 or to the second discriminator 202, to train the neural network 104. The first discriminator 201 and the second discriminator 202 support the generator 106 in creating a synthesized image 107 with a texture similar to the domains of the virtual image 109 and the real image 108.
More specifically, the generator 106 can provide the synthesized image 107 to the first discriminator 201 or to the second discriminator 202 on a random basis. The generator 106 then trains the neural network 104 based on a determination result of the discriminator 201, 202 to which the synthesized image 107 was provided. In other words, the synthesized image 107 is produced by the generator 106 and by the first discriminator 201 and the second discriminator 202 based on an adversarial loss function (i.e. the loss function 110), which mainly considers universal common features, such as semantic edges and semantic information, patterns, scene structure, as well as artifacts such as texture, color filters, noise, lighting effects, shadows and reflections produced by a camera sensor for training the neural network 104.
Again in other words, for the loss function 110, e.g. instead of using mean square error (MSE) loss for reconstruction of input images 105, a semantic gradient or edge based reconstruction can be used (potentially in combination with a robust loss function). This supports the generator 106 to reconstruct the synthesized image 107 by keeping the scene structure and geometric properties and mix or create the texture, intensities, or quality between a virtual image 109 and a real image 108.
Optionally, the loss function 110 is a semantic edge function 111, the purpose of which can be to maintain semantic gradient information and/or semantic edge information to generate a synthesized image 107. Additionally, by using the first discriminator 201 and the second discriminator 202, the generator 106 can produce an image texture similar to the real image domain, without losing geometry structure.
The synthesized images 107, which are produced by the generator 106, the first discriminator 201 and the second discriminator 202, can be used for further training of the neural network 104, e.g. for improving determining a depth map 101, a relative pose 102 or a semantic segmentation 103 based on the synthesized image 107. This is in particular supported by a respective loss function 110, e.g. an L1/view reconstruction function, a cross-entropy function, or an L1 function.
Again in other words, the generator 106 and the discriminators 201, 202 generate a synthesized image 107 by considering that semantic edges in the synthesized image 107 should be the same as in the original input (that is, in the real image 108 or the virtual image 109), wherein the synthesized image 107 has a higher variation of texture. In particular, providing the synthesized image 107 to the first discriminator 201 or the second discriminator 202 on a random basis allows the generator 106 to generate a synthesized image 107 which has a higher variation of textures.
To achieve this effect, in the training phase the first discriminator 201 determines that a synthesized image 107, which is generated by the generator 106 based on a virtual image 109, is a fake, and determines that the virtual image 109 is an original. For the same purpose, in the training phase the second discriminator 202 determines that a synthesized image 107, which is generated by the generator 106 based on a real image 108, is a fake, and determines that the real image 108 is an original.
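As an illustration of these target assignments, the sketch below formulates the two discriminator objectives with a binary cross-entropy loss. The disclosure itself does not fix the adversarial loss at this point (a Wasserstein formulation is discussed in connection with FIG. 3), so the concrete loss and the function signatures are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def discriminator_losses(d1, d2, virtual_image, real_image,
                         synth_from_virtual, synth_from_real):
    """Illustrative objectives for the two discriminators:
    d1 learns: virtual image -> original (1), synthesized-from-virtual -> fake (0).
    d2 learns: real image    -> original (1), synthesized-from-real    -> fake (0).
    """
    ones = lambda x: torch.ones_like(x)
    zeros = lambda x: torch.zeros_like(x)

    logits_v = d1(virtual_image)
    logits_sv = d1(synth_from_virtual.detach())   # detach: do not update the generator here
    loss_d1 = (F.binary_cross_entropy_with_logits(logits_v, ones(logits_v))
               + F.binary_cross_entropy_with_logits(logits_sv, zeros(logits_sv)))

    logits_r = d2(real_image)
    logits_sr = d2(synth_from_real.detach())
    loss_d2 = (F.binary_cross_entropy_with_logits(logits_r, ones(logits_r))
               + F.binary_cross_entropy_with_logits(logits_sr, zeros(logits_sr)))
    return loss_d1, loss_d2
```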
As it is further illustrated in FIG. 2, in the training phase the device 100 optionally generates a learnable mask 203 based on the synthesized image 107. The learnable mask 203 indicates pixels in the synthesized image 107, which are important for further training the neural network 104. Thus, the device 100 optionally further trains the neural network 104 based on the learnable mask 203 and based on the synthesized image 107. Optionally, the learnable mask 203 can be used to train the neural network 104 to improve the determining of at least one of the depth map 101, the relative pose 102, the semantic segmentation 103.
Optionally, the learnable mask 203 is a semantic inlier mask. The semantic inlier mask indicates a region of interest in the synthesized image 107 by means of semantic segmentation information in the semantic inlier mask.
As it is further illustrated in FIG. 2, in the training phase the device 100 optionally determines segmentation information 204 based on the synthesized image 107. The segmentation information 204 allows to associate portions of the synthesized image 107 with labels, e.g. indicating that a predefined portion is at least one of a street, a wall, a tree, a traffic light, a sidewalk, a sky, a house. The device 100 then generates the learnable mask 203 based on the segmentation information 204 and based on the synthesized image 107.
As it is further illustrated in FIG. 2, in the training phase the device 100 optionally determines pose information 205 based on the synthesized image 107. The pose information 205 e.g. comprises location coordinates and angles for an orientation of a vehicle. The device 100 then generates the learnable mask 203 based on the pose information 205 and based on the synthesized image 107.
As it is further illustrated in FIG. 2, in the training phase the device 100 optionally determines an inlier mask 206 based on the synthesized image 107. In particular, estimated semantic segmentation information is incorporated within the inlier mask 206 for training a selfsupervised depth estimation model of the neural network 104. The device 100 then generates the learnable mask 203 based on the inlier mask 206 and based on the synthesized image 107.
In particular, training the neural network 104 based on the inlier mask 206 and based on the segmentation information 204 enables to improve the determining of a depth map 101 based on an input image 105 by the device 100. Two further discriminators (not shown in FIG. 2) can support shifting the real image domain and the corresponding distribution of estimated depth and semantic segmentation towards the virtual domain.
In other words, the device 100 enables to train the neural network 104 with the additional support of two discriminators 201, 202 to shift a domain from real to virtual based on estimated depth maps and semantic segmentation results of a synthesized image 107.
Thereby, the features disclosed in FIG. 1 and FIG. 2 can reduce the domain gap between virtual images 109 and real images 108 (or a virtual and a real dataset) based on a self-supervised approach to improve the determination of depth maps 101, semantic segmentation 103 and relative-pose estimation 102.
FIG. 3 shows a schematic view of a device 100 according to an embodiment of the present disclosure in even more detail. The device 100 shown in FIG. 3 comprises all features and functionality of the device 100 of FIG. 1 and FIG. 2, as well as the following optional features:
In view of FIG. 3, the generator 106 (which can also be called domain adaptation module or generator block) is now going to be described in more detail. The generator 106 can be configured to train a model (e.g. the neural network 104) to create novel domain images with common features (e.g. the synthesized images 107) based on the virtual image domain and the real image domains. An adversarial loss function (i.e. the loss function 110) can be used along with an image semantic edge based loss function and with two discriminator networks (i.e. the first discriminator 201 and the second discriminator 202), one for the virtual and the other one for real domain.
During the learning process, the first and the second discriminator 201, 202 are used. To reproduce an input image 105, a reconstruction loss is applied to the estimated common-domain RGB image output by the neural network 104. The input images 105 can come from the virtual or the real domain in a random pattern. An output of the generator 106 passes through a semantic gradient or edge based loss function (i.e. the loss function 110) and through one of the first or second discriminator 201, 202. The selection of the discriminator is completely random. The reason to choose only one discriminator is to compute the results of the generator 106 based on the judgement provided by the chosen discriminator, which helps to mix the texture information of the real image 108 and the virtual image 109 while safely preserving scene structure and edges. In addition, the randomness of choosing the discriminator or choosing the dataset (either a real or a virtual image) helps the neural network 104 avoid getting stuck in a local minimum; instead, the loss fluctuates, which can help it reach a global minimum.
In view of FIG. 3, the loss function 110 (which can also be called image gradient based loss function) is now going to be described in more detail. To preserve the image gradient, edges, or scene structure, a robust loss function 110 is provided, which is completely based on image semantic edges. Semantic edge based learning allows the generator 106 to produce images that maintain semantic gradient or edge information from the beginning of training and at the same time ensure a higher variation of texture on the images (i.e. the synthesized image 107) that are generated to train depth estimation (DE), semantic segmentation (SS) and relative pose estimation (RPE) models. Thereby, computer vision application models are trained by considering the semantic edges as a common important feature being used to train the models (DE, SS, RPE). The convolutions of the DE, SS and RPE models thus learn the computer vision tasks independently of varying texture, shadows, lighting effects, weather conditions, or color filtering.
In view of FIG. 3, the first discriminator 201 and the second discriminator 202 (which can also be part of the domain adaptation module) are now going to be described in more detail. For the first and/or the second discriminator 201, 202, a Wasserstein discriminator can be used, which uses an Earth-Mover's distance to minimize the discrepancy between the distributions for the virtual dataset (i.e. the virtual image 109) and the real dataset (i.e. the real images 108). Additionally, a gradient penalty can be applied by at least one of the discriminators 201, 202 to overcome the problem of vanishing or exploding gradients. The purpose of the first discriminator 201 can be to learn if the input of the neural network 104 is from a virtual dataset or not. During training based on the first discriminator 201, virtual images 109 are considered as true/real while the output of the generator 106 is always considered false/fake. The purpose of the second discriminator 202 is similar to that of the first discriminator 201, but it operates exactly the other way round (i.e. oppositely). The second discriminator 202 always considers real images 108 as true/real and the output of the generator 106 as false/fake.
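The gradient penalty mentioned above corresponds to the well-known WGAN-GP formulation, which penalises the critic's gradient norm on random interpolations between originals and fakes. A minimal PyTorch sketch is given below; the penalty weight of 10 is the commonly used default and not a value taken from the disclosure.

```python
import torch

def gradient_penalty(critic, originals, fakes, lambda_gp=10.0):
    """WGAN-GP style gradient penalty: penalise critic gradients whose norm
    deviates from 1 on random interpolations between originals and fakes."""
    b = originals.size(0)
    alpha = torch.rand(b, 1, 1, 1, device=originals.device)
    interp = (alpha * originals + (1.0 - alpha) * fakes).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interp,
                                create_graph=True, retain_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```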
When training the neural network 104, the output of the generator 106 is always considered as false/fake and randomly one of the discriminators 201, 202 is chosen to judge if the generated RGB (i.e. the synthesized image 107) is virtual or real. By confusing the generator 106 and the first and the second discriminator 201, 202, the generator 106 will produce images with mixed texture and features.
In a real world scenario, a depth map 101 is a representation of a 3D scene structure projected as a pattern depending on camera sensor parameters and semantic information. For processing in the device 100, semantic information can be purely the pattern of 3D scene, edges, and shapes of objects.
As illustrated in FIG. 3, the device 100 optionally may comprise a third discriminator 301 and a fourth discriminator 302. The third discriminator 301 may operate on an estimated depth map 101, while the fourth discriminator 302 may operate on an estimated semantic segmentation. The purpose of the third discriminator 301 and the fourth discriminator 302 is to further improve determining the depth map 101 and the semantic segmentation 103 based on input images 105.
FIG. 4 shows a portion of an input image 401 and two portions of depth maps 402, 403 which are determined based on the input image 401 by a device which has only been trained on virtual images 109. As can be seen when comparing sections 402 and 403 of FIG. 4, certain holes and irregularities are present in the estimated depth maps 402, 403 of real images (where the device has been trained on a virtual dataset alone). In such a case, the third and fourth discriminators 301, 302 help to improve the estimated depth map or semantic segmentation based on input texture (by training the device 100 on both a virtual and a real dataset). Thereby, the generator 106 is forced to produce better synthesized images 107 having common features of the virtual and real domain.
Turning back to FIG. 3, it is now described how the neural network 104 is trained based on the output of the generators for segmentation, depth net or pose and mask net, i.e. based on the segmentation information 204, depth information, the pose information 205 and the inlier mask 206. Instead of a supervised approach, which requires creating ground truths for improving the determination of the depth map 101, the relative pose 102 and the semantic segmentation 103, a self-supervised approach based on geometry and view reconstruction between images can be applied. These images can e.g. be stereo images, sequential images, or images from mapping data.
Using the segmentation information 204 supports creating an efficient learnable mask 203. In general, outlier regions such as occlusions, non-overlapping areas or areas around edges are difficult regions for the neural network 104 when a view reconstruction loss function is used to warp a network input into a stereo image, sequential image, or map-relative image. By using semantic edges as a loss function 110 and by further training based on the segmentation information 204, the neural network 104 can be trained to further learn and improve the inlier mask 206 and the learnable mask 203 for producing better depth maps 101 with the help of the view reconstruction loss.
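For illustration, the following PyTorch sketch computes a masked photometric view reconstruction loss for a rectified stereo pair: the source view is warped by the predicted disparity and the L1 error is averaged only over pixels accepted by the (learnable or semantic inlier) mask. The sign of the disparity shift and the exact masking scheme are assumptions for this sketch, not details taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def masked_view_reconstruction_loss(target, source, disparity, mask):
    """Photometric view reconstruction loss for a rectified stereo pair.
    target, source: (B, 3, H, W); disparity, mask: (B, 1, H, W), disparity in pixels."""
    b, _, h, w = target.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=target.device, dtype=target.dtype),
        torch.arange(w, device=target.device, dtype=target.dtype),
        indexing="ij")
    xs = xs.unsqueeze(0) + disparity.squeeze(1)      # shift sampling positions by disparity
    ys = ys.unsqueeze(0).expand(b, -1, -1)
    # Normalise sampling coordinates to [-1, 1] for grid_sample.
    grid = torch.stack((2.0 * xs / (w - 1) - 1.0, 2.0 * ys / (h - 1) - 1.0), dim=-1)
    warped = F.grid_sample(source, grid, mode="bilinear",
                           padding_mode="border", align_corners=True)
    photometric = (warped - target).abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)
    return (photometric * mask).sum() / mask.sum().clamp(min=1.0)
```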
In view of FIG. 5, FIG. 6 and FIG. 7, various ways of training the neural network 104 are now described.
FIG. 5 illustrates using virtual images 109 (i.e. a virtual dataset) for training with depth maps 101 and semantic segmentation 103, e.g. while using real video (i.e. sequential images) to run a self-supervised approach (not illustrated in FIG. 5). The main benefit of this approach is a continuous learning process. While the virtual dataset is always available for training, the real video is available on the fly to further improve the training. Thereby, depth or segmentation models can be enriched with precise virtual labels. In parallel, determination results of the device 100 are improved based on real images 108, taking the learning results on virtual images as a basis for using the third and fourth discriminators 301, 302.
FIG. 5 in particular shows the active components of the device 100 for training based on virtual images 109. As the virtual images 109 can produce labels such as depth maps 101 and semantic segmentation 103, they are used for training the neural network 104 with the loss function 110.

FIG. 6 illustrates a network architecture of the device 100 for training using sequential real images 108. In particular, the active components for training on real video (sequential real images 108) are shown. As training is based on real video, a self-supervised approach is used (based on a view reconstruction loss function) to compute depth maps 101 and a relative pose 102 (odometry), and additionally the estimated semantic segmentation 103 is used (forward propagation only) to improve the inlier mask 206 and/or the learnable mask 203. Moreover, only virtual images 109 (dataset) are used for training with depth maps 101 and semantic segmentation 103, whereas on the real images 108 the trained depth and segmentation models are forward passed, and the results estimated on real images 108 are adapted to be similar to the results estimated based on virtual images 109. This is supported by using the third and fourth discriminators 301, 302 to shift the domain from real to virtual only in the estimated depth and segmentation results.
FIG. 7 illustrates a network architecture of the device 100 for training the neural network 104 based on a single real image 108 without any loss functions except the discriminators, to train a model for depth maps and semantic segmentation. More specifically, the network architecture is trained based on real images 108 (dataset) with no depth-related loss function. In the illustrated example, the depth maps on real images adapt to produce better results based on the third and fourth discriminators 301, 302. The shown approach allows to use stereo virtual images 109 and stereo real images 108 without any ground truth labels, to estimate a depth map 101 by using a self-supervised approach. The shown approach also allows to use stereo virtual images 109 and sequential real images 108 without any ground truth labels, to estimate depth maps 101 by using a self-supervised approach.
FIG. 8 illustrates an inference phase (also called test phase) of the device 100. In the inference phase as illustrated, depth maps 101 and semantic segmentation 103 for a given input image 105 from the real domain are determined. The device 100 also allows to determine a relative pose 102 (which is not shown in FIG. 8).
FIG. 9 shows a schematic view of a method 900 according to an embodiment of the present disclosure. The method 900 is for determining a depth map 101, a relative pose 102, or a semantic segmentation 103. The method 900 comprises a step of, in an inference phase, determining 901, by a neural network 104, the depth map 101, the relative pose 102, or the semantic segmentation 103 based on an input image 105. The method 900 further comprises a step of, in a training phase, generating 902, by a generator 106, a synthesized image 107 based on a real image 108 or a virtual image 109, and based on a loss function 110. The method 900 further comprises a step of training 903, by the generator 106, the neural network 104 based on the synthesized image 107; wherein the loss function 110 comprises a semantic edge function 111.
The present disclosure has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art in practicing the claimed disclosure, from a study of the drawings, this disclosure, and the independent claims. In the claims as well as in the description, the word "comprising" does not exclude other elements or steps and the indefinite article "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. A device (100) for determining a depth map (101), a relative pose (102), or a semantic segmentation (103), wherein the device (100) comprises:
- a neural network (104) configured to, in an inference phase, determine the depth map (101), the relative pose (102), or the semantic segmentation (103) based on an input image (105); and
- a generator (106) configured to, in a training phase, generate a synthesized image (107) based on a real image (108) or a virtual image (109), and based on a loss function (110), and train the neural network (104) based on the synthesized image (107); wherein the loss function (110) comprises a semantic edge function (111).
2. The device (100) according to claim 1, wherein the semantic edge function (111) is configured to maintain semantic gradient information and/or edge information in the synthesized image (107).
3. The device (100) according to claim 1 or 2, further comprising a first discriminator (201) and a second discriminator (202), wherein the generator (106) is further configured to, in the training phase, provide the synthesized image (107) to the first discriminator (201) or to the second discriminator (202), to train the neural network (104).
4. The device (100) according to claim 3, wherein the generator (106) is further configured to, in the training phase, train the neural network (104) based on a determination result of the discriminator (201, 202) to which the synthesized image (107) was provided.
5. The device (100) according to claim 3 or 4, wherein the generator (106) is further configured to, in the training phase, randomly provide the synthesized image (107) to the first discriminator (201) or to the second discriminator (202).
6. The device (100) according to claim 4 or 5, wherein the first discriminator (201) is further configured to, in the training phase, determine that a synthesized image (107) that is generated by the generator (106) based on a virtual image (109), is a fake, and determine that the virtual image (109) is an original; and wherein the second discriminator (202) is further configured to, in the training phase, determine that a synthesized image (107) that is generated by the generator (106) based on a real image (108), is a fake, and determine that the real image (108) is an original.
7. The device (100) according to any one of the preceding claims, further configured to train the neural network (104), based on the synthesized image (107), for determining at least one of a depth map (101), a relative pose (102), a semantic segmentation (103).
8. The device (100) according to any one of the preceding claims, further configured to, in the training phase, generate a learnable mask (203) based on the synthesized image (107); and train the neural network (104) based on the learnable mask (203).
9. The device (100) according to claim 8, wherein the learnable mask (203) is a semantic inlier mask.
10. The device (100) according to claim 8 or 9, further configured to train the neural network (104), based on the learnable mask (203), for determining at least one of: a depth map (101), a relative pose (102), a semantic segmentation (103).
11. The device (100) according to any one of claims 8 to 10, further configured to, in the training phase, determine segmentation information (204) based on the synthesized image (107) and generate the learnable mask (203) based on the segmentation information (204).
12. The device (100) according to any one of claims 8 to 11, further configured to, in the training phase, determine pose information (205) based on the synthesized image (107) and generate the learnable mask (203) based on the pose information (205).
13. The device (100) according to any one of claims 8 to 12, further configured to, in the training phase, determine an inlier mask (206) based on the synthesized image (107) and generate the learnable mask (203) based on the inlier mask (206).
14. A method (900) for determining a depth map (101), a relative pose (102), or a semantic segmentation (103), the method (900) comprising the steps of:
- in an inference phase, determining (901), by a neural network (104), the depth map (101), the relative pose (102), or the semantic segmentation (103) based on an input image (105); and
- in a training phase, generating (902), by a generator (106), a synthesized image (107) based on a real image (108) or a virtual image (109), and based on a loss function (110), and training (903), by the generator (106), the neural network (104) based on the synthesized image (107); wherein the loss function (110) comprises a semantic edge function (111).
15. A computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method (900) of claim 14.
EP20807341.1A 2020-11-13 2020-11-13 Device and method for improving the determining of a depth map, a relative pose, or a semantic segmentation Pending EP4241237A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/082042 WO2022100843A1 (en) 2020-11-13 2020-11-13 Device and method for improving the determining of a depth map, a relative pose, or a semantic segmentation

Publications (1)

Publication Number Publication Date
EP4241237A1 (en)

Family

ID=73449050

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20807341.1A Pending EP4241237A1 (en) 2020-11-13 2020-11-13 Device and method for improving the determining of a depth map, a relative pose, or a semantic segmentation

Country Status (3)

Country Link
EP (1) EP4241237A1 (en)
CN (1) CN114793457A (en)
WO (1) WO2022100843A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116342800B (en) * 2023-02-21 2023-10-24 中国航天员科研训练中心 Semantic three-dimensional reconstruction method and system for multi-mode pose optimization
CN117115786B (en) * 2023-10-23 2024-01-26 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method

Also Published As

Publication number Publication date
WO2022100843A1 (en) 2022-05-19
CN114793457A (en) 2022-07-26

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20230606

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)