CN116964641A - Domain-adaptive semantic segmentation - Google Patents

Domain-adaptive semantic segmentation

Info

Publication number
CN116964641A
CN116964641A (application CN202180094958.XA)
Authority
CN
China
Prior art keywords
network
domain
image
input image
generator network
Prior art date
Legal status
Pending
Application number
CN202180094958.XA
Other languages
Chinese (zh)
Inventor
Fengyi Shen (沈枫易)
Onay Urfalioglu (奥纳伊·优厄法利欧格路)
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN116964641A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to semantic image segmentation using trainable models such as neural networks. A trainable device is presented that comprises a shared encoder network connected to a first generator network, a second generator network and a semantic segmentation network. In a training phase, an input image from a training image set comprising labeled source domain images and unlabeled target domain images is provided to the shared encoder network, which is trained based on the input image and outputs a first feature map. The first generator network and the second generator network are then trained based on the first feature map, wherein one of the first generator network and the second generator network outputs a domain-converted version of the input image. The shared encoder network is further trained based on the domain-converted input image and outputs a second feature map. The semantic segmentation network is trained based on the input image and the domain-converted input image.

Description

Domain-adaptive semantic segmentation
Technical Field
The invention relates to the field of semantic image segmentation using trainable devices and models such as deep neural networks. The invention provides a trainable device and a corresponding method capable of performing semantic image segmentation. The architecture and training method of the device are specifically designed to train the device using a domain adaptive semantic segmentation process, where the device is trained with training images from different domains.
Background
Trainable devices such as deep neural networks have shown great potential in coping with computer vision challenges such as semantic image segmentation or image recognition. Semantic image segmentation, which refers to the task of classifying each pixel of an image as belonging to a certain class, is a basic building block, for example, in automated driving systems. Traditionally, a large number of pixel-level annotation labels have to be created manually for a training image set before a semantic segmentation model (of a trainable device) can be trained, which can be a time consuming and costly process.
Benefiting from the development of modern computer graphics technology, such as game simulators, virtual images can now closely mimic the appearance of real world objects. Furthermore, fine-grained semantic image labels can be acquired automatically and at large scale from these virtual environments. This opens a new research avenue for computer vision applications. However, brightness and texture differences between images of different domains (in particular, virtual images and real world images) may still limit the transfer of knowledge. A model that is well trained on a virtual source dataset typically suffers from a dramatic drop in performance when applied to a label-free real target domain.
To reduce this cross-domain variance, a simple approach is to compare the mean and variance of feature representations between images from different domains, thereby aligning the statistical properties of the source and target domain features. However, domain gaps tend to manifest at the semantic and pixel levels, rather than at the statistical level.
In view of the above, after showing satisfactory results in pixel-level image style conversion, image-to-image conversion models have attracted a great deal of attention for unsupervised domain adaptation. That is, an image may be transferred from one domain to another. For example, target-domain-like images (the target domain here refers to the unlabeled images on which a device or model should be trained) converted from labeled source domain images can, together with the source labels, provide guidance for learning target domain segmentation. However, while such target-domain-like images are visually pleasing, important features may be lost during image conversion, because a pure image-to-image conversion model is not constructed to convey semantic information. Furthermore, a perfect mapping from the source domain to the target domain cannot be guaranteed. Thus, even when trained with such a domain adaptation procedure, the semantic segmentation performance gain on target domain data remains limited.
Some methods use a domain adaptation process that introduces a convolutional discriminator on the segmentation output to learn the structural distribution of the source segmentation maps, forcing the target domain output to look more like the source domain output. This may help to improve the semantic segmentation performance on target domain images to some extent. However, the adaptable knowledge may not yet be fully explored and transferred from the source domain to the target domain.
Disclosure of Invention
The invention and its embodiments are further based on the following considerations.
Learning from a virtual world dataset is of great interest for real world applications (e.g., semantic image segmentation) due to the difficulty in acquiring ground truth labels. It can be shown that training in combination with labeled virtual world data (source domain) and unlabeled real world data (target domain) can achieve satisfactory results. From a domain adaptation perspective, a key challenge is to narrow the domain gap between two domains in order to benefit from virtual data.
In view of this, embodiments of the present invention aim to provide an improved trainable device for semantic image segmentation. The device and method according to the invention make it possible to reduce the domain gap between the source domain and the target domain of the images when training a trainable device, and according to the invention a new method is provided for training a device using both source domain images and target domain images. The goal is thus to benefit from virtual source domain images so that the better trained device can, once training is complete, perform semantic image segmentation on target domain images more accurately.
These and other objects are achieved by embodiments of the invention as described in the appended independent claims. Advantageous implementations of these embodiments are further defined in the dependent claims.
In the present invention, a trident (three-branch) structure is presented for a trainable device. The focus of the present invention is thus on intra-domain reconstruction and cross-domain transformation in order to learn domain independent features from a training image set comprising both source domain images and target domain images.
A first aspect of the invention provides a trainable device for semantic image segmentation, the device comprising a shared encoder network connected to a first generator network, a second generator network and a semantic segmentation network, respectively, the device being arranged to, during a training phase: obtaining a training image set comprising a first subset of labeled source domain images and a second subset of unlabeled target domain images; providing a source domain image or a target domain image as an input image to the shared encoder network; training the shared encoder network based on the input image, wherein the shared encoder network is configured to output a first feature map; training the first and second generator networks based on the first feature map, wherein one of the first and second generator networks is used to output a domain-converted input image; further training the shared encoder network based on the domain-converted input image, wherein the shared encoder network is configured to output a second feature map; and training the semantic segmentation network based on the input image and the domain-converted input image.
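Purely as an illustration of this trident arrangement (it is not part of the claimed device), the following Python/PyTorch sketch wires a toy shared encoder to two generators and a segmentation head; the module names (SharedEncoder, Generator, SegmentationHead), layer sizes and the number of classes are assumptions made only for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedEncoder(nn.Module):
    """Shared encoder network (101): maps an input image to a feature map."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)


class Generator(nn.Module):
    """Generator network (102 or 103): decodes a feature map back to image space."""
    def __init__(self, feat_ch=64, out_ch=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.ConvTranspose2d(feat_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat_ch, out_ch, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, f):
        return self.body(f)


class SegmentationHead(nn.Module):
    """Semantic segmentation network (104): predicts per-pixel class logits."""
    def __init__(self, feat_ch=64, num_classes=19):
        super().__init__()
        self.classifier = nn.Conv2d(feat_ch, num_classes, 1)

    def forward(self, f, out_size):
        return F.interpolate(self.classifier(f), size=out_size,
                             mode="bilinear", align_corners=False)


# Trident wiring: one shared encoder feeds both generators and the segmentation head.
encoder = SharedEncoder()
source_generator, target_generator = Generator(), Generator()
seg_head = SegmentationHead()

x = torch.randn(1, 3, 128, 256)          # input image (105)
f1 = encoder(x)                          # first feature map (106)
reconstruction = source_generator(f1)    # same-domain reconstruction
converted = target_generator(f1)         # domain-converted input image (107)
f2 = encoder(converted)                  # second feature map (306)
seg_logits = seg_head(f1, x.shape[-2:])  # segmentation of the input image
```

In a real implementation the encoder and segmentation head would typically be much deeper (e.g. ResNet-101 and DeepLab-v2 based, as discussed in the detailed description below).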
The apparatus according to the first aspect comprises the shared encoder network, which may be forced to meet both source domain constraints and target domain constraints. This helps to reduce domain specific bias, thereby reducing domain gaps.
The device according to the first aspect may be trained using both source domain images and target domain images, which makes the trained device perform better in semantic image segmentation tasks on target domain images. That is, the trained device provides more accurate segmentation results for the target domain input image.
In an implementation form of the first aspect, the other of the first generator network and the second generator network is configured to output a first reconstructed input image when the first generator network and the second generator network are trained based on the first feature map.
In an implementation manner of the first aspect, the trainable device is further configured to, during the training phase: the other of the first generator network and the second generator network is further trained based on the second feature map, wherein the other of the first generator network and the second generator network is used to output a second reconstructed input image.
In an implementation manner of the first aspect, the first reconstructed input image and/or the second reconstructed input image is obtained by minimizing a reconstruction loss.
The generator network for reconstructing the input image (i.e. generating the first reconstructed image and the second reconstructed image) learns the data distribution that should be placed on its output image. Thus, when the generator network subsequently acts as an image converter for the input image, the generator network may produce a more efficient domain conversion from one domain to another.
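As a hedged sketch of the reconstruction objective described above (the exact loss used by the device is not specified here), an L1 pixel distance between the reconstructed image and the original input image is one common choice:

```python
import torch.nn.functional as F

def reconstruction_loss(reconstructed, original):
    # Pixel-wise L1 distance between a (first or second) reconstructed input
    # image and the original input image (105); minimising it trains the
    # generator of the input image's own domain to reproduce that image.
    return F.l1_loss(reconstructed, original)
```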
In an implementation manner of the first aspect, the semantic segmentation network is configured to obtain deeper semantic features based on the first feature map and the second feature map, and minimizing a semantic feature cyclic reconstruction loss includes minimizing feature distances between the deeper semantic features.
In this way, the first and second generator networks may be improved to carry higher levels of semantic information while further supporting that the shared encoder network and the semantic segmentation network are domain independent for the input image and its domain transformations.
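For illustration only, the semantic feature cycle reconstruction loss described above could be sketched as follows; the use of an L1 distance and the name seg_net_features (standing for the segmentation network layers preceding its final output layer) are assumptions.

```python
import torch.nn.functional as F

def sfcr_loss(seg_net_features, feat_input, feat_converted):
    # Semantic feature cycle reconstruction (SFCR) loss: pass the feature map
    # of the input image (first feature map) and of its domain-converted
    # version (second feature map) through the segmentation network layers
    # preceding the final output layer, then minimise the distance between
    # the resulting deeper semantic features.
    deep_a = seg_net_features(feat_input)
    deep_b = seg_net_features(feat_converted)
    return F.l1_loss(deep_a, deep_b)
```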
In an implementation form of the first aspect, the first generator network comprises a source domain generator network, the second generator network comprises a target domain generator network, and if the input image is a source domain input image selected from the first subset, the device is configured to train the first generator network and the second generator network based on the first feature map by: training the source domain generator network based on the first feature map, wherein the source domain generator network is used for outputting the reconstructed input image; training the target domain generator network based on the first feature map, wherein the target domain generator network is configured to output the domain transformed input image.
This describes the scenario of the training device when the input image comes from the source domain.
In an implementation form of the first aspect, the trainable device is configured to train the first generator network and the second generator network based on the second feature map by: training the source domain generator network based on the second feature map, wherein the source domain generator network is configured to output the second reconstructed input image.
This may be referred to as a cyclic reconstruction of the input image. That is, the second reconstructed image may be a cyclically reconstructed input image.
In an implementation form of the first aspect, the semantic segmentation network is configured to minimize segmentation loss based on the labels of the source domain input image, the domain conversion input image, and the source domain input image when training based on the source domain input image and the domain conversion input image.
In this way, the trainable device may be trained to perform more accurate semantic image segmentation.
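A minimal sketch of this shared-label segmentation objective is given below, assuming a standard per-pixel cross-entropy loss and an ignore_index for unlabeled pixels; both choices are assumptions, not a statement of the device's actual loss.

```python
import torch.nn.functional as F

def source_segmentation_loss(logits_source, logits_converted, source_label, ignore_index=255):
    # The source domain input image and its domain-converted version share the
    # same ground-truth segmentation label, so the segmentation loss can be
    # minimised for both predictions.
    return (F.cross_entropy(logits_source, source_label, ignore_index=ignore_index)
            + F.cross_entropy(logits_converted, source_label, ignore_index=ignore_index))
```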
In an implementation form of the first aspect, the first generator network comprises a source domain generator network, the second generator network comprises a target domain generator network, and if the input image is a target domain input image selected from the second subset, the device is configured to train the first generator network and the second generator network based on the first feature map by: training the source domain generator network based on the first feature map, wherein the source domain generator network is used for outputting the domain conversion input image; training the target domain generator network based on the first feature map, wherein the target domain generator network is configured to output the reconstructed input image.
A scenario is described in which a device is trained when an input image is from a target domain. The input image may be alternately selected from the source domain and the target domain.
In an implementation form of the first aspect, the apparatus is configured to train the first generator network and the second generator network based on the second feature map by: training the target domain generator network based on the second feature map, wherein the target domain generator network is configured to output the second reconstructed input image.
In an implementation form of the first aspect, the semantic segmentation network is configured to provide a segmentation result of the target domain input image to a segmentation discriminator of the trainable device, the segmentation discriminator being configured to compare the segmentation result with a previous segmentation result of a source domain input image.
In this way, the trainable device may be trained to perform more accurate semantic image segmentation.
Ideally, a symmetric training process would be achieved by alternately feeding labeled source domain images and target domain images to the shared encoder network. However, unlike the labeled source domain images, the target domain images have no labels available yet, and therefore perfectly symmetric training is not possible. To compensate for this, a segmentation discriminator is used when the input image comes from the target domain.
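One possible realisation of such a segmentation discriminator, sketched here only for illustration, is a small fully convolutional network trained with a least-squares adversarial objective on the softmax segmentation maps; its architecture and the loss form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegDiscriminator(nn.Module):
    """Segmentation discriminator: judges whether a softmax segmentation map
    looks like a source-domain prediction or a target-domain prediction."""
    def __init__(self, num_classes=19):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_classes, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, stride=2, padding=1),
        )

    def forward(self, seg_softmax):
        return self.net(seg_softmax)


def adversarial_seg_loss(discriminator, target_logits):
    # Term used to update the segmentation side: push target-domain predictions
    # to be classified as source-like (label 1) by the discriminator.
    score = discriminator(F.softmax(target_logits, dim=1))
    return F.mse_loss(score, torch.ones_like(score))
```

The discriminator itself would be updated with opposite labels on source and target predictions, as is usual in adversarial training.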
In an implementation manner of the first aspect, the trainable device is configured to, during the training phase: each of the source domain images of the first subset and each of the target domain images of the second subset are provided to the shared encoder network as a sequence of input images, wherein the source domain images alternate with the target domain images in the sequence of input images.
In this way, domain adaptive training of the trainable device may be performed with optimal results.
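A trivial sketch of this alternating provision of input images (assuming simple in-memory lists of images) could look as follows:

```python
from itertools import chain

def alternating_sequence(source_images, target_images):
    # Interleave labeled source domain images and unlabeled target domain
    # images so that they alternate in the input image sequence.
    return list(chain.from_iterable(zip(source_images, target_images)))

# e.g. ['s0', 't0', 's1', 't1', 's2', 't2']
sequence = alternating_sequence([f"s{i}" for i in range(3)], [f"t{i}" for i in range(3)])
```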
In an implementation form of the first aspect, the trainable device is further configured to, after each source domain image of the first subset and each target domain image of the second subset has been provided to the shared encoder network: operating the semantic segmentation network based on each target domain image of the second subset, wherein the semantic segmentation network is configured to generate a pseudo label for each target domain image of the second subset to obtain a pseudo-labeled second subset of pseudo-labeled target domain images; and further training the device by providing each source domain image of the first subset and each pseudo-labeled target domain image of the pseudo-labeled second subset as another input image sequence to the shared encoder network, wherein the source domain images alternate with the pseudo-labeled target domain images in the other input image sequence.
This describes the second stage of the training process, which follows the first stage (the first stage is completed after each source domain image of the first subset and each target domain image of the second subset has been provided to the shared encoder network). By combining a two-stage training process involving self-induced label generation (pseudo labels), state-of-the-art results are obtained on benchmark datasets for domain-adaptive semantic image segmentation.
The source domain images and their labels may remain the same throughout. Everything done in the first stage can be repeated in the second stage. The only difference may be that in the second stage a pseudo label is now available for each target domain image, so that the semantic segmentation network may also be trained based on the target domain input image and the domain-converted target domain input image, using the pseudo label.
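A minimal sketch of the pseudo-label generation step that links the first and second stages is shown below; the confidence threshold of 0.9 follows the example given later in the description, while the function names and the use of an ignore_index are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_pseudo_label(encoder, seg_head, target_image, threshold=0.9, ignore_index=255):
    # Run the device pre-trained in the first stage on an unlabeled target
    # domain image and keep only confident per-pixel predictions as pseudo
    # labels; low-confidence pixels are set to ignore_index and excluded from
    # the segmentation loss in the second stage.
    features = encoder(target_image)
    probs = F.softmax(seg_head(features, target_image.shape[-2:]), dim=1)
    confidence, label = probs.max(dim=1)
    label[confidence < threshold] = ignore_index
    return label
```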
In one implementation of the first aspect, the labeled source domain images are virtual world images with segmentation labels and the unlabeled target domain images are unlabeled real world images.
For example, the source domain may use the virtual datasets GTA5 and SYNTHIA, and the target domain may use the real world dataset Cityscapes. The virtual datasets provide labeled images without the need for manual labeling. That is, the trainable device of the first aspect is suitable for unsupervised domain-adaptive semantic segmentation.
In an implementation form of the first aspect, the first feature map and/or the second feature map comprises domain independent segmentation features.
In an implementation manner of the first aspect, the trainable device is further configured to, in an inference phase: receiving any target domain image; operating the shared encoder network based on the received target domain image, wherein the shared encoder network is configured to output a third feature map; and operating the semantic segmentation network based on the third feature map, wherein the semantic segmentation network is used for outputting a segmentation result of the target domain image.
After the two training stages described above, the trainable device is adapted to the target domain and is thus able to segment any target domain image.
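As a sketch, inference with the trained device then only needs the shared encoder network and the semantic segmentation network (names and signatures follow the earlier illustrative snippets and are assumptions):

```python
import torch

@torch.no_grad()
def segment(encoder, seg_head, target_image):
    # Inference: only the shared encoder network (101) and the semantic
    # segmentation network (104) are needed to segment a target domain image.
    third_feature_map = encoder(target_image)
    logits = seg_head(third_feature_map, target_image.shape[-2:])
    return logits.argmax(dim=1)  # per-pixel class indices as the segmentation result
```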
A second aspect of the invention provides a method for training a device for semantic image segmentation, the device comprising a shared encoder network connected to a first generator network, a second generator network and a semantic segmentation network, respectively, the method comprising, in a training phase of the device: obtaining a training image set comprising a first subset of labeled source domain images and a second subset of unlabeled target domain images; providing a source domain image or a target domain image as an input image to the shared encoder network; training the shared encoder network based on the input image, wherein the shared encoder network outputs a first feature map; training the first and second generator networks based on the first feature map, wherein one of the first and second generator networks outputs a domain-converted input image; further training the shared encoder network based on the domain-converted input image, wherein the shared encoder network outputs a second feature map; and training the semantic segmentation network based on the input image and the domain-converted input image.
In one implementation of the second aspect, the other of the first generator network and the second generator network outputs a first reconstructed input image when the first generator network and the second generator network are trained based on the first feature map.
In one implementation manner of the second aspect, the method further includes, during the training phase: the other of the first generator network and the second generator network is further trained based on the second feature map, wherein the other of the first generator network and the second generator network outputs a second reconstructed input image.
In one implementation of the second aspect, the first reconstructed input image and/or the second reconstructed input image is obtained by minimizing a reconstruction loss.
In an implementation manner of the second aspect, the semantic segmentation network obtains deeper semantic features based on the first feature map and the second feature map, and minimizing a semantic feature cyclic reconstruction loss includes minimizing feature distances between the deeper semantic features.
In one implementation of the second aspect, the first generator network comprises a source domain generator network, the second generator network comprises a target domain generator network, and if the input image is a source domain input image selected from the first subset, the method comprises training the first generator network and the second generator network based on the first feature map by: training the source domain generator network based on the first feature map, wherein the source domain generator network outputs the reconstructed input image; training the target domain generator network based on the first feature map, wherein the target domain generator network outputs the domain transformed input image.
In one implementation of the second aspect, the method includes training the first generator network and the second generator network based on the second feature map by: training the source domain generator network based on the second feature map, wherein the source domain generator network outputs the second reconstructed input image.
In one implementation manner of the second aspect, the semantic segmentation network minimizes segmentation loss based on the labels of the source domain input image, the domain conversion input image and the source domain input image when training based on the source domain input image and the domain conversion input image.
In one implementation of the second aspect, the first generator network comprises a source domain generator network, the second generator network comprises a target domain generator network, and if the input image is a target domain input image selected from the second subset, the method comprises training the first generator network and the second generator network based on the first feature map by: training the source domain generator network based on the first feature map, wherein the source domain generator network outputs the domain transformed input image; training the target domain generator network based on the first feature map, wherein the target domain generator network outputs the reconstructed input image.
In one implementation of the second aspect, the method includes training the first generator network and the second generator network based on the second feature map by: training the target domain generator network based on the second feature map, wherein the target domain generator network outputs the second reconstructed input image.
In one implementation of the second aspect, the semantic segmentation network provides a segmentation result of the target domain input image to a segmentation discriminator of the trainable device, which compares the segmentation result with a previous segmentation result of a source domain input image.
In one implementation of the second aspect, the method comprises, during the training phase, providing each of the source domain images of the first subset and each of the target domain images of the second subset as a sequence of input images to the shared encoder network, wherein the source domain images alternate with the target domain images in the sequence of input images.
In one implementation of the second aspect, the method further comprises, after each source domain image of the first subset and each target domain image of the second subset has been provided to the shared encoder network: operating the semantic segmentation network based on each target domain image of the second subset, wherein the semantic segmentation network is configured to generate a pseudo label for each target domain image of the second subset to obtain a pseudo-labeled second subset of pseudo-labeled target domain images; and further training the device by providing each source domain image of the first subset and each pseudo-labeled target domain image of the pseudo-labeled second subset as another input image sequence to the shared encoder network, wherein the source domain images alternate with the pseudo-labeled target domain images in the other input image sequence.
In one implementation of the second aspect, the labeled source domain images are virtual world images with segmentation labels and the unlabeled target domain images are unlabeled real world images.
In an implementation manner of the second aspect, the first feature map and/or the second feature map includes domain independent segmentation features.
In one implementation manner of the second aspect, the method further includes, in an inference phase: receiving any target domain image; operating the shared encoder network based on the received target domain image, wherein the shared encoder network is configured to output a third feature map; and operating the semantic segmentation network based on the third feature map, wherein the semantic segmentation network is used for outputting a segmentation result of the target domain image.
The method according to the second aspect of the invention may be performed by the device according to the first aspect of the invention. Other features and implementations of the method according to the second aspect of the invention correspond to features and implementations of the device according to the first aspect of the invention.
The method according to the second aspect of the invention may be extended to an implementation corresponding to an implementation of the device according to the first aspect of the invention. Accordingly, an implementation of the method includes one or more features of a corresponding implementation of the device.
The advantages of the method according to the second aspect of the application are the same as those of the corresponding implementation of the device according to the first aspect of the application.
A third aspect of the present application provides a computer program comprising program code for performing the method according to the second aspect or its implementation form when executed on a computer.
A fourth aspect of the present application provides a non-transitory storage medium storing executable program code which, when executed by a processor, performs a method according to the second aspect or any one of its implementation forms.
The program code may be adapted to implement the network and training procedure of the trainable device of the first aspect as described above.
It should be noted that all the devices, elements, units and modules described in the present application may be implemented in software or hardware elements or any kind of combination thereof. All steps performed by the various entities described in the present application and functions to be performed by the various entities described are intended to mean that the respective entities are adapted to perform the respective steps and functions. Although in the following description of specific embodiments, specific functions or steps performed by external entities are not reflected in the description of specific detailed elements of the entity performing the specific steps or functions, it should be clear to a skilled person that these methods and functions may be implemented by corresponding hardware or software elements or any combination thereof.
Drawings
The aspects and implementations described above will be described in the following description of specific embodiments with reference to the accompanying drawings, in which:
fig. 1 shows a trainable device provided by an embodiment of the present invention.
Fig. 2 shows a trainable device provided by an embodiment of the present invention.
Fig. 3 shows a trainable device provided by an embodiment of the present invention in which a source domain image is provided as an input image.
Fig. 4 shows a trainable device provided by an embodiment of the present invention in which a target domain image is provided as an input image.
Figure 5 shows qualitative results of the GTA5 to Cityscapes adaptation on the Cityscapes validation set. From columns (a) to (d), fig. 5 shows: the target domain input image; the ground truth labels; the segmentation prediction of the source-only model; and the segmentation prediction (with domain adaptation) of the device provided by an embodiment of the present invention.
Fig. 6 shows a method of training a device provided by an embodiment of the invention.
Detailed Description
Fig. 1 shows a trainable device 100 provided by an embodiment of the present invention. The trainable device 100 may be trained and configured for semantic image segmentation, for example, for performing semantic image segmentation on images from a target domain (e.g., unlabeled real world images). When the device 100 has been sufficiently trained, the device 100 may advantageously perform accurate semantic image segmentation on these images, as described below. Typically, during the training phase, the device 100 is provided with training data: labeled training images from a source domain (e.g., labeled virtual world images), i.e., such training images are provided together with one or more segmentation labels associated with them, and unlabeled images from a target domain.
The device 100 comprises a shared encoder network 101 connected to a first generator network 102, a second generator network 103 and a semantic segmentation network 104, respectively. Each network 101, 102, 103, 104 of the device 100 may be implemented by a trainable model, for example by a deep neural network. For example, the shared encoder network 101 may be implemented by a residual convolutional neural network (CNN), which in particular may be, or be based on, ResNet-101, as known in the art. Furthermore, the semantic segmentation network 104 may be implemented by a CNN, which in particular may be, or be based on, a deep neural network such as the DeepLab-v2 network known in the art.
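For illustration, one way to obtain such a ResNet-101 based convolutional feature extractor with torchvision is sketched below; the truncation point and the resulting feature map shape are assumptions about one possible configuration, not a description of the device's actual backbone.

```python
import torch
import torch.nn as nn
from torchvision import models

# A ResNet-101 backbone truncated before its pooling and classification layers
# yields a fully convolutional feature extractor that outputs a feature map.
resnet = models.resnet101()  # optionally with ImageNet pre-trained weights
shared_encoder = nn.Sequential(*list(resnet.children())[:-2])

image = torch.randn(1, 3, 512, 1024)
feature_map = shared_encoder(image)  # e.g. shape (1, 2048, 16, 32)
```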
During the training phase of device 100, networks 101, 102, 103, and 104, which may be trained individually, are trained using a training image set. That is, the training image set is provided to the apparatus 100 in an image-by-image manner, and the apparatus 100 operates on each image. For example, training images are provided to the shared encoder network 101. Furthermore, known training techniques may be applied during training the device 100 and its networks 101, 102, 103, 104 based on training the image set, such as back propagation, iterative optimization, etc.
Subsequently, in the inference phase of the device 100, i.e. after the device 100 has been trained, a target domain image (e.g. a real world image) may be provided to the device 100, and the trained device 100 may output the segmentation result of the target domain image. The segmentation result may include one or more labels associated with the target domain image, where each label may indicate a categorized segment of the image, or may indicate a category for each pixel of the image (e.g., the label may include the categories "street", "person", "building", "tree", etc., and the label may also indicate its location in the image, e.g., which pixel or pixels).
In a training phase, the apparatus 100 is used to obtain a training image set, wherein the training image set comprises a first subset of labeled source domain images and a second subset of unlabeled target domain images. The labeled source domain images may be virtual world images with segmentation labels, and the unlabeled target domain images may be unlabeled real world images. The source domain images may be provided by a computer program implementing modern graphics technology, such as a video game or game simulator. The computer program may output the source domain images and their corresponding labels. For example, fine-grained semantic image labels may be provided at large scale by such a computer program or virtual environment. Thus, unsupervised learning is possible, i.e. a human does not need to provide labeled images and/or labels.
The trainable device 100 is further configured to provide a source domain image or a target domain image of the training image set as an input image 105 to the shared encoder network 101 while still in the training phase. That is, the input image 105 is input into the shared encoder network 101 so that the shared encoder network 101 can operate on the input image 105.
The trainable device 100 is for training the shared encoder network 101 based on the input image 105, wherein the shared encoder network 101 is for outputting the first feature map 106. Training the shared encoder network 101 includes operating the shared encoder network 101 on the input image 105 to generate a first feature map 106. It should be noted that the shared encoder network 101 may generally be used to output a signature as it receives and operates on an image.
The trainable device 100 is then used to train the first generator network 102 and the second generator network 103 based on the first feature map 106. Thus, one of the first generator network 102 and the second generator network 103 is used to output a domain converted input image 107. Thus, the domain converted input image 107 may be associated with the input image 105. For example, in the case where the first generator network 102 and the second generator network 103 are a source domain generator network and a target domain generator network (to be explained in more detail below), respectively, if the input image 105 is a target domain image, the source domain generator network outputs a domain-converted input image 107, and if the input image 105 is a source domain image, the target domain generator network outputs the domain-converted input image 107.
The trainable device 100 is then configured to further train the shared encoder network 101 based on the domain transformed input image 107, wherein the shared encoder network 101 may be configured to output a second feature map 306 (see e.g. fig. 3). Further training of the shared encoder network 101 includes operating the shared encoder network 101 on the domain transformed input image 107, thereby generating a second feature map 306.
The trainable device 100 is then used to train the semantic segmentation network 104 based on the input image 105 and the domain transformed input image 107. For example, the semantic segmentation network 104 may operate based on the input image 105 and the domain transformed input image 107 to output corresponding segmentation results.
The trainable device 100 may include a processor or processing circuit (not shown) for performing, conducting, or initiating various operations of the trainable device 100 described herein to implement and operate the various networks 101, 102, 103, 104 of the trainable device 100.
The processing circuitry may comprise hardware and/or the processing circuitry may be controlled by software. The hardware may include analog circuits or digital circuits, or both analog and digital circuits. The digital circuitry may include components such as application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (digital signal processor, DSP), or multi-purpose processors. The trainable device 100 may also include memory circuitry that stores one or more instructions executable by the processor or processing circuitry (e.g., under control of software). For example, the memory circuit may include a non-transitory storage medium storing executable software code that, when executed by a processor or processing circuit, causes the trainable device 100 to perform various operations. In one embodiment, a processing circuit includes one or more processors and a non-transitory memory coupled to the one or more processors. The non-transitory memory may carry executable program code that, when executed by the one or more processors, causes the trainable device 100 to perform, or initiate the operations or methods described herein, in particular the operations or methods of the networks 101, 102, 103, 104 of the trainable device 100.
In the present invention, a new approach for solving the domain-adaptive semantic segmentation problem is provided by the trainable device 100 described above. The trainable device 100 is designed based on the assumption that a shared feature map (generated using the shared encoder network 101) that can help (simultaneously) generate source domain outputs and target domain outputs should be independent of any input domain.
Fig. 2 illustrates a trainable device 100 constructed on the embodiment illustrated in fig. 1 provided by an embodiment of the present invention. Like elements in fig. 1 and 2 are labeled with like reference numerals and may be implemented in the same manner.
As shown in fig. 2, the trainable device 100 is designed according to a so-called "trident adaptive framework", i.e. the shared encoder network "module" 101 is connected in a three-branch structure to the generator networks 102, 103 ("source module" and "target module") and to the semantic segmentation network 104. Thus, the shared encoder network 101 is forced to satisfy both source and target constraints. This results in learning a domain-invariant representation of the shared features. The architecture of the device 100 may be trained using a two-stage continuous self-training process. In this way, the trainable device 100 may, in the second stage, benefit from intermediate pseudo labels obtained from the first stage.
Fig. 3 and 4 illustrate a trainable device 100 provided by an embodiment of the present invention, which is based on the embodiment illustrated in fig. 1 and 2, but includes specific implementation details. Like elements in fig. 1 and 3/4 are labeled with like reference numerals and may be implemented in the same manner.
As described above, the training dataset (i.e., the training image set) may consist of source domain images with segmentation labels (the first subset of training images) and label-free target domain images (the second subset of images). Fig. 3 and 4 show the cases when a source domain image (Xs) from the first subset is provided as the input image 105' to the device 100 (fig. 3) and when a target domain image (Xt) from the second subset is provided as the input image 105'' to the device 100 (fig. 4), respectively. During training of the trainable device 100, each source domain image of the first subset and each target domain image of the second subset may be provided to the device 100, in particular to the shared encoder network 101, as a sequence of input images 105. The source domain images may thereby alternate with the target domain images in the sequence of input images 105. That is, during training of the device 100, source domain images and target domain images may be fed into the shared encoder network 101 in alternation.
Furthermore, efficient data distribution modeling is beneficial for the design purposes of the trainable device 100. While classical image-to-image conversion methods may show some clues as to how to model the image distribution using a generative adversarial network (GAN), these methods are typically more focused on changing the image appearance for style conversion, while ignoring semantic-level information. Thus, to model the source distribution 201 and the target distribution 202, semantic-aware generators (i.e., generator networks 102, 103) are proposed for the device 100 sharing the encoder network 101. As shown in fig. 2 and 3, the trident training pipeline thus comprises four modules: a shared encoder network 101, which may be initialized with pre-trained backbone weights; a first generator network 102, which in this embodiment includes a source domain generator network 302; a second generator network 103, which in this embodiment comprises a target domain generator network 303; and the semantic segmentation network 104.
During training of the device 100, the two generator networks 102/302, 103/303 switch their roles according to the domain of the input image (see fig. 3 for a source domain input image and fig. 4 for a target domain input image). If the shared encoder network 101 receives an image from the first subset of source domain images and outputs a first (shared) feature map 106, the source domain generator network 302 takes the feature map 106 and generates a reconstructed input image 304 as a source image reconstruction. In parallel, the target domain generator network 303 takes the same feature map 106 and generates a domain-converted input image 107', in this case a target domain converted image. This may be done with the aid of the target discriminator 300 (Dt). The target domain converted input image 107' is then sent back to the shared encoder network 101 and passed through the source domain generator network 302 to obtain another reconstructed input image 305, which may be a cyclic reconstruction of the source domain input image 105. At the same time, the source domain input image 105 and its target conversion, i.e., the target domain converted input image 107', are sent to the semantic segmentation network 104, where the ground truth labels 307, 308 can be shared to minimize the segmentation loss and output the segmentation results.
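The following sketch summarises one such source-domain training iteration under the naming assumptions of the earlier snippets; the least-squares form of the adversarial term, the L1 reconstruction losses and the unweighted sum of losses are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

def source_domain_step(x_s, y_s, encoder, src_gen, tgt_gen, seg_head, d_target):
    # 1) Encode the source domain input image (105) into the first feature map (106).
    f1 = encoder(x_s)

    # 2) Source generator reconstructs the input (304); target generator converts
    #    it to the target domain (107'), guided by the target discriminator Dt (300).
    x_rec = src_gen(f1)
    x_s2t = tgt_gen(f1)
    d_out = d_target(x_s2t)
    adv_loss = F.mse_loss(d_out, torch.ones_like(d_out))

    # 3) Re-encode the converted image into the second feature map (306) and
    #    cyclically reconstruct the source image (305) with the source generator.
    f2 = encoder(x_s2t)
    x_cyc = src_gen(f2)
    rec_loss = F.l1_loss(x_rec, x_s) + F.l1_loss(x_cyc, x_s)

    # 4) The source image and its target conversion share the ground-truth label
    #    y_s, so the segmentation loss is minimised for both feature maps.
    out_size = x_s.shape[-2:]
    seg_loss = (F.cross_entropy(seg_head(f1, out_size), y_s)
                + F.cross_entropy(seg_head(f2, out_size), y_s))

    return rec_loss + adv_loss + seg_loss
```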
Conversely, if the shared encoder network 101 receives a target domain image as the input image 105 from the second subset of the training image set (see fig. 4), the shared encoder network 101 outputs a second (shared) feature map 306. In this case, instead of performing the above-described image conversion, the target domain generator network 303 generates a reconstructed input image 404, which is a reconstruction of the target domain input image 105, for example obtained by minimizing an image reconstruction loss. In parallel, the source domain generator network 302 takes the same second feature map 306 and generates a domain-converted input image 107'', in this case the source domain conversion of the input image 105. This may be done with the aid of the source discriminator 400 (Ds). The domain-converted (source) input image 107'' is then sent back to the shared encoder network 101 and passed through the target domain generator network 303 to obtain another reconstructed input image 405, which may be a cyclic reconstruction of the target domain input image 105. Since in this case no ground truth label is available for the target domain data, the target domain segmentation result may be passed to the segmentation discriminator 402, which compares it with previous source domain predictions for adversarial learning. That is, the semantic segmentation network 104 is used to provide the segmentation result 401 of the target domain input image 105 to the segmentation discriminator 402, and the segmentation discriminator 402 is used to compare the segmentation result 401 with a previous segmentation result of a source domain input image 105.
By switching the roles of the generator networks 102/302 and 103/303 as the domain of the input image 105 switches, it can be ensured that the source domain generator network 302 only generates source domain images and the target domain generator network 303 only generates target domain images, regardless of the domain of the input image. In this way, when the source domain generator network 302 is used to reconstruct the source domain input image 105 (reconstructed images 304 and 305), it can learn how to put the true source domain image distribution 201 on its output images. Thus, when the source domain generator network 302 acts as an image converter providing the domain-converted input image 107'', an efficient target-to-source conversion can be ensured.
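As a hedged illustration of this role switching (the function and domain names are assumptions):

```python
def assign_roles(input_domain, source_generator, target_generator):
    # The generator of the input image's own domain reconstructs the input,
    # while the generator of the other domain produces the domain-converted
    # image, so the two generators swap roles with the input domain.
    if input_domain == "source":
        return source_generator, target_generator  # (reconstructor, converter)
    return target_generator, source_generator
```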
By connecting the source domain generator network 302 and the target domain generator network 303 to the shared encoder network 101, and following the illustrated training pipeline, the shared encoder network 101 can extract domain independent feature maps 106, 306, since training forces every feature map to enable the generator networks 102, 103 to generate both a source domain version and a target domain version of the image, regardless of the domain of the input image 105. Thus, source domain knowledge can be better adapted to the target domain data.
Furthermore, during the cyclic reconstruction step for the two domains, the design of the device 100 (the trident design) makes it possible to introduce a semantic feature cycle reconstruction (SFCR) loss to better support training of the device 100. In particular, the layers preceding the final segmentation output layer of the semantic segmentation network 104 may be used, and for both domains, the encoded feature maps of the input image 105 and of its converted version may be further passed through these layers to obtain deeper semantic features. The semantic segmentation network 104 may thus be used to obtain deeper semantic features based on the first feature map 106 and the second feature map 306, and minimizing the semantic feature cycle reconstruction loss includes minimizing the feature distances between these deeper semantic features. By minimizing the feature distances 301 (fig. 3) and 406 (fig. 4), the source domain generator network 302 and the target domain generator network 303 may be improved to carry higher-level semantic information, while supporting that the shared encoder network 101 and the semantic segmentation network 104 are domain independent with respect to the input image 105 and its cross-domain conversion 107 (specifically, 107' in fig. 3 and 107'' in fig. 4).
As a result of training the source domain generator network 302 and the target domain generator network 303, the source domain distribution 201 and the target domain distribution 202 are placed in an adversarial position from the perspective of the shared encoder network 101. Thus, for any domain of the input image 105, the shared encoder network 101 is forced to satisfy both the source domain constraint and the target domain constraint applied to it, thereby generating a domain-invariant feature representation that the semantic segmentation network 104 uses for semantic segmentation.
After the above training (first stage) of the device 100, the pre-trained device 100 may be used to generate pseudo labels 403, for example with a confidence threshold of 0.9, for each target domain training image. The device 100 may then be trained again (second stage), but this time using the pseudo labels for weak target supervision. After each source domain image of the first subset and each target domain image of the second subset has been provided to the shared encoder network 101, the semantic segmentation network 104 may operate based on each target domain image of the second subset, wherein the semantic segmentation network 104 generates a pseudo label 403 for each target domain image of the second subset to obtain a pseudo-labeled second subset of pseudo-labeled target domain images. The device 100 may then be further trained by providing each source domain image of the first subset and each pseudo-labeled target domain image of the second subset as another sequence of input images 105 to the shared encoder network 101. This time, the source domain images may alternate with the pseudo-labeled target domain images in the other sequence of input images 105. Through this second stage, the shared encoder network 101 and the semantic segmentation network 104 are further enhanced for semantic segmentation of the target domain.
In the inference phase, the device 100 may then be configured to receive any target domain image and to operate the shared encoder network 101 based on the received target domain image, wherein the shared encoder network 101 is configured to output a third feature map. The semantic segmentation network 104 may then be operated based on the third feature map, wherein the semantic segmentation network 104 is configured to output a segmentation result of the target domain image. That is, in the inference phase of the device 100, only the shared encoder network 101 and the semantic segmentation network 104 may be employed to segment one or more target domain images.
Fig. 5 provides some qualitative results of an embodiment of the present invention, namely of the device 100 performing domain-adaptive semantic segmentation. The source domain images of the first subset belong to the labeled GTA5 virtual dataset. The target domain images of the second subset belong to the Cityscapes real world dataset. The device 100 can be trained using a ResNet-101 based model (for the shared encoder network 101) and a DeepLab-v2 based model (for the semantic segmentation network 104). After training, the trained device 100 and model may be tested on the Cityscapes validation set.
In this regard, fig. 5 shows in column (a) the input Cityscapes image, in column (b) the corresponding ground truth, and in column (c) the source-only prediction of the device 100 (meaning that the device 100 was trained using ResNet-101 (for the shared encoder network 101) and DeepLab-v2 (for the semantic segmentation network 104) on source domain images only, and tested directly on target domain images, i.e. without employing the domain adaptation method described above). Column (d) shows the segmentation prediction of the device 100 using the trident architecture and the domain adaptation method. It can be seen that this improves the performance of cross-domain semantic segmentation compared to column (c).
Fig. 6 illustrates a method 600 provided by an embodiment of the invention. The method 600 may be performed by the device 100 and may be a method for training the device 100 for semantic image segmentation. Likewise, the device 100 may comprise a shared encoder network 101 connected to the first generator network 102, the second generator network 103 and the semantic segmentation network 104, respectively.
The method 600 comprises a step 601 of obtaining, in a training phase of the device 100, a training image set comprising a first subset of labeled source domain images and a second subset of unlabeled target domain images. It further comprises a step 602 of providing a source domain image or a target domain image as an input image 105 to the shared encoder network 101. It then also comprises a step 603 of training the shared encoder network 101 based on the input image 105, wherein the shared encoder network 101 outputs the first feature map 106. Subsequently, it comprises a step 604 of training the first generator network 102 and the second generator network 103 based on the first feature map 106, wherein one of the first generator network 102 and the second generator network 103 outputs the domain-converted input image 107. The method 600 comprises a step 605 of further training the shared encoder network 101 based on the domain-converted input image 107, wherein the shared encoder network 101 outputs the second feature map 306. Finally, it comprises a step 606 of training the semantic segmentation network 104 based on the input image 105 and the domain-converted input image 107.
The invention has been described in connection with various embodiments and implementations as examples. However, other variations to the claimed subject matter can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the invention, and the independent claims. In the claims and in the description, the word "comprising" does not exclude other elements or steps, and the "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (18)

1. A trainable device (100) for semantic image segmentation, characterized in that the device (100) comprises a shared encoder network (101) connected to a first generator network (102), a second generator network (103) and a semantic segmentation network (104), respectively, and in that the device (100) is adapted to, during a training phase:
obtaining a training image set comprising a first subset of labeled source domain images and a second subset of unlabeled target domain images;
-providing a source domain image or a target domain image as an input image (105) to the shared encoder network (101);
Training the shared encoder network (101) based on the input image (105), wherein the shared encoder network (101) is configured to output a first feature map (106);
training the first and second generator networks (102, 103) based on the first feature map (106), wherein one of the first and second generator networks (102, 103) is used to output a domain transformed input image (107);
further training the shared encoder network (101) based on the domain transformed input image (107), wherein the shared encoder network (101) is configured to output a second feature map (306);
-training the semantic segmentation network (104) based on the input image (105) and the domain transformed input image (107).
2. The trainable device (100) according to claim 1, wherein the other of the first generator network (102) and the second generator network (103) is configured to output a first reconstructed input image (304, 404) when the first generator network (102) and the second generator network (103) are trained based on the first feature map (106).
3. The trainable device (100) according to claim 1 or 2, further being configured to, during the training phase:
further training the other of the first generator network (102) and the second generator network (103) based on the second feature map (306), wherein the other of the first generator network (102) and the second generator network (103) is configured to output a second reconstructed input image (305, 405).
4. The trainable device (100) according to claim 2 or 3, characterized in that the first reconstructed input image (304, 404) and/or the second reconstructed input image (305, 405) is obtained by minimizing reconstruction losses.
5. The trainable device (100) according to claim 4, wherein the semantic segmentation network (104) is configured to obtain deeper semantic features based on the first feature map (106) and the second feature map (306), and wherein the minimizing of the reconstruction losses comprises minimizing a semantic feature cycle reconstruction loss, i.e. minimizing feature distances between the deeper semantic features.
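One possible reading of the semantic feature cycle reconstruction loss is an elementwise distance between the deeper semantic features computed from the two feature maps. The sketch below assumes a hypothetical deep_features helper on the segmentation network and an L1 distance; neither the helper nor the distance metric is fixed by the claim.

```python
# Assumed formulation of the semantic feature cycle reconstruction loss.
import torch.nn.functional as F

def semantic_feature_cycle_loss(seg_net, feat_1, feat_2):
    """Distance between the deeper semantic features that the segmentation
    network extracts from the first (106) and second (306) feature maps."""
    deep_1 = seg_net.deep_features(feat_1)   # hypothetical helper exposing deeper features
    deep_2 = seg_net.deep_features(feat_2)
    return F.l1_loss(deep_1, deep_2)
```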
6. The trainable device (100) according to any of claims 2 to 5, wherein the first generator network (102) comprises a source domain generator network (302), the second generator network (103) comprises a target domain generator network (303), and if the input image (105) is a source domain input image selected from the first subset, the device (100) is configured to train the first generator network (102) and the second generator network (103) based on the first feature map (106) by:
training the source domain generator network (302) based on the first feature map (106), wherein the source domain generator network (302) is configured to output the first reconstructed input image (304);
training the target domain generator network (303) based on the first feature map (106), wherein the target domain generator network (303) is configured to output the domain transformed input image (107).
7. The trainable device (100) according to claims 3 and 6, characterized in that the device (100) is configured to train the first generator network (102) and the second generator network (103) based on the second feature map (306) by:
training the source domain generator network (302) based on the second feature map (306), wherein the source domain generator network (302) is configured to output the second reconstructed input image (305).
8. The trainable device (100) according to claim 6 or 7, wherein the semantic segmentation network (104), when being trained based on the source domain input image (105) and the domain transformed input image (107), is configured to minimize segmentation losses based on the source domain input image (105), the domain transformed input image (107) and the labels of the source domain input image (105).
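A plausible realisation of these segmentation losses is to apply the same source label to the segmentation of both the source domain input image and its domain transformed counterpart, e.g. with a pixel-wise cross-entropy. The module names in the sketch below are assumptions for the example.

```python
# Assumed source-domain segmentation losses: one label supervises both views.
import torch.nn.functional as F

def source_segmentation_loss(encoder, seg_head, source_image, translated_image, source_label):
    """Cross-entropy on the source image and on its domain transformed version,
    both supervised by the source label (an illustrative choice of loss)."""
    logits_src = seg_head(encoder(source_image))
    logits_trans = seg_head(encoder(translated_image))
    return (F.cross_entropy(logits_src, source_label)
            + F.cross_entropy(logits_trans, source_label))
```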
9. The trainable device (100) according to any of claims 2 to 5, wherein the first generator network (102) comprises a source domain generator network (302), the second generator network (103) comprises a target domain generator network (303), and if the input image (105) is a target domain input image selected from the second subset, the device (100) is configured to train the first generator network (102) and the second generator network (103) based on the first feature map (106) by:
training the source domain generator network (302) based on the first feature map (106), wherein the source domain generator network (302) is configured to output the domain transformed input image (107);
training the target domain generator network (303) based on the first feature map (106), wherein the target domain generator network (303) is configured to output the first reconstructed input image (404).
10. The trainable device (100) according to claims 3 and 9, characterized in that the device (100) is configured to train the first generator network (102) and the second generator network (103) based on the second feature map (306) by:
training the target domain generator network (303) based on the second feature map (306), wherein the target domain generator network (303) is configured to output the second reconstructed input image (405).
11. The trainable device (100) according to claim 9 or 10, wherein the semantic segmentation network (104) is configured to provide a segmentation result (401) of the target domain input image (105) to a segmentation discriminator (402) of the trainable device (400), the segmentation discriminator (402) being configured to compare the segmentation result (401) with a previous segmentation result of a source domain input image (105).
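The comparison performed by the segmentation discriminator (402) is commonly realised as an adversarial objective. The following sketch assumes a discriminator that outputs a real/fake logit map over softmax segmentation maps; this is one possible instantiation, not the claimed formulation itself.

```python
# Assumed adversarial losses around the segmentation discriminator (402).
import torch
import torch.nn.functional as F

def discriminator_losses(discriminator, target_seg_logits, source_seg_logits):
    """The discriminator learns to tell target-domain segmentation maps from
    source-domain ones; the segmentation network learns to fool it."""
    # The discriminator sees softmax probability maps (an assumption).
    d_src = discriminator(torch.softmax(source_seg_logits.detach(), dim=1))
    d_tgt = discriminator(torch.softmax(target_seg_logits.detach(), dim=1))
    d_loss = (F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src))
              + F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt)))

    # Adversarial term for the segmentation network: make target maps look "source".
    d_tgt_for_g = discriminator(torch.softmax(target_seg_logits, dim=1))
    adv_loss = F.binary_cross_entropy_with_logits(d_tgt_for_g, torch.ones_like(d_tgt_for_g))
    return d_loss, adv_loss
```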
12. The trainable device (100) of any of claims 1 to 11, wherein the trainable device (100) is configured to, during the training phase:
providing each of the source domain images of the first subset and each of the target domain images of the second subset as a sequence of input images (105) to the shared encoder network (101), wherein the source domain images alternate with the target domain images in the sequence of input images (105).
13. The trainable device (100) of claim 12 further configured to, after each source domain image of the first subset and each target domain image of the second subset has been provided to the shared encoder network (101):
operating the semantic segmentation network (104) based on each target domain image of the second subset, wherein the semantic segmentation network (104) is configured to generate a pseudo label (403) for each target domain image of the second subset, to obtain a pseudo-labeled second subset of pseudo-labeled target domain images;
further training the device (100) by providing each of the source domain images of the first subset and each of the pseudo-labeled target domain images of the pseudo-labeled second subset as a further sequence of input images (105) to the shared encoder network (101), wherein the source domain images alternate with the pseudo-labeled target domain images in the further sequence of input images (105).
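The pseudo-labelling pass of this claim could be implemented as in the sketch below. The confidence threshold and the ignore index are assumptions added for the example and are not required by the claim.

```python
# Assumed pseudo-label generation for the unlabeled target domain images.
import torch

@torch.no_grad()
def make_pseudo_labels(encoder, seg_head, target_images, confidence=0.9, ignore_index=255):
    """Label each target-domain image with the trained segmentation network;
    low-confidence pixels are masked out (threshold is an assumption)."""
    pseudo = []
    for image in target_images:                       # each image assumed CHW
        probs = torch.softmax(seg_head(encoder(image.unsqueeze(0))), dim=1)
        conf, labels = probs.max(dim=1)
        labels[conf < confidence] = ignore_index      # ignore uncertain pixels
        pseudo.append(labels.squeeze(0))
    return pseudo
```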
14. The trainable device (100) of any of claims 1 to 13, wherein the labeled source domain images are segmentation-labeled virtual-world images and the unlabeled target domain images are unlabeled real-world images.
15. The trainable device (100) of any of claims 1 to 14, wherein the first feature map (106) and/or the second feature map (306) comprises domain independent segmentation features.
16. The trainable device (100) according to any of claims 1 to 15, characterized in that the device (100) is further adapted to, in an inference phase:
receiving any target domain image;
operating the shared encoder network (101) based on the received target domain image, wherein the shared encoder network (101) is configured to output a third feature map;
operating the semantic segmentation network (104) based on the third feature map, wherein the semantic segmentation network (104) is configured to output a segmentation result of the target domain image.
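In the inference phase only the shared encoder network and the semantic segmentation network are exercised; the generators and any discriminator are no longer needed. A minimal sketch, assuming tensors in CHW layout:

```python
# Assumed inference path: encoder plus segmentation head only.
import torch

@torch.no_grad()
def segment(encoder, seg_head, target_image):
    """Segment an arbitrary target domain image (illustrative sketch)."""
    feat_3 = encoder(target_image.unsqueeze(0))       # third feature map
    logits = seg_head(feat_3)
    return logits.argmax(dim=1).squeeze(0)            # per-pixel class indices
```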
17. A method (600) for training a device (100) for semantic image segmentation, the device (100) comprising a shared encoder network (101) connected to a first generator network (102), a second generator network (103) and a semantic segmentation network (104), respectively, the method (600) comprising, in a training phase of the device (100):
obtaining (601) a training image set comprising a first subset of labeled source domain images and a second subset of unlabeled target domain images;
providing (602) a source domain image or a target domain image as an input image (105) to the shared encoder network (101);
training (603) the shared encoder network (101) based on the input image (105), wherein the shared encoder network (101) is configured to output a first feature map (106);
training (604) the first generator network (102) and the second generator network (103) based on the first feature map (106), wherein one of the first generator network (102) and the second generator network (103) is configured to output a domain transformed input image (107);
further training (605) the shared encoder network (101) based on the domain transformed input image (107), wherein the shared encoder network (101) is configured to output a second feature map (306);
training (606) the semantic segmentation network (104) based on the input image (105) and the domain transformed input image (107).
18. A computer program comprising program code which, when executed on a computer, performs the method (600) according to claim 17.
CN202180094958.XA 2021-03-16 2021-06-18 Domain-adaptive semantic segmentation Pending CN116964641A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
EP21162969 2021-03-16
EP21162969.6 2021-03-16
PCT/EP2021/066531 WO2022194398A1 (en) 2021-03-16 2021-06-18 Domain adaptive semantic segmentation

Publications (1)

Publication Number Publication Date
CN116964641A true CN116964641A (en) 2023-10-27

Family

ID=76624035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180094958.XA Pending CN116964641A (en) 2021-03-16 2021-06-18 Domain-adaptive semantic segmentation

Country Status (2)

Country Link
CN (1) CN116964641A (en)
WO (1) WO2022194398A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934687B (en) * 2023-06-12 2024-02-09 浙江大学 Injection molding product surface defect detection method based on semi-supervised learning semantic segmentation
CN117765372B (en) * 2024-02-22 2024-05-14 广州市易鸿智能装备股份有限公司 Industrial defect sample image generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2022194398A1 (en) 2022-09-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination