TWI840637B - Training method and training system of generative adversarial network for image cross domain conversion

Info

Publication number: TWI840637B
Application number: TW109144098A
Authority: TW (Taiwan)
Prior art keywords: image, loss value, multimodal, generative adversarial, generator
Other languages: Chinese (zh)
Other versions: TW202211167A (en)
Inventors: 林哲聰, 范峻, 郭宗憲
Original assignee: 財團法人工業技術研究院 (Industrial Technology Research Institute)
Application TW109144098A filed by 財團法人工業技術研究院
Publication of TW202211167A
Application granted
Publication of TWI840637B
Abstract

A training method and a training system of a generative adversarial network for image cross-domain conversion are provided. The generative adversarial network performs structure-preserving image cross-domain conversion. The training method includes the following steps. A first real image is input to a first generator to obtain a first generated image; the first real image has a first modality and the first generated image has a second modality. The first generated image is input to a second generator to obtain a first reconstructed image. The parameters of the first generator and the second generator are updated according to a plurality of loss values, so as to achieve multimodal image conversion whose structure remains cycle-consistent or similar.

Description

Training method and training system of a generative adversarial network for performing multimodal image conversion

This disclosure relates to a training method and a training system of a generative adversarial network that performs multimodal image conversion.

In recent years, the rise of deep learning has greatly improved the recognition rates of image recognition technologies. Deep learning models require large amounts of labeled data for training. However, labeled data is among the most valuable and most closely guarded assets of R&D organizations and companies. Even small and medium-sized enterprises capable of developing deep learning models are constrained by manpower and resources, so the amount of data they can label themselves is extremely limited, and their productization ability cannot keep pace with large international companies, especially in applications such as driving-scene object detection, semantic segmentation, and instance segmentation. These applications, semantic segmentation and instance segmentation in particular, require pixel-by-pixel labeling of images, which incurs enormous labor costs.

After the generative adversarial network (GAN) was introduced, deep learning models gained the ability to generate images, and many follow-up studies further evolved the technique so that images can be converted from style (modality) A to style (modality) B. With such powerful image conversion, image segmentation and object detection could in principle achieve so-called cross-domain adaptation: a GAN converts labeled training data from the source domain to the target domain, thereby increasing the generality and robustness of segmentation or detection models in the target domain. However, the main reason existing models still cannot achieve this is that, although existing GANs can change image style drastically, the labeled objects, or the pixels of a given class, often shift position after conversion. The images would then have to be relabeled, which prevents GANs from being applied to style conversion of labeled data.

Therefore, researchers are developing a generative adversarial network for multimodal image conversion that can convert one source-modality image into target-modality images of varying degrees, where the structure of every converted target-modality image remains consistent with, or similar to, the structure before conversion.

This disclosure relates to a training method and a training system of a generative adversarial network that performs multimodal image conversion.

According to one embodiment, a training method of a generative adversarial network that performs multimodal image conversion is provided. The generative adversarial network performs structure-preserving multimodal image conversion. The training method includes the following steps. A first real image is input to a first generator to obtain a first generated image; the first real image has a first modality and the first generated image has a second modality. The first generated image is input to a second generator to obtain a first structure-reconstructed image having the first modality. The first generated image is input to a first discriminator to calculate a first loss value. The first generated image is input to a first image segmentation network to calculate a second loss value. The first structure-reconstructed image is input to a second discriminator to calculate a third loss value. The first structure-reconstructed image is input to a second image segmentation network to calculate a fourth loss value. The parameters of the first generator and the second generator are updated to train the generative adversarial network, the parameters being updated at least according to the first, second, third, and fourth loss values.

According to another embodiment, a training system of a generative adversarial network that performs multimodal image conversion is provided. The generative adversarial network performs structure-preserving multimodal image conversion. The training system includes a first generator, a second generator, a first discriminator, a second discriminator, a first image segmentation network, and a second image segmentation network. The first generator receives a first real image to obtain a first generated image; the first real image has a first modality and the first generated image has a second modality. The second generator receives the first generated image to obtain a first structure-reconstructed image having the first modality. The first discriminator receives the first generated image to calculate a first loss value. The second discriminator receives the first structure-reconstructed image to calculate a second loss value. The first image segmentation network receives the first generated image to calculate a third loss value. The second image segmentation network receives the first structure-reconstructed image to calculate a fourth loss value. The parameters of the first generator and the second generator are updated at least according to the first, second, third, and fourth loss values to train the generative adversarial network.

For a better understanding of the above and other aspects of this disclosure, embodiments are described in detail below with reference to the accompanying drawings:

100, 200: training system of a generative adversarial network performing multimodal image conversion
D1: first discriminator
D2: second discriminator
$D_x$, $D_y$: discriminators
$E_x$, $E_y$, $\hat{E}_x$, $\hat{E}_y$: encoders
FK1: first generated image
FK2: second generated image
G1: first generator
G2: second generator
GT1: first real image
GT2: second real image
$G_x$, $G_y$: generators
LS1: first loss value
LS2: second loss value
LS3: third loss value
LS4: fourth loss value
LS5: fifth loss value
LS6: sixth loss value
LS7: seventh loss value
LS8: eighth loss value
$l_{mce}$: comparison function
N(z): random noise
$P_x$, $P_y$: segmenters
RC1: first structure-reconstructed image
RC2: second structure-reconstructed image
S1: first image segmentation network
S2: second image segmentation network
S110, S120, S130, S140, S150, S160, S170, S210, S220, S230, S240, S250, S260, S270: steps
SG1: segmentation ground truth of the first real image
SG2: segmentation ground truth of the second real image
$x$: first real image
$x_{seg}$: segmentation ground truth of the first real image
$\hat{x}$: second generated image
$\hat{x}_{seg}$: segmentation of the second generated image
$x_{rec}$: first structure-reconstructed image
$x_{rec,seg}$: segmentation of the first structure-reconstructed image
$y$: second real image
$y_{seg}$: segmentation ground truth of the second real image
$\hat{y}$: first generated image
$\hat{y}_{seg}$: segmentation of the first generated image
$y_{rec}$: second structure-reconstructed image
$y_{rec,seg}$: segmentation of the second structure-reconstructed image

FIG. 1 is a block diagram of a training system of a generative adversarial network performing multimodal image conversion according to an embodiment.

FIG. 2 is a flowchart of a training method of a generative adversarial network performing multimodal image conversion according to an embodiment.

FIG. 3 is a block diagram of a training system of a generative adversarial network performing multimodal image conversion according to another embodiment.

FIG. 4 is a flowchart of a training method of a generative adversarial network performing multimodal image conversion according to another embodiment.

FIG. 5 illustrates the network architecture of the generative adversarial network performing multimodal image conversion.

FIG. 6 illustrates an original first real image.

FIG. 7 illustrates a first generated image simulating nighttime after random noise is added.

FIG. 8 illustrates a first structure-reconstructed image.

FIG. 9 illustrates the segmentation ground truth of the first real image.

FIG. 10 illustrates the segmentation of the first generated image.

FIG. 11 illustrates the segmentation result of the first structure-reconstructed image.

This disclosure proposes a training method of a generative adversarial network that performs multimodal image conversion. The trained network performs structure-preserving multimodal image conversion. Here, "structure" refers to the objects in an image and their positions; "structure preservation" means that a generated image retains the objects of the original image and their positions. Taking driving images as an example, daytime driving images can be augmented by the trained network into nighttime driving images that serve as training samples for deep learning. The objects in the augmented nighttime images appear at the same or similar positions as in the daytime images, so the structure is preserved and no relabeling is needed. This both mitigates the drop in driving-scene object detection accuracy caused by insufficient labeled data and avoids the need for large amounts of labeling. In this disclosure, the generative adversarial network is applied to training-sample augmentation. The training samples are, for example, images, and the downstream models are image object detection models of various kinds, whose input is an image and whose output is the positions of objects.

The proposed training method includes a forward cycle and a backward cycle. The forward cycle trains the conversion from the first modality to the second modality; the backward cycle trains the conversion from the second modality back to the first modality. Combining the two cycles makes the conversion in both directions between the two modalities considerably more accurate.

In one embodiment, the training method may include only the forward cycle, which is described first.

Referring to FIG. 1, which is a block diagram of a training system 100 of a generative adversarial network performing multimodal image conversion according to an embodiment. The training system 100 of FIG. 1 contains only the forward cycle. It includes a first generator G1, a second generator G2, a first discriminator D1, a second discriminator D2, a first image segmentation network S1, and a second image segmentation network S2. Their functions are summarized as follows. The first generator G1 and the second generator G2 perform modality and style conversion. The first discriminator D1 and the second discriminator D2 verify the generated data. The first image segmentation network S1 and the second image segmentation network S2 perform object segmentation and labeling; they are, for example, semantic segmentation networks. Each of these components may be implemented as, for example, a circuit, a chip, a circuit board, or a storage device storing program code.

The main purpose of the first generator G1 is to convert the first real image GT1 of the first modality (for example, daytime) into the first generated image FK1 of the second modality (for example, nighttime). Random noise N(z) is injected into G1, and through learning and training the first generated image FK1 must pass the verification of the first discriminator D1. In addition, to keep the structure of FK1 consistent with, or similar to, that of GT1, the first image segmentation network S1 assists in confirming the structure of FK1, ensuring that the structure before and after conversion is preserved or similar. Daytime and nighttime are used above as an example of the two modalities; in another embodiment the two modalities may differ only slightly in style, for example sunny versus cloudy.

Furthermore, to prevent the converted modality or style from becoming too strong, the second generator G2 converts the second-modality first generated image FK1 back into the first-modality first structure-reconstructed image RC1, which must pass the verification of the second discriminator D2. RC1 must also remain structurally consistent with, or similar to, GT1; the second image segmentation network S2 assists in confirming the structure of RC1, ensuring that the structure before and after conversion is consistent or similar. "Structure reconstruction" means that the objects of RC1 and their positions are reconstructed so that they are similar to the objects and positions in GT1. The operation of these components is detailed below with a flowchart.

Referring to FIGS. 1 and 2, FIG. 2 is a flowchart of a training method of a generative adversarial network performing multimodal image conversion according to an embodiment; it contains only the forward cycle. In step S110, as shown in FIG. 1, the first real image GT1 is input to the first generator G1 to obtain the first generated image FK1. GT1 has the first modality and FK1 has the second modality. GT1 is captured by an image sensor, for example an infrared image capture device, a charge-coupled device, a complementary metal-oxide-semiconductor optical sensor, or any combination thereof. Taking a camera that captures driving images as an example, it may be mounted at the front, rear, left, right, or any position that can capture images around the vehicle. The first modality is, for example, daytime; the second modality is, for example, nighttime, rain, sunset, or snow.

As shown in FIG. 1, random noise N(z) is injected into the first generator G1 to obtain the first generated image FK1. Referring to FIGS. 6 to 11: FIG. 6 shows the original first real image GT1, a road scene captured in daytime, and FIG. 7 shows the first generated image FK1 simulating nighttime after random noise N(z) is added. Injecting different amounts of random noise N(z) yields first generated images FK1 of different degrees, such as dusk or late night.

Next, in step S120, as shown in FIG. 1, the first generated image FK1 is input to the second generator G2 to obtain the first structure-reconstructed image RC1, which has the first modality. That is, the first real image GT1 of the first modality is converted into the first generated image FK1 of the second modality and then converted back to the first modality as the first structure-reconstructed image RC1. FIG. 8 illustrates RC1.

Then, in step S130, as shown in FIG. 1, the first generated image FK1 is input to the first discriminator D1 to calculate a first loss value LS1. LS1 indicates whether the converted second-modality image FK1 resembles the second-modality second real image GT2 (labeled in FIG. 3). GT2 need not overlap GT1 completely or almost completely; rough similarity suffices.

Next, in step S140, as shown in FIG. 1, the first generated image FK1 is input to the first image segmentation network S1 to calculate a second loss value LS2. In this step, LS2 is computed from the segmentation ground truth SG1 of the first real image GT1 and the segmentation of FK1. FIG. 9 illustrates the segmentation ground truth SG1 of GT1, and FIG. 10 illustrates the segmentation of FK1. LS2 indicates whether the structure of FK1 is consistent with, or similar to, that of GT1. Taking LS2 into account during training ensures that the object positions in the image FK1 generated by G1 do not change, or change only slightly.

Then, in step S150, as shown in FIG. 1, the first structure-reconstructed image RC1 is input to the second discriminator D2 to calculate a third loss value LS3. LS3 indicates whether the converted first-modality image RC1 resembles the first-modality first real image GT1. Taking LS3 into account during training ensures that the second-modality image FK1 generated by G1 can be correctly restored to the first modality.

Next, in step S160, as shown in FIG. 1, the first structure-reconstructed image RC1 is input to the second image segmentation network S2 to calculate a fourth loss value LS4. In this step, LS4 is computed from the segmentation ground truth SG1 of GT1 and the segmentation of RC1. FIG. 11 illustrates the segmentation result of RC1. LS4 indicates whether the structure of RC1 is consistent with, or similar to, that of GT1; taking it into account during training ensures that the object positions in RC1 generated by G2 do not change, or change only slightly. Steps S130 to S160 are not limited to the order illustrated in FIG. 2; they may be executed simultaneously or in any order, as long as they are executed before step S170.

Then, in step S170, the parameters of the first generator G1 and the second generator G2 are updated to train the generative adversarial network. In the forward-cycle embodiment, the parameters of G1 and G2 are updated according to the first loss value LS1, the second loss value LS2, the third loss value LS3, and the fourth loss value LS4.
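As a concrete illustration of steps S110 through S170, the following is a minimal PyTorch sketch of one forward-cycle generator update. It assumes G1 accepts an image and a noise tensor, the discriminators output probabilities, the segmentation networks output per-pixel class logits, and the four loss values are combined with equal weights; the function and argument names are stand-ins, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def forward_cycle_step(G1, G2, D1, D2, S1, S2, opt_G, gt1, seg_gt1, z):
    """gt1: first real image; seg_gt1: its integer class map (N, H, W)."""
    fk1 = G1(gt1, z)          # S110: first modality -> second modality
    rc1 = G2(fk1)             # S120: convert back to the first modality

    d1_out, d2_out = D1(fk1), D2(rc1)
    # S130: LS1, generator wants D1 to judge FK1 as real
    ls1 = F.binary_cross_entropy(d1_out, torch.ones_like(d1_out))
    # S140: LS2, structure loss against the ground-truth labels of GT1
    ls2 = F.cross_entropy(S1(fk1), seg_gt1)
    # S150: LS3, adversarial loss on the reconstructed image
    ls3 = F.binary_cross_entropy(d2_out, torch.ones_like(d2_out))
    # S160: LS4, structure loss on the reconstructed image
    ls4 = F.cross_entropy(S2(rc1), seg_gt1)

    loss = ls1 + ls2 + ls3 + ls4   # S170: equal weights assumed
    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
    return loss.item()
```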

The above is the forward cycle. During training of the generative adversarial network, LS1 ensures that the converted second-modality image FK1 resembles the second-modality second real image GT2 (labeled in FIG. 3); LS2 ensures that the object positions in FK1 generated by G1 do not change; LS3 ensures that the second-modality image FK1 generated by G1 can be restored to the first modality; and LS4 ensures that the object positions in RC1 generated by G2 do not change.

As noted above, combining the forward cycle and the backward cycle makes the conversion in both directions between the first and second modalities considerably more accurate. The combination of the two cycles is described next.

Referring to FIG. 3, which illustrates a training system 200 of a generative adversarial network performing multimodal image conversion according to another embodiment. The system 200 of FIG. 3 includes both the forward cycle and the backward cycle. It includes the first generator G1, the second generator G2, the first discriminator D1, the second discriminator D2, the first image segmentation network S1, and the second image segmentation network S2. The upper half of FIG. 3 is the forward cycle and the lower half is the backward cycle. The forward cycle is as described above and is not repeated here. The operation of the components is detailed below with a flowchart.

Referring to FIGS. 3 and 4, FIG. 4 is a flowchart of a training method of a generative adversarial network performing multimodal image conversion according to another embodiment, including both the forward and backward cycles. Steps S110 to S160 form the forward cycle and steps S210 to S270 form the backward cycle. The forward cycle of steps S110 to S160 is as described above and is not repeated here.

In step S210, the second real image GT2 is input to the second generator G2 to obtain a second generated image FK2. GT2 has the second modality and FK2 has the first modality. GT2 is captured by an image sensor, for example an infrared image capture device, a charge-coupled device, a complementary metal-oxide-semiconductor optical sensor, or any combination thereof. The first modality is, for example, daytime; the second modality is, for example, nighttime, rain, sunset, or snow.

Next, in step S220, as shown in FIG. 3, the second generated image FK2 is input to the first generator G1 to obtain a second structure-reconstructed image RC2, which has the second modality. As shown in FIG. 3, random noise N(z) is injected into G1 to obtain RC2. That is, the second real image GT2 of the second modality is converted into the first-modality second generated image FK2 and then converted back to the second modality as RC2.

Then, in step S230, as shown in FIG. 3, the second generated image FK2 is input to the second discriminator D2 to calculate a fifth loss value LS5. LS5 indicates whether the converted first-modality image FK2 resembles the first-modality first real image GT1.

Next, in step S240, as shown in FIG. 3, the second generated image FK2 is input to the second image segmentation network S2 to calculate a sixth loss value LS6. In this step, LS6 is computed from the segmentation ground truth SG2 of the second real image GT2 and the segmentation of FK2. LS6 indicates whether the structure of FK2 is consistent with, or similar to, that of GT2. Taking LS6 into account during training ensures that the object positions in FK2 generated by G2 do not change.

Then, in step S250, as shown in FIG. 3, the second structure-reconstructed image RC2 is input to the first discriminator D1 to calculate a seventh loss value LS7. LS7 indicates whether the converted second-modality image RC2 resembles the second-modality second real image GT2. Taking LS7 into account during training ensures that the first-modality image FK2 generated by G2 can be correctly restored to the second modality.

Next, in step S260, as shown in FIG. 3, the second structure-reconstructed image RC2 is input to the first image segmentation network S1 to calculate an eighth loss value LS8. In this step, LS8 is computed from the segmentation ground truth SG2 of GT2 and the segmentation of RC2. LS8 indicates whether the structure of RC2 is consistent with, or similar to, that of GT2. Taking LS8 into account during training ensures that the object positions in RC2 generated by G1 do not change.

Then, in step S270, the parameters of the first generator G1 and the second generator G2 are updated to train the generative adversarial network. In the embodiment combining the forward and backward cycles, the parameters of G1 and G2 are updated according to the first loss value LS1, the second loss value LS2, the third loss value LS3, the fourth loss value LS4, the fifth loss value LS5, the sixth loss value LS6, the seventh loss value LS7, and the eighth loss value LS8.
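As a sketch of step S270, the snippet below assumes the eight scalar loss tensors from both cycles have already been computed (for example with the forward-cycle step shown earlier) and are combined with equal weights, which the patent does not specify.

```python
def full_cycle_update(losses, opt_G):
    """losses: dict with keys 'ls1' through 'ls8' holding scalar tensors."""
    total = sum(losses[f"ls{i}"] for i in range(1, 9))  # LS1 + ... + LS8
    opt_G.zero_grad()
    total.backward()
    opt_G.step()
    return total.item()
```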

The above combines the forward and backward cycles. During training of the network, LS1 ensures that the converted second-modality image FK1 resembles a real second-modality image; LS2 ensures that the object positions in FK1 generated by G1 do not change; LS3 ensures that the second-modality image FK1 generated by G1 can be restored to the first modality; LS4 ensures that the object positions in RC1 generated by G2 do not change; LS5 ensures that the converted first-modality image FK2 resembles the first-modality real image GT1; LS6 ensures that the structure of FK2 matches or approximates that of GT2; LS7 ensures that the converted second-modality image RC2 resembles the second-modality real image GT2; and LS8 ensures that the structure of RC2 matches or approximates that of GT2.

The contents of the first through eighth loss values LS1 to LS8 and the way they are calculated are explained in detail below with a network architecture diagram.

Referring to FIG. 5, which illustrates the network architecture of the generative adversarial network performing multimodal image conversion. The first real image $x$ of the first modality is input to the encoder $E_x$ and the generator $G_x$, random noise $N(z)$ is injected, and the first generated image $\hat{y}$ of the second modality is output. The function of the generator $G_x$ is to convert the first modality into the second modality. The discriminator $D_x$ judges whether the converted first generated image $\hat{y}$ resembles the real second-modality image $y$.

In the forward cycle, the first loss value LS1 is calculated as in equation (1):

$$\mathcal{L}_{GAN}^{\hat{y}} = \mathbb{E}_{y}\left[\log D_x(y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D_x\big(G_x(E_x(x,z),z)\big)\right)\right] \tag{1}$$

In equation (1), the goal of the generator $G_x$ during network learning is to reduce the loss, while the goal of the discriminator $D_x$ is to maximize it. For example, if $D_x$ operates perfectly, then $D_x(y)$ in equation (1) outputs 1, that is, $\log D_x(y) = 0$; and for the data $G_x(E_x(x,z),z)$ produced by the noise-injected generator $G_x$, $D_x(G_x(E_x(x,z),z)) = 0$, so $\log(1 - D_x(G_x(E_x(x,z),z))) = 0$.
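For concreteness, a minimal PyTorch sketch of equation (1) follows; it assumes $D_x$ outputs probabilities and that the generator and encoder are callables taking the arguments shown, none of which is specified beyond the formula itself.

```python
import torch

def gan_loss_ls1(Dx, Gx, Ex, x, y, z, eps=1e-8):
    """Equation (1): E_y[log Dx(y)] + E_{x,z}[log(1 - Dx(Gx(Ex(x,z), z)))].
    Dx tries to maximize this value; Gx (and Ex) try to minimize it."""
    y_hat = Gx(Ex(x, z), z)                        # noise-injected translation
    real_term = torch.log(Dx(y) + eps).mean()      # log Dx(y)
    fake_term = torch.log(1.0 - Dx(y_hat) + eps).mean()
    return real_term + fake_term
```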

The formulation of equation (1) is also used for the seventh loss value LS7, which employs the same generator $G_x$ and discriminator $D_x$ in the backward cycle:

$$\mathcal{L}_{GAN}^{y_{rec}} = \mathbb{E}_{y}\left[\log D_x(y)\right] + \mathbb{E}_{\hat{x},z}\left[\log\left(1 - D_x\big(G_x(E_x(\hat{x},z),z)\big)\right)\right]$$

Here the second generated image $\hat{x}$ of the first modality is input to the encoder $E_x$ and the generator $G_x$, random noise $N(z)$ is injected, and the second structure-reconstructed image $y_{rec}$ of the second modality is output. The main difference between LS1 and LS7, which share $G_x$ and $D_x$, is that the former judges whether the converted second-modality first generated image $\hat{y}$ is realistic enough, whereas the latter judges whether the second-modality second structure-reconstructed image $y_{rec}$ produced in the backward cycle is realistic enough.

Similarly, in the backward cycle, the fifth loss value LS5, produced with the generator $G_y$ and the discriminator $D_y$, is calculated as in equation (2):

$$\mathcal{L}_{GAN}^{\hat{x}} = \mathbb{E}_{x}\left[\log D_y(x)\right] + \mathbb{E}_{y}\left[\log\left(1 - D_y\big(G_y(E_y(y))\big)\right)\right] \tag{2}$$

Here the second real image $y$ of the second modality is input to the encoder $E_y$ and the generator $G_y$, which output the second generated image $\hat{x}$ of the first modality. The function of the generator $G_y$ is to convert the second modality into the first modality. The discriminator $D_y$ judges whether the converted second generated image $\hat{x}$ resembles the real first-modality image $x$. It is worth mentioning that no random noise $N(z)$ needs to be added when converting from the second modality to the first modality.

The formulation of equation (2) is likewise used for the third loss value LS3, which employs the same generator $G_y$ and discriminator $D_y$ in the forward cycle:

$$\mathcal{L}_{GAN}^{x_{rec}} = \mathbb{E}_{x}\left[\log D_y(x)\right] + \mathbb{E}_{\hat{y}}\left[\log\left(1 - D_y\big(G_y(E_y(\hat{y}))\big)\right)\right]$$

In addition, the second loss value LS2 of the forward cycle is calculated as in equation (3):

$$\mathcal{L}_{seg}^{\hat{y}} = l_{mce}\big(P_y(\hat{E}_y(\hat{y})),\ x_{seg}\big) \tag{3}$$

As shown in FIG. 5, the first generated image $\hat{y}$ is input to the encoder $\hat{E}_y$ and the segmenter $P_y$ to obtain the segmentation $\hat{y}_{seg}$, and the segmentation ground truth $x_{seg}$ of the first real image $x$ is compared with $\hat{y}_{seg}$ using the comparison function $l_{mce}$. If the first generated image $\hat{y}$ converted by the generator $G_x$ is realistic and its structure is consistent with, or similar to, that of the first real image $x$, the second loss value LS2 will be quite small.

Similarly, the sixth loss value LS6 of the backward cycle is calculated as in equation (4):

$$\mathcal{L}_{seg}^{\hat{x}} = l_{mce}\big(P_x(\hat{E}_x(\hat{x})),\ y_{seg}\big) \tag{4}$$

As shown in FIG. 5, the second generated image $\hat{x}$ is input to the encoder $\hat{E}_x$ and the segmenter $P_x$ to obtain the segmentation $\hat{x}_{seg}$, and the segmentation ground truth $y_{seg}$ of the second real image $y$ is compared with $\hat{x}_{seg}$. If the second generated image $\hat{x}$ converted by the generator $G_y$ is realistic and its structure is consistent with that of the second real image $y$, the sixth loss value LS6 will be quite small.

In addition, the fourth loss value LS4 of the forward cycle is calculated as in equation (5):

$$\mathcal{L}_{seg}^{x_{rec}} = l_{mce}\big(P_x(\hat{E}_x(x_{rec})),\ x_{seg}\big) \tag{5}$$

As shown in FIG. 5, the first structure-reconstructed image $x_{rec}$ is input to the encoder $\hat{E}_x$ and the segmenter $P_x$ to obtain the segmentation $x_{rec,seg}$, which is compared with the segmentation ground truth $x_{seg}$ of the first real image $x$. If the structures are consistent or similar, the fourth loss value LS4 will be quite small.

Similarly, the eighth loss value LS8 of the backward cycle is calculated as in equation (6):

$$\mathcal{L}_{seg}^{y_{rec}} = l_{mce}\big(P_y(\hat{E}_y(y_{rec})),\ y_{seg}\big) \tag{6}$$

Finally, the objective function of the whole network combines the eight loss values, as in equation (7):

$$\mathcal{L} = \mathcal{L}_{GAN}^{\hat{y}} + \mathcal{L}_{GAN}^{\hat{x}} + \mathcal{L}_{GAN}^{x_{rec}} + \mathcal{L}_{GAN}^{y_{rec}} + \mathcal{L}_{seg}^{\hat{y}} + \mathcal{L}_{seg}^{\hat{x}} + \mathcal{L}_{seg}^{x_{rec}} + \mathcal{L}_{seg}^{y_{rec}} \tag{7}$$

The goal of optimizing the whole network is to find, from equation (8), the optimal $G_x$, $G_y$, $P_x$, $P_y$, $E_x$, $E_y$, $\hat{E}_x$, $\hat{E}_y$, $D_x$, and $D_y$:

$$\arg\min_{G_x,G_y,P_x,P_y,E_x,E_y,\hat{E}_x,\hat{E}_y}\ \max_{D_x,D_y}\ \mathcal{L} \tag{8}$$
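In practice, a min-max objective like equation (8) is usually optimized by alternating gradient steps between the two groups of players. A minimal sketch under that assumption follows; the optimizer choice and learning rate are not taken from the patent.

```python
import itertools
import torch

def make_optimizers(gen_side, disc_side, lr=2e-4):
    """gen_side: modules minimizing L (Gx, Gy, Px, Py and the encoders);
    disc_side: modules maximizing L (Dx, Dy)."""
    opt_g = torch.optim.Adam(
        itertools.chain(*(m.parameters() for m in gen_side)), lr=lr)
    opt_d = torch.optim.Adam(
        itertools.chain(*(m.parameters() for m in disc_side)), lr=lr)
    return opt_g, opt_d

def alternating_step(total_loss_fn, opt_g, opt_d):
    # Discriminator ascent: maximize L by minimizing -L.
    opt_d.zero_grad()
    (-total_loss_fn()).backward()
    opt_d.step()
    # Generator/encoder/segmenter descent: minimize L.
    # (The loss is recomputed to build a fresh computation graph.)
    opt_g.zero_grad()
    total_loss_fn().backward()
    opt_g.step()
```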

After the whole network has been trained, feeding the same first-modality image together with different random noise $N(z)$ into the generator $G_x$ produces several second-modality images whose structure matches that of the first-modality image, while the overall brightness of the images and the brightness of the car headlights in them differ.
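A minimal sketch of this augmentation use of the trained network, assuming the noise is drawn from a standard normal distribution (the patent does not state the distribution) and that Gx and Ex are the trained modules:

```python
import torch

@torch.no_grad()
def augment_to_target_modality(Gx, Ex, x, num_variants=4, z_dim=8):
    """Generate several target-modality variants of one labeled source image
    x; the labels of x can be reused because the structure is preserved."""
    variants = []
    for _ in range(num_variants):
        z = torch.randn(x.size(0), z_dim)  # different noise -> different degree
        variants.append(Gx(Ex(x, z), z))   # e.g. dusk vs. late night
    return variants
```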

The disclosed technique can be used for offline training of driving-scene object detection, semantic segmentation, and instance segmentation models, which may be implemented with machine learning and deep learning techniques.

According to the above description, this disclosure has at least the following advantages: the generative adversarial network can convert between modalities of various degrees (such as driving weather), and the label positions in the images do not change, or change only slightly.

Because the label positions in the generated images do not change, or change only slightly, the labels of the original real images remain usable, which greatly reduces manual labeling cost.

Moreover, one real image and its labels can be converted into several generated images of other modalities, which can be used to train, or to improve the recognition rate of, object detection, semantic segmentation, and instance segmentation models in those modalities.

In summary, although this disclosure has been described above by way of embodiments, they are not intended to limit it. Those of ordinary skill in the art may make various modifications and refinements without departing from the spirit and scope of this disclosure, whose scope of protection is defined by the appended claims.

Claims (26)

一種執行影像多模態轉換之生成式對抗網路的訓練方法,該生成式對抗網路用以進行結構保存之影像多模態轉換,執行影像多模態轉換之該生成式對抗網路的該訓練方法包括:輸入一第一真實影像至一第一生成器,以獲得一第一生成影像,該第一真實影像具有一第一模態,該第一生成影像具有一第二模態;該第一生成器輸入該第一生成影像至一第二生成器,以獲得一第一結構重建影像,該第一結構重建影像具有該第一模態;該第一生成器輸入該第一生成影像至一第一鑑別器,以計算一第一損失值;該第一生成器輸入該第一生成影像至一第一影像分割網路,以計算一第二損失值;該第二生成器輸入該第一結構重建影像至一第二鑑別器,以計算一第三損失值;該第二生成器輸入該第一結構重建影像至一第二影像分割網路,以計算一第四損失值;以及該第一生成器及該第二生成器更新該第一生成器及該第二生成器的參數,以訓練執行影像多模態轉換之該生成式對抗網路,該第一生成器及該第二生成器的參數至少依據該第一損失值、該第二損失值、該第三損失值及該第四損失值進行更新,更新後之 該生成式對抗網路用以進行結構保存之影像多模態轉換,以使擴增之影像無須重新標記。 A training method for a generative adversarial network for performing image multimodal conversion, wherein the generative adversarial network is used for performing structure-preserving image multimodal conversion, and the training method for the generative adversarial network for performing image multimodal conversion includes: inputting a first real image to a first generator to obtain a first generated image, wherein the first real image has a first mode, and the first generated image has a second mode; the first generator inputs the first generated image to a second generator to obtain a first structural reconstruction image, wherein the first structural reconstruction image has the first mode; the first generator inputs the first generated image to a first discriminator to calculate a first loss value; the first generator inputs the first generated image to a first discriminator to calculate a first loss value; The first generator inputs the first structure reconstructed image to a second discriminator to calculate a third loss value; the second generator inputs the first structure reconstructed image to a second image segmentation network to calculate a fourth loss value; and the first generator and the second generator update the parameters of the first generator and the second generator to train the generative adversarial network for performing image multimodal conversion, the parameters of the first generator and the second generator are updated at least according to the first loss value, the second loss value, the third loss value and the fourth loss value, and the updated generative adversarial network is used for performing structure-preserving image multimodal conversion so that the expanded image does not need to be re-labeled. 如請求項1所述之執行影像多模態轉換之生成式對抗網路的訓練方法,其中該第一損失值根據一第二真實影像進行計算。 A method for training a generative adversarial network for performing image multimodal conversion as described in claim 1, wherein the first loss value is calculated based on a second real image. 如請求項1所述之執行影像多模態轉換之生成式對抗網路的訓練方法,其中該第二損失值根據該第一真實影像之一影像分割真值進行計算。 A method for training a generative adversarial network for performing multimodal image conversion as described in claim 1, wherein the second loss value is calculated based on a true image segmentation value of the first true image. 如請求項1所述之執行影像多模態轉換之生成式對抗網路的訓練方法,其中該第三損失值根據該第一真實影像進行計算。 A method for training a generative adversarial network for performing image multimodal transformation as described in claim 1, wherein the third loss value is calculated based on the first true image. 如請求項1所述之執行影像多模態轉換之生成式對抗網路的訓練方法,其中該第四損失值根據該第一真實影像之一影像分割真值進行計算。 A method for training a generative adversarial network for performing multimodal image conversion as described in claim 1, wherein the fourth loss value is calculated based on a true image segmentation value of the first true image. 
如請求項1所述之執行影像多模態轉換之生成式對抗網路的訓練方法,更包括:輸入一第二真實影像至該第二生成器,以獲得一第二生成影像,該第二真實影像具有該第二模態,該第二生成影像具有該第一模態; 輸入該第二生成影像至該第一生成器,以獲得一第二結構重建影像,該第二結構重建影像具有該第二模態;輸入該第二生成影像至該第二鑑別器,以計算一第五損失值;輸入該第二生成影像至該第二影像分割網路,以計算一第六損失值;輸入該第二結構重建影像至該第一鑑別器,以計算一第七損失值;以及輸入該第二結構重建影像至該第一影像分割網路,以計算一第八損失值;其中該第一生成器及該第二生成器的參數依據該第一損失值、該第二損失值、該第三損失值、該第四損失值、該第五損失值、該第六損失值、該第七損失值及該第八損失值進行更新。 The training method of the generative adversarial network for performing image multimodal conversion as described in claim 1 further includes: inputting a second real image to the second generator to obtain a second generated image, the second real image has the second modality, and the second generated image has the first modality; inputting the second generated image to the first generator to obtain a second structural reconstruction image, the second structural reconstruction image has the second modality; inputting the second generated image to the second discriminator to calculate a fifth loss value; inputting the The second generated image is input to the second image segmentation network to calculate a sixth loss value; the second structure reconstructed image is input to the first discriminator to calculate a seventh loss value; and the second structure reconstructed image is input to the first image segmentation network to calculate an eighth loss value; wherein the parameters of the first generator and the second generator are updated according to the first loss value, the second loss value, the third loss value, the fourth loss value, the fifth loss value, the sixth loss value, the seventh loss value and the eighth loss value. 如請求項6所述之執行影像多模態轉換之生成式對抗網路的訓練方法,其中該第五損失值根據該第一真實影像進行計算。 A method for training a generative adversarial network for performing image multimodal conversion as described in claim 6, wherein the fifth loss value is calculated based on the first true image. 如請求項6所述之執行影像多模態轉換之生成式對抗網路的訓練方法,其中該第六損失值根據該第二真實影像之一影像分割真值進行計算。 A method for training a generative adversarial network for performing image multimodal conversion as described in claim 6, wherein the sixth loss value is calculated based on an image segmentation true value of the second real image. 如請求項6所述之執行影像多模態轉換之生成式對抗網路的訓練方法,其中該第七損失值根據該第二真實影像進行計算。 A method for training a generative adversarial network for performing image multimodal conversion as described in claim 6, wherein the seventh loss value is calculated based on the second real image. 如請求項6所述之執行影像多模態轉換之生成式對抗網路的訓練方法,其中該第八損失值根據該第二真實影像之一影像分割真值進行計算。 A method for training a generative adversarial network for performing image multimodal conversion as described in claim 6, wherein the eighth loss value is calculated based on an image segmentation truth value of the second real image. 如請求項1所述執行影像多模態轉換之生成式對抗網路的訓練方法,其中該第一模態與該第二模態包括白天、夜晚、雨天、下雪天。 A method for training a generative adversarial network for performing multimodal image conversion as described in claim 1, wherein the first mode and the second mode include daytime, nighttime, rainy days, and snowy days. 如請求項1所述執行影像多模態轉換之生成式對抗網路的訓練方法,其中該第一影像分割網路及該第二影像分割網路為語意分割網路。 A method for training a generative adversarial network for performing image multimodal conversion as described in claim 1, wherein the first image segmentation network and the second image segmentation network are semantic segmentation networks. 
13. The training method according to claim 1, wherein the first generator further obtains the first generated image according to a random noise.

14. A training system of a generative adversarial network for performing multimodal image conversion, the generative adversarial network performing structure-preserving multimodal image conversion, the training system comprising: a first generator that receives a first real image to obtain a first generated image, the first real image having a first modality and the first generated image having a second modality; a second generator that receives the first generated image to obtain a first structure-reconstructed image, the first structure-reconstructed image having the first modality; a first discriminator that receives the first generated image to calculate a first loss value; a second discriminator that receives the first structure-reconstructed image to calculate a second loss value; a first image segmentation network that receives the first generated image to calculate a third loss value; and a second image segmentation network that receives the first structure-reconstructed image to calculate a fourth loss value; wherein parameters of the first generator and the second generator are updated at least according to the first loss value, the second loss value, the third loss value, and the fourth loss value to train the generative adversarial network, and the updated generative adversarial network performs structure-preserving multimodal image conversion so that the augmented images need not be re-labeled.

15. The training system according to claim 14, wherein the first loss value is calculated according to a second real image.

16. The training system according to claim 14, wherein the second loss value is calculated according to the first real image.

17. The training system according to claim 14, wherein the third loss value is calculated according to an image segmentation ground truth of the first real image.

18. The training system according to claim 14, wherein the fourth loss value is calculated according to an image segmentation ground truth of the first real image.
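Claims 13 and 26 further condition the first generator on random noise, which allows several distinct second-modality outputs for a single input image. The claims do not specify how the noise enters the generator; concatenating a sampled noise map as an extra input channel, as sketched below, is merely one common scheme and an assumption here.

```python
import torch

def generate_with_noise(G1, real_a):
    # Sample one Gaussian noise channel per image and stack it onto the
    # input; in this scheme G1 must accept C+1 input channels.
    n, _, h, w = real_a.shape
    z = torch.randn(n, 1, h, w, device=real_a.device)
    return G1(torch.cat([real_a, z], dim=1))
```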
19. The training system according to claim 14, wherein: the second generator further receives a second real image to obtain a second generated image, the second real image having the second modality and the second generated image having the first modality; the first generator further receives the second generated image to obtain a second structure-reconstructed image, the second structure-reconstructed image having the second modality; the second discriminator further receives the second generated image to calculate a fifth loss value; the first discriminator further receives the second structure-reconstructed image to calculate a sixth loss value; the second image segmentation network further receives the second generated image to calculate a seventh loss value; and the first image segmentation network further receives the second structure-reconstructed image to calculate an eighth loss value; wherein the parameters of the first generator and the second generator are updated according to the first loss value, the second loss value, the third loss value, the fourth loss value, the fifth loss value, the sixth loss value, the seventh loss value, and the eighth loss value to train the generative adversarial network.

20. The training system according to claim 19, wherein the fifth loss value is calculated according to the first real image.

21. The training system according to claim 19, wherein the sixth loss value is calculated according to the second real image.

22. The training system according to claim 19, wherein the seventh loss value is calculated according to an image segmentation ground truth of the second real image.

23. The training system according to claim 19, wherein the eighth loss value is calculated according to an image segmentation ground truth of the second real image.

24. The training system according to claim 14, wherein the first modality and the second modality include daytime, nighttime, sunny days, rainy days, and snowy days.

25. The training system according to claim 14, wherein the first image segmentation network and the second image segmentation network are semantic segmentation networks.
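The system claims (14 and 19) recite the same computation as the method claims, but as a fixed arrangement of six modules rather than a sequence of steps. A minimal container mirroring that arrangement is sketched below; the class name and the abstract treatment of the sub-modules are assumptions, since the claims do not constrain their internals.

```python
import torch.nn as nn

class MultimodalGANTrainingSystem(nn.Module):
    """Container for the six trainable modules recited in claims 14 and 19."""

    def __init__(self, G1, G2, D1, D2, S1, S2):
        super().__init__()
        self.G1, self.G2 = G1, G2  # generators: first -> second and second -> first modality
        self.D1, self.D2 = D1, D2  # D1 judges second-modality images, D2 first-modality images
        self.S1, self.S2 = S1, S2  # first and second image segmentation networks

    def forward(self, real_a, real_b):
        # Wiring of claim 14 (first -> second -> first) and
        # claim 19 (second -> first -> second).
        fake_b = self.G1(real_a)
        recon_a = self.G2(fake_b)
        fake_a = self.G2(real_b)
        recon_b = self.G1(fake_a)
        return fake_b, recon_a, fake_a, recon_b
```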
26. The training system according to claim 14, wherein the first generator further obtains the first generated image according to a random noise.
TW109144098A 2020-09-04 2020-12-14 Training method and training system of generative adversarial network for image cross domain conversion TWI840637B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063074494P 2020-09-04 2020-09-04
US63/074,494 2020-09-04

Publications (2)

Publication Number Publication Date
TW202211167A TW202211167A (en) 2022-03-16
TWI840637B true TWI840637B (en) 2024-05-01

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532871A (en) 2019-07-24 2019-12-03 Huawei Technologies Co., Ltd. Method and apparatus for image processing

Similar Documents

Publication Publication Date Title
Wang et al. SFNet-N: An improved SFNet algorithm for semantic segmentation of low-light autonomous driving road scenes
CN110264416B (en) Sparse point cloud segmentation method and device
CN110909666B (en) Night vehicle detection method based on improved YOLOv3 convolutional neural network
CN111292264B (en) Image high dynamic range reconstruction method based on deep learning
WO2022111219A1 (en) Domain adaptation device operation and maintenance system and method
CN110738121A (en) front vehicle detection method and detection system
CN111797688A (en) Visual SLAM method based on optical flow and semantic segmentation
CN112801027A (en) Vehicle target detection method based on event camera
CN111008608B (en) Night vehicle detection method based on deep learning
CN111353988B (en) KNN dynamic self-adaptive double-image convolution image segmentation method and system
WO2023207778A1 (en) Data recovery method and device, computer, and storage medium
CN115376024A (en) Semantic segmentation method for power accessory of power transmission line
Ren et al. Environment influences on uncertainty of object detection for automated driving systems
Tang et al. HIC-YOLOv5: Improved YOLOv5 For Small Object Detection
CN113129370B (en) Semi-supervised object pose estimation method combining generated data and label-free data
TWI840637B (en) Training method and training system of generative adversarial network for image cross domain conversion
CN111680640B (en) Vehicle type identification method and system based on domain migration
CN111160282B (en) Traffic light detection method based on binary Yolov3 network
CN113326734A (en) Rotary target detection method based on YOLOv5
CN110826392B (en) Cross-modal pedestrian detection method combined with context information
TW202211167A (en) Training method and training system of generative adversarial network for image cross domain conversion
CN117011722A (en) License plate recognition method and device based on unmanned aerial vehicle real-time monitoring video
WO2024000728A1 (en) Monocular three-dimensional plane recovery method, device, and storage medium
CN116152117B (en) Underground low-light image enhancement method based on Transformer
CN112634135B (en) Remote sensing image super-resolution reconstruction method based on super-resolution style migration network