KR102437959B1

KR102437959B1 - Device for Unsupervised Domain Adaptation in Semantic Segmentation Exploiting Inter-pixel Correlations and Driving Method Thereof

Info

Publication number: KR102437959B1
Application number: KR1020220035170A
Authority: KR
Inventors: 정인섭; 유자연; 곽노준; 나종근
Original assignee: 주식회사 스누아이랩; 서울대학교 산학협력단
Priority date: 2022-03-22
Filing date: 2022-03-22
Publication date: 2022-08-30

Abstract

The present invention relates to a device for unsupervised domain adaptation in semantic segmentation exploiting inter-pixel correlations and a driving method thereof. According to an embodiment, the device comprises: a storage unit which uses the label information of a labeled source image to segment the domain contained in the source image by semantic class and stores the resulting feature information as source domain data; and a control unit which generates pseudo-label information for an unlabeled target image by using a learning model trained on the stored source domain data, and performs learning by adapting the domain of the target image to the stored source domain data while compensating for erroneous parts of the generated pseudo-label information based on the inter-pixel correlations learned from the source domain data. Because the segmentation network learns the inter-pixel correlations learned by a self-attention module, the performance of unsupervised domain adaptation in semantic segmentation can be improved.

Description

Device for Unsupervised Domain Adaptation in Semantic Segmentation Exploiting Inter-pixel Correlations and Driving Method Thereof

The present invention relates to a method and apparatus for unsupervised domain adaptation in semantic segmentation that exploits inter-pixel correlations. More specifically, it presents a method for compensating for the defects of sparse and incorrect pseudo labels when, for example, pseudo labels are created for an unlabeled target domain using a naively pre-trained model and those pseudo labels are then used to train on the target domain. A self-attention module is applied to the values predicted by the model to build a self-attention map, which is used to produce an enhanced prediction that reflects the correlations between pixels; the model is then trained so that its own prediction follows this enhanced prediction, which improves performance. The invention thus exploits inter-pixel correlation, a property that is invariant across domains.

Unsupervised domain adaptation (UDA) has become one of the main approaches to real-world problems in which labels are scarce, such as simulation-based training for robots and autonomous driving. It transfers the knowledge of a deep neural network trained on a richly labeled source domain to a target domain that has no label information. Typically, the source domain is a photorealistic synthetic dataset and the target domain consists of real-world images. UDA is particularly useful for semantic segmentation, because segmentation otherwise requires a laborious and time-consuming dense pixel-level labeling process.

Recently, self-training has become the dominant approach to UDA for semantic segmentation. The present method also addresses UDA for semantic segmentation on the basis of self-training, but introduces a new, simple, and effective way of transferring the inter-pixel correlations of the source domain to the target domain. The idea of self-training is to provide supervision for the unlabeled target domain as well: a trained model is used to generate a set of pseudo labels for the target domain, and a new model is then retrained with the generated pseudo labels.

This process is repeated with the retrained model to improve performance. However, because pseudo labels are assigned only to confident pixels whose predicted confidence exceeds a certain threshold, they are not always accurate and they are sparse, leaving some pixels without any pseudo label. To address these problems, recent studies have proposed generating better pseudo labels by denoising, correcting, and densifying them. These methods, however, are rather heuristic and require complex algorithms.
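The thresholding step described above can be illustrated with a minimal sketch. It assumes per-pixel class probabilities are already available as plain lists; the `IGNORE` marker, the 0.9 threshold, and the function names are hypothetical choices for illustration, not values taken from the patent.

```python
IGNORE = 255  # hypothetical marker for "no pseudo label at this pixel"

def make_pseudo_labels(probs, threshold=0.9):
    """Assign the arg-max class to confident pixels and IGNORE to the rest.

    probs: list of per-pixel class-probability lists (one list per pixel).
    """
    labels = []
    for p in probs:
        conf = max(p)
        labels.append(p.index(conf) if conf > threshold else IGNORE)
    return labels

def coverage(labels):
    """Fraction of pixels that actually received a pseudo label."""
    return sum(label != IGNORE for label in labels) / len(labels)

# Four pixels, three classes: only two pixels are confident enough.
probs = [[0.97, 0.02, 0.01],
         [0.50, 0.45, 0.05],
         [0.05, 0.92, 0.03],
         [0.34, 0.33, 0.33]]
labels = make_pseudo_labels(probs)
```

Pixels that fall below the threshold are the "holes" the text refers to, and a confident pixel can still carry the wrong class, which is the second defect.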

In addition, a pseudo label only indicates the class (for example, the object category) that a pixel should be predicted as; it gives no information about how each pixel should relate to, or resemble, other pixels. Because the output of a semantic segmentation task is highly structured and correlated, providing guidance on the correlations between pixels is of great importance.

Unsupervised domain adaptation techniques are known in the prior art. Unsupervised domain adaptation has been widely studied as a way to close the domain gap between labeled source data and unlabeled target data. Because domain adaptation is an important problem in real-world settings, it has been studied for various tasks such as classification and segmentation. There are three main lines of domain adaptation approaches: adversarial learning, image translation, and self-training. Grounded in theoretical analysis, adversarial learning methods adapt the segmentation model to the target domain by reducing the segmentation loss on the source domain while simultaneously increasing the domain classification loss of a discriminator. Image translation methods reduce the domain gap at the image level by rendering source-domain images in a style or texture similar to that of target-domain images; translation into the target-domain style using CycleGAN and the Fourier transform has been proposed.

Self-training techniques have also been disclosed in the prior art. Self-training, another line of unsupervised domain adaptation, was first introduced for semantic segmentation. Pseudo labels are generated with a model trained on the source domain, and the model is retrained on the pseudo-labeled target-domain images, which implicitly encourages class-wise feature alignment and allows the model to adapt to the target domain. However, the domain discrepancy makes the pseudo labels noisy, and many methods have been proposed to address this problem.

In this regard, one prior approach introduces a regularization method that generates soft pseudo labels and smooths the model output to prevent overfitting to overconfident labels. Other prior approaches lower the weight of noisy pseudo labels based, respectively, on the feature distance from each class prototype and on the variance between the outputs of two segmentation models. Another prior technique introduces a transition probability map between ground-truth labels and pseudo labels, which has the same effect as correcting the pseudo labels based on a prior distribution. To address another limitation of self-training, the sparsity of confident pseudo labels, densification of pseudo labels using the confident pseudo labels of neighboring pixels has also been proposed.

Furthermore, self-attention techniques for vision are known in the prior art. Self-attention computes the response at a position in a sequence (for example, an image) by interacting with all positions and attending more to the salient ones. There have been several attempts to introduce attention modules into convolutional networks. Non-local networks capture long-range dependencies by computing the interaction between any two positions regardless of their distance. Another technique refines features using channel and spatial attention for various vision tasks. Recently, transformer-based models, which are built from attention mechanisms rather than convolutional architectures, have also been actively studied for vision tasks.
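As a rough illustration of the general self-attention idea described above (not the specific module of this patent), the response at each position can be computed as a softmax-weighted sum over all positions. The sketch below is a simplification under stated assumptions: it uses the raw feature vectors as queries, keys, and values, with no learned projections and no scaling, and all names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(features):
    """Return (responses, attention_map) for N position-wise feature vectors.

    Each response is the attention-weighted sum of all feature vectors,
    so distant positions can contribute as much as nearby ones.
    """
    n = len(features)
    responses, attention_map = [], []
    for query in features:
        scores = [sum(a * b for a, b in zip(query, key)) for key in features]
        weights = softmax(scores)
        attention_map.append(weights)
        responses.append([sum(weights[j] * features[j][d] for j in range(n))
                          for d in range(len(query))])
    return responses, attention_map

# Positions 0 and 1 are similar; position 2 differs.
feats = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
out, attn = self_attention(feats)
```

Each row of `attn` sums to one, and position 0 puts almost all of its weight on positions 0 and 1, which share its features.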

As described above, in the prior art pseudo labels are created for an unlabeled target domain using a naively pre-trained model, and supervising the target domain with these pseudo labels can improve performance substantially.

However, the pseudo labels still contain many holes (for example, the black regions), and because they are values predicted by a model, they also contain many errors. In other words, the supervision they provide cannot be regarded as 100% accurate. As described above, prior approaches have tried to overcome these limitations by correcting or densifying the pseudo labels, but the results remain far from sufficient.

Korean Patent Publication No. 10-2021-0129503 (October 28, 2021)
Korean Patent Publication No. 10-2022-0008208 (January 20, 2022)

An object of embodiments of the present invention is to provide a method and apparatus for unsupervised domain adaptation in semantic segmentation exploiting inter-pixel correlations, in which, when pseudo labels are created for an unlabeled target domain using a naively pre-trained model (for example, a model trained by prediction) and used as labels that supervise the target domain to improve performance, the defects of sparse and incorrect pseudo labels are compensated for by using inter-pixel correlation, which is a domain-invariant property.

An apparatus for unsupervised domain adaptation in semantic segmentation exploiting inter-pixel correlations according to an embodiment of the present invention includes: a storage unit that uses the label information of a labeled source image to segment the domain contained in the source image by semantic class and stores the resulting feature information as source domain data; and a control unit that generates pseudo-label information for an unlabeled target image using a learning model trained on the stored source domain data, and performs learning by adapting the domain of the target image to the stored source domain data in a manner that compensates for erroneous parts of the generated pseudo-label information based on the inter-pixel correlations learned from the source domain data.

When learning the inter-pixel correlations using the source domain data, the control unit may learn them only from source domain data that has ground-truth (GT) label information.

The control unit may use, as the GT label information, information having a domain-invariant property (the correlation between pixels); specifically, it uses the GT labels of the source domain.

The control unit may generate target domain data by matching the semantically segmented domain of the target image with the pseudo-label information in which the erroneous parts have been corrected, and may use the generated target domain data when learning a new target image.

The control unit may synchronize data by coupling the output of a segmentation unit, which segments the domain of the target image by semantic class, with the output of a self-attention unit, which learns the inter-pixel correlations.

A method of driving an apparatus for unsupervised domain adaptation in semantic segmentation exploiting inter-pixel correlations according to an embodiment of the present invention includes: a step in which a storage unit uses the label information of a labeled source image to segment the domain contained in the source image by semantic class and stores the resulting feature information as source domain data; and a step in which a control unit generates pseudo-label information for an unlabeled target image using a learning model trained on the stored source domain data, and performs learning by adapting the domain of the target image to the stored source domain data in a manner that compensates for erroneous parts of the generated pseudo-label information based on the inter-pixel correlations learned from the source domain data.

In the step of performing the learning, when the inter-pixel correlations are learned using the source domain data, they may be learned only from source domain data that has ground-truth (GT) label information.

In the step of performing the learning, information having a domain-invariant property (the correlation between pixels) may be used as the GT label information; specifically, the GT labels of the source domain may be used.

The step of performing the learning may include: generating target domain data by matching the semantically segmented domain of the target image with the pseudo-label information in which the erroneous parts have been corrected; and using the generated target domain data when learning a new target image.

In the step of performing the learning, data may be synchronized by coupling the output of a segmentation unit, which segments the domain of the target image by semantic class, with the output of a self-attention unit, which learns the inter-pixel correlations.

According to an embodiment of the present invention, because the segmentation network learns the inter-pixel correlations learned by the self-attention module, the performance of unsupervised domain adaptation for semantic segmentation can be improved.

FIG. 1 is a block diagram illustrating the detailed structure of an unsupervised domain adaptation apparatus according to an embodiment of the present invention;
FIG. 2 shows examples of a source image, a target image, and a GT image used in the unsupervised domain adaptation setting;
FIG. 3 shows an example of pseudo labels generated by the conventional pseudo-label-based self-training approach;
FIG. 4 shows the framework of a self-attention module according to an embodiment of the present invention;
FIG. 5 is a diagram for explaining the training of the self-attention module and the segmentation network;
FIG. 6 compares visualizations of attention and prediction; and
FIG. 7 is a flowchart illustrating the driving process of an unsupervised domain adaptation apparatus according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram illustrating the detailed structure of an unsupervised domain adaptation apparatus according to an embodiment of the present invention, FIG. 2 shows examples of a source image, a target image, and a GT image used in the unsupervised domain adaptation setting, and FIG. 3 shows an example of pseudo labels generated by the conventional pseudo-label-based self-training approach.

As shown in FIG. 1, the unsupervised domain adaptation apparatus 90 according to an embodiment of the present invention, which is an apparatus for unsupervised domain adaptation in semantic segmentation exploiting inter-pixel correlations, includes some or all of a communication interface unit 100, a control unit 110, an unsupervised domain adaptation unit 120, and a storage unit 130.

Here, "includes some or all" means that the unsupervised domain adaptation apparatus 90 may be configured with some components, such as the storage unit 130, omitted, or with some components, such as the unsupervised domain adaptation unit 120, integrated into another component such as the control unit 110. For a full understanding of the invention, the apparatus is described as including all of them.

Before the detailed description, note that the unsupervised domain adaptation apparatus 90 according to an embodiment of the present invention is an apparatus in which a (learning) model undergoes supervised learning on a source domain while simultaneously being adapted, without supervision, to a target domain. It may be configured in an AI-equipped autonomous vehicle, a desktop or laptop computer, a server, or the like, where the server may operate while connected to a communication network. The source domain is a domain containing pairs of data and labels, whereas the target domain contains data only. The source domain may correspond to data obtained from, for example, game footage, and the target domain may contain real-world data such as camera footage. When an embodiment is applied to autonomous vehicles, for example, the source domain data may be built and stored in advance, while the target domain consists of real-world images captured by a camera and may contain no labels. The source domain is therefore used to adapt to the real-world target domain.

Referring to FIG. 1, the communication interface unit 100 communicates with peripheral devices when, for example, the unsupervised domain adaptation apparatus 90 is configured in a computer or a server. For example, when the unsupervised domain adaptation apparatus 90 is configured in an autonomous vehicle, it may receive images captured by a camera for learning and provide them to the control unit 110.

The communication interface unit 100 may receive data without any separate compression (encoding) operation. In an autonomous vehicle, for example, data from an imaging device such as a camera need not be compressed before it is received, so it can be received without a separate compression step. When the apparatus is configured in a server or the like, on the other hand, the unit may perform operations such as modulation/demodulation, multiplexing/demultiplexing, and encoding/decoding in order to communicate over a communication network with terminal devices such as a user's computer; since this is obvious to those skilled in the art, further description is omitted.

The control unit 110 performs overall control of the communication interface unit 100, the unsupervised domain adaptation unit 120, and the storage unit 130 of FIG. 1. For example, when source domain data and target domain data are stored separately in the storage unit 130, the control unit 110 may retrieve data from the storage unit 130 and provide it at the request of the unsupervised domain adaptation unit 120. Suppose, for example, that the control unit 110 receives, through the communication interface unit 100, video frames for deep learning on captured images, as in an autonomous vehicle. In this case the control unit 110 performs deep learning on unlabeled target domain data, but it can raise the accuracy of that learning by adapting to, and making use of, the source domain data. Learning from unlabeled data is generally less accurate than learning from labeled data. Accordingly, although the embodiment performs an operation based on unsupervised learning, it can be regarded as performing an operation comparable to supervised learning on the target data.

FIG. 2 illustrates a source image used to generate source domain data, a target image, and a GT image. It shows that the source image has label information while the target image does not. The UDA setting is the problem of using labeled source domain data to train well on unlabeled target domain data. Embodiments of the present invention therefore concern UDA for the semantic segmentation task, which is the task of partitioning an input image by semantic class. The source domain dataset can be written as {x_s, y_s}, consisting of input images x_s and ground-truth (GT) labels y_s, and the target domain dataset can be written as {x_t}, consisting of input images x_t only, as FIG. 2 shows. The aim is to use the source domain data to improve the performance of the segmentation network on the target domain data; the term "unsupervised" is used because the target domain has no labels and therefore no supervision can be given for it. GT (ground truth) may be understood as the reference annotation against which predictions are compared.
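The two dataset forms, {x_s, y_s} and {x_t}, can be mirrored in a small data-structure sketch. The class names, the nested-list image representation, and the `check_shapes` helper are hypothetical and serve only to make the asymmetry between the domains explicit.

```python
from dataclasses import dataclass, fields
from typing import List

Image = List[List[float]]   # H x W grid; a stand-in for a real image tensor
LabelMap = List[List[int]]  # H x W grid of class indices

@dataclass
class SourceSample:
    """One element of the source domain dataset {x_s, y_s}."""
    x_s: Image
    y_s: LabelMap

@dataclass
class TargetSample:
    """One element of the target domain dataset {x_t}: an image, no label."""
    x_t: Image

def check_shapes(sample: SourceSample) -> bool:
    """A GT label map must give exactly one class index per pixel."""
    return len(sample.x_s) == len(sample.y_s) and all(
        len(row) == len(lab) for row, lab in zip(sample.x_s, sample.y_s))

src = SourceSample(x_s=[[0.1, 0.2], [0.3, 0.4]], y_s=[[0, 0], [1, 1]])
tgt = TargetSample(x_t=[[0.5, 0.6], [0.7, 0.8]])
```

Only `SourceSample` carries a label field, which is why supervised losses can be computed directly on the source domain but not on the target domain.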

The unsupervised domain adaptation unit 120 is controlled by the control unit 110 and may carry a program, or a module containing such a program, for the unsupervised domain adaptation operation in semantic segmentation exploiting inter-pixel correlations according to an embodiment of the present invention. Because the program or module may be implemented in software, hardware, or a combination of the two, embodiments of the present invention are not limited to software. The unsupervised domain adaptation unit 120 performs unsupervised domain adaptation for semantic segmentation. In doing so, it creates pseudo labels for an unlabeled target image, or for the domain of the target image, using a naively pre-trained program model (for example, one trained by prediction). Here, a model is a collection of program code, that is, a program built to perform a specific operation.

In the process of supervising the target domain with the generated pseudo labels, many holes (the black regions) appear, as shown in FIG. 3, and because the labels are values predicted by the program model they also contain many errors; the unsupervised domain adaptation unit 120 according to an embodiment of the present invention remedies this. That is, it addresses the problem that supervision cannot be given with 100% accurate values. To this end, the unit may use the pseudo-label-based self-training approach to UDA and, because pseudo labels are incomplete, it builds a self-attention module (SAM) that learns the correlations between pixels and trains that module only on the source domain, for which accurate supervision is available. The segmentation network is then trained so that its output follows the output of the trained SAM. Put differently, this can be viewed as synchronizing the output of the segmentation unit, which segments the target image, with the output of the SAM. Because the segmentation network thereby learns the inter-pixel correlations learned by the SAM, performance improves. Semantic segmentation is a task of predicting a high-level structure, so the relationships among classes (object categories) and among pixels are important. The SAM learns these inter-pixel correlations on the source domain.

In other words, self-training with pseudo-labels is widely used in UDA. Because pseudo-labels are incomplete, the embodiment of the present invention compensates for this shortcoming. Semantic segmentation is a task of predicting a high-level structure, so the relationships among classes (for example, person, animal, and so on) and among pixels are important. Accordingly, the embodiment constructs a self-attention module (SAM) that learns the correlations between pixels and trains the SAM only on the source domain, where accurate supervision is available. The output of the segmentation network is trained to follow the output of the (pre-)trained SAM. In this way, performance improves because the segmentation network learns the inter-pixel correlations learned by the SAM.

The embodiment of the present invention presents a method of transferring inter-pixel correlations from the source domain to the target domain through a self-attention module. The method may be carried out by the unsupervised domain adaptation unit 120 or the control unit 110. The module that executes the method takes the prediction of the segmentation network as input and produces a self-attended prediction that correlates (or links) similar pixels. The module is trained only on the source domain to learn domain-invariant inter-pixel correlations and is later used to train the segmentation network on the target domain. The network learns by following not only the pseudo-labels but also the output of the self-attention module, which provides additional information about inter-pixel correlations. As confirmed through extensive experiments, the proposed method considerably improves performance on two standard UDA benchmarks and can also be combined with recent state-of-the-art methods to achieve better performance.

The storage unit 130 may temporarily store data or information processed under the control of the control unit 110 and may store the analysis results, that is, the analysis data, of the unsupervised domain adaptation unit 120. Here, information refers to control commands and the like, but because the two terms are used interchangeably in practice, they are not strictly distinguished. For example, the storage unit 130 may classify and store the regions of source data and target data according to the embodiment of the present invention, and may output the relevant data at the request of the unsupervised domain adaptation unit 120. The source data stored in the storage unit 130 may serve as a kind of dictionary: images provided in advance for training are segmented by meaning into classes, that is, categories such as person, animal, and car, and stored in advance. Of course, the dictionary may instead be created in software inside the unsupervised domain adaptation unit 120, so the embodiment is not limited to the above.

The unsupervised domain adaptation apparatus 90 according to the embodiment of the present invention has a structure that learns from source images and target images simultaneously.

In addition to the above, the communication interface unit 100, the control unit 110, the unsupervised domain adaptation unit 120, and the storage unit 130 of FIG. 1 may perform various operations; further details are described later and are not repeated here.

The communication interface unit 100, the control unit 110, the unsupervised domain adaptation unit 120, and the storage unit 130 of FIG. 1 according to the embodiment of the present invention are configured as hardware modules physically separated from one another, but each module may store and execute software for performing the operations described above. That software is a set of software modules, and each of those modules may equally be formed in hardware, so the embodiment is not limited to a software or a hardware configuration. For example, the storage unit 130 may be hardware such as storage or memory. However, information may also be stored in software (a repository), so the embodiment is not limited to the above.

Meanwhile, as another embodiment of the present invention, the control unit 110 may include a CPU and a memory and may be formed as a single chip. The CPU includes a control circuit, an arithmetic logic unit (ALU), an instruction interpreter, registers, and the like, and the memory may include RAM. The control circuit performs control operations; the ALU performs operations on binary bit information; the instruction interpreter, which may include an interpreter or a compiler, converts a high-level language into machine language and machine language into a high-level language; and the registers may be involved in software data storage. With this configuration, for example, at the start of operation of the unsupervised domain adaptation apparatus 90, the program stored in the unsupervised domain adaptation unit 120 is copied, loaded into the memory (RAM), and then executed, which increases the data processing speed. A deep learning model may instead be loaded into GPU memory rather than RAM and executed with GPU acceleration.

FIG. 4 is a diagram showing the framework of the self-attention module according to the embodiment of the present invention, FIG. 5 is a diagram for explaining the training of the self-attention module and the segmentation network, and FIG. 6 is a diagram comparing attention and prediction visualizations.

As shown in FIG. 4, the unsupervised domain adaptation unit 130 of FIG. 1 according to the embodiment of the present invention, or the self-attention module 130' included in it, may be called a self-attention unit and may be composed of a segmentation network (unit) 400 and a self-attention unit 410. In FIG. 4, the gray shaded region in the center is the self-attention module, and the operator symbol shown in the figure denotes a matrix product. The module takes the output z of the segmentation network and builds an attention map in order to apply self-attention to z. z' is the new self-attended output of the module and carries information about inter-pixel correlations. Here, "self-attention" may refer to a method of computing the response at a given position by interacting with all positions in, for example, an image and attending more to the more salient positions.

In the UDA setting, the data set {Xs, Ys} of the source domain is assumed to be labeled and the data set {Xt} of the target domain to be unlabeled. The main goal of UDA for semantic segmentation is to train the segmentation network g so that it performs well on the target domain by transferring the knowledge learned on the source domain. The two domains are assumed to share the same C classes, which is what makes adaptation between them possible. For the source domain, the categorical cross-entropy loss with the given ground truth (GT) labels is generally used to train the network, as expressed in <Equation 1>.

Here, the prediction term denotes the softmax probability that pixel i belongs to class c. The label is a one-hot vector whose entry for pixel i and class c is 1 if the GT class at pixel i is c, and 0 otherwise. For the target domain, however, no labels exist, so pseudo-labels are generated to perform self-training. The pseudo-labels are based on the confident pixels predicted by the trained network: a pseudo-label is assigned only to pixels whose confidence is higher than a given threshold. This can be expressed as <Equation 2> and <Equation 3>.

The threshold T^c is set differently for each class c; how it is determined is already known to those skilled in the art, so further description is omitted. Using the generated pseudo-labels, the network is also trained on the target domain in a supervised manner according to <Equation 3>. The total segmentation loss is the sum of <Equation 1> and <Equation 3>. Network performance on the target domain can be improved considerably in this way, and the embodiment of the present invention improves it further by applying a new self-attention module (SAM) as shown in FIG. 4.
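The pseudo-label step described above can be illustrated with a minimal, dependency-free Python sketch. It assumes that a pixel receives its most probable class as a pseudo-label only when that probability exceeds the class threshold T^c, and that the segmentation loss is a cross-entropy summed over the labeled pixels. The names `IGNORE`, `pseudo_labels`, and `cross_entropy`, the strict inequality, and the use of a sum rather than an average are illustrative assumptions and are not taken from the equations of the embodiment.

```python
import math

IGNORE = -1  # hypothetical marker for pixels that receive no pseudo-label


def pseudo_labels(probs, thresholds):
    """Assign the most probable class to each pixel if it is confident enough.

    probs:      one list of softmax class probabilities per pixel
    thresholds: one threshold per class (the per-class T^c)
    """
    labels = []
    for p in probs:
        best = max(range(len(p)), key=lambda c: p[c])
        # Assumption: strict ">" comparison against the class threshold.
        labels.append(best if p[best] > thresholds[best] else IGNORE)
    return labels


def cross_entropy(probs, labels):
    """Sum -log p(label) over the pixels that carry a label."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels) if y != IGNORE)


# Three target pixels, two classes, thresholds chosen only for illustration.
probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
labels = pseudo_labels(probs, thresholds=[0.8, 0.6])
target_loss = cross_entropy(probs, labels)
```

In this toy input the second pixel stays unlabeled because its best probability (0.6) does not exceed the threshold of its class (0.8), which mirrors the sparse, hole-ridden pseudo-labels the description refers to.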

As shown in FIG. 4, the embodiment of the present invention learns the correlations between pixels through the SAM, strengthens the prediction of the segmentation network 400 by self-attending it, and designs a loss that makes the prediction follow the strengthened value, so that the relationships between pixels are learned. The SAM is trained only on the source domain, which has reliable GT labels, so that it learns the inter-pixel correlations well. In FIG. 4, z is the final output of the segmentation network 400, and the gray rounded rectangle corresponds to the SAM, which may also be called the self-attention unit 410. The SAM takes z as input and passes it through a 1×1 Conv layer to transform it into another feature space. The transformed value, which is Conv(z) in FIG. 4 and A in <Equation 4>, is then used to generate the attention map according to <Equation 4>.

Here, · denotes the matrix product, the norm term denotes the row-wise L2 norm of A taken along the channel dimension, and the division is performed element-wise. Each row of M shows how similar each pixel is to the other pixels in terms of cosine similarity, so its elements are normalized between -1 and 1. In the embodiment of the present invention, a ReLU activation is applied to M to remove negative similarities, and each row is then L1-normalized so that the elements of a row sum to 1. Here, ReLU stands for rectified linear unit. The result after ReLU activation and normalization is denoted M', the attention map. The element in the i-th row and j-th column of M' can be computed as in <Equation 5>.

The attention map M' represents the correlations between the predicted pixels. Each row of the attention map indicates, based on cosine similarity, how similar a pixel is to every pixel, including itself. Cosine similarity is the similarity of two vectors obtained from the cosine of the angle between them. It is 1 when the two vectors point in exactly the same direction, 0 when they form a 90° angle, and -1 when they point in opposite directions (180°). Cosine similarity therefore ranges from -1 to 1, and the closer the value is to 1, the higher the similarity. Intuitively, it measures how similar the directions of the two vectors are.
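The three reference cases of cosine similarity mentioned above (same direction, right angle, opposite direction) can be checked with a short Python sketch. The function name `cosine_similarity` and the example vectors are chosen here for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


same = cosine_similarity([1.0, 0.0], [2.0, 0.0])        # same direction: 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # 90 degrees: 0.0
opposite = cosine_similarity([1.0, 0.0], [-4.0, 0.0])   # 180 degrees: -1.0
```

The vector lengths differ in each pair on purpose: only the direction affects the result.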

In the embodiment of the present invention, a ReLU activation removes all negative similarities, and L1 normalization is applied row by row. The resulting attention map is matrix-multiplied with the input z to produce the final z'. <Equation 5> expresses applying ReLU to the attention map and L1-normalizing it, and <Equation 6> expresses the matrix multiplication of the input z with the attention map M' that yields the self-attended z'.
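The chain from z to z' can be sketched in plain Python as follows. The sketch follows the prose description only: project z with a 1×1 convolution, compute the cosine similarity between all pixel pairs, apply ReLU, L1-normalize each row to obtain M', and multiply M' with z. The exact form of <Equation 4> to <Equation 6> is not reproduced here, so the layout (pixels as rows, classes as columns), the bias-free per-pixel linear map standing in for the 1×1 Conv layer, the identity weights, and all function names are assumptions made for illustration.

```python
import math


def conv1x1(z, weight):
    """Per-pixel linear map, equivalent to a 1x1 convolution without bias.

    z:      N pixels, each a list of C values
    weight: K output channels, each a list of C coefficients (hypothetical)
    """
    return [[sum(w * v for w, v in zip(row_w, pixel)) for row_w in weight]
            for pixel in z]


def attention_map(a):
    """Cosine similarity of all pixel pairs, then ReLU and row-wise L1 norm."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in a]
    m_prime = []
    for i, row_i in enumerate(a):
        sims = []
        for j, row_j in enumerate(a):
            dot = sum(x * y for x, y in zip(row_i, row_j))
            denom = norms[i] * norms[j]
            cos = dot / denom if denom > 0 else 0.0
            sims.append(max(cos, 0.0))  # ReLU removes negative similarity
        total = sum(sims)
        # L1 normalization: the entries of each row sum to 1.
        m_prime.append([s / total for s in sims] if total > 0 else sims)
    return m_prime


def self_attend(m_prime, z):
    """z' = M' z: each pixel becomes a weighted sum of all pixels' outputs."""
    num_classes = len(z[0])
    return [[sum(w * pixel[k] for w, pixel in zip(weights, z))
             for k in range(num_classes)]
            for weights in m_prime]


z = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # 3 pixels, 2 classes
identity = [[1.0, 0.0], [0.0, 1.0]]        # hypothetical 1x1 conv weights
m_prime = attention_map(conv1x1(z, identity))
z_att = self_attend(m_prime, z)            # the self-attended output z'
```

Because every row of M' sums to 1, each row of z' is a convex combination of the rows of z, weighted by how similar the other pixels are to the pixel in question.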

The SAM according to the embodiment of the present invention is trained only on the source domain.

When the SAM is trained, the segmentation network 400 and the SAM are trained jointly, and a skip connection (z + z' = z") is added so that the segmentation network 400 learns to produce a good representation and the SAM learns from that well-learned representation. Viewed from a single convolution layer, a skip connection adds the input of the layer to the output of the layer and passes the sum to the next layer. It is called a skip connection because the input of the layer reaches the next layer without passing through that layer.

When this training is finished, only the SAM is kept for use in the actual domain adaptation training, and the segmentation network 400 is discarded. This corresponds to (a) of FIG. 5, which shows that the module is trained only on the source domain so that it learns accurate pixel relationships from the GT labels of the source domain. (b) of FIG. 5 shows the source-trained self-attention module being used in the domain adaptation training of the segmentation network. The segmentation network 400 is trained with the self-training scheme while z is made to follow the module output z' or z". L_att can be computed for both domains.

In more detail, the actual adaptation training of the segmentation network 400 on the target domain takes place in step (b) of FIG. 5. This is where the segmentation network 400 is trained, and it is the network later used for performance measurement. The SAM trained in step (a) is used for the actual domain adaptation in step (b), where it is kept frozen. z passes through the SAM to produce z', and a loss is computed that makes z follow z' or z". In the embodiment of the present invention this loss may be called the self-attention loss. Whether z follows z' or z" is set differently depending on the source domain. <Equation 7> and <Equation 8> are the expressions for the self-attention loss.

<Equation 7> is the loss that minimizes the difference between z and z", and <Equation 8> is the loss that minimizes the difference between z and z'.
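The two variants of the self-attention loss can be illustrated with the following Python sketch. The distance used in <Equation 7> and <Equation 8> is not visible in this text, so the mean squared difference below is purely an assumption, as are the function names and the toy values; only the skip connection z" = z + z' and the choice of target (z" or z') come from the description.

```python
def skip_connection(z, z_att):
    """z'' = z + z', added element-wise."""
    return [[a + b for a, b in zip(row_z, row_a)]
            for row_z, row_a in zip(z, z_att)]


def attention_loss(z, target):
    """Mean squared difference between z and the value it should follow.

    Assumption: the squared-error form is chosen only for this sketch.
    """
    diffs = [(a - b) ** 2
             for row_z, row_t in zip(z, target)
             for a, b in zip(row_z, row_t)]
    return sum(diffs) / len(diffs)


z = [[1.0, 2.0], [3.0, 4.0]]        # segmentation output (2 pixels, 2 classes)
z_att = [[1.0, 1.0], [1.0, 1.0]]    # stand-in for the frozen module's output z'
loss_to_z_att = attention_loss(z, z_att)                       # z follows z'
loss_to_z_skip = attention_loss(z, skip_connection(z, z_att))  # z follows z''
```

In an actual training loop the target would be produced by the frozen module and only the segmentation network would receive updates; that part is outside the scope of this sketch.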

The final loss term can be expressed as <Equation 9> or <Equation 10>. <Equation 9> applies the self-attention loss to both the source and target domains, whereas <Equation 10> applies it only to the target domain.

L^S_seg and L^T_seg denote the source-domain segmentation loss and the target-domain segmentation loss computed with pseudo-labels, respectively.
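One plausible reading of the two final losses, written out only as a sketch, is given below. The symbols L^S_att and L^T_att for the source-domain and target-domain self-attention losses, the name L_total, and the placement of a weight λ on the self-attention terms are assumptions; the text states only which domains receive the self-attention loss in each case.

```latex
\begin{align}
L_{\text{total}} &= L^{S}_{seg} + L^{T}_{seg}
                    + \lambda \left( L^{S}_{att} + L^{T}_{att} \right)
                    && \text{(self-attention loss on both domains)} \\
L_{\text{total}} &= L^{S}_{seg} + L^{T}_{seg}
                    + \lambda \, L^{T}_{att}
                    && \text{(self-attention loss on the target domain only)}
\end{align}
```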

As described above, the self-attention loss proposed in the embodiment of the present invention is a way of learning the correlations between pixels; it assists self-training with pseudo-labels and draws further performance out of the model.

Next, the experiments related to the embodiment of the present invention are described in more detail.

For the experiments, UDA was evaluated with two source-domain data sets, GTA5 and SYNTHIA, which are realistic synthetic data sets collected from virtual game scenes. The target domain is Cityscapes, which consists of real scenes in driving scenarios. GTA5 contains 24,966 images and shares 19 classes with Cityscapes, whereas SYNTHIA contains 9,400 images and shares 16 classes with Cityscapes. Results on 13 common classes are additionally reported for SYNTHIA. Cityscapes contains 2,975 training images and 500 validation images. Following the standard protocol of prior work, the experiments were evaluated on the validation set. The validation set is not drawn from the training set but is constructed independently; it shares classes with the training set but contains different images. Its purpose is to test how well a model trained on the training set performs on data it did not see during training.

The experiments were mainly performed using DeepLabV2 with a ResNet101 backbone; experiments were also performed using FCN-8 with a VGG16 backbone. The backbone of the network was initialized with ImageNet-pretrained weights. An SGD optimizer with an initial learning rate of 2.5 × 10^-4 and a weight decay of 0.0005 was used. The learning rate was scheduled with the 'poly' learning rate policy with a power of 0.9. In the self-attention module training stage, the segmentation network and the SAM were trained jointly by the same optimizer. For the domain adaptation training of the segmentation network, the pre-trained self-attention module stored at the last save point was used. The network was trained for 120,000 iterations with a batch size of 1, and was tested and saved every 2,000 iterations. The hyperparameter λ was set empirically to 0.1. To further narrow the domain gap at the image level, source images style-transferred to Cityscapes were used.
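The 'poly' learning rate policy is not written out in the description; the sketch below uses its commonly cited form, in which the initial rate is multiplied by (1 - iteration / max_iterations) raised to the given power. That formula and the function name `poly_lr` are assumptions; the initial rate of 2.5 × 10^-4, the power of 0.9, and the 120,000 iterations are the values stated above.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Learning rate at a given iteration under the 'poly' decay policy."""
    if not 0 <= iteration <= max_iterations:
        raise ValueError("iteration must lie in [0, max_iterations]")
    return base_lr * (1.0 - iteration / max_iterations) ** power


# Rates at the start, the midpoint, and the end of a 120,000-iteration run.
schedule = [poly_lr(2.5e-4, it, 120_000) for it in (0, 60_000, 120_000)]
```

With a power below 1 the rate decays slightly more slowly than a linear ramp for most of the run and then drops to zero at the final iteration.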

<Table 1> presents the ablation study and shows that the method according to the embodiment of the present invention performs better than training with pseudo-labels alone. The numbers are the mIoU over 19 classes for GTA5 → CS and the mIoU over 16 classes for SYNTHIA → CS. The method also performs better than training the SAM without the Conv layer or without the skip connection. In <Table 1>, the red numbers (8) to (11) indicate the cases in which <Equation 7> to <Equation 10> are applied, respectively.

<Table 2> shows the results of the iterative training experiment, in which the trained model is used to create new pseudo-labels for retraining, the retrained model is used to create new pseudo-labels again, and so on. With pseudo-labels alone, performance does not improve or improves only slowly over successive generations, whereas the method according to the embodiment of the present invention shows larger gains and trains better. In <Table 2>, the best results are shown in bold. mIoU 19 and mIoU 16 are used for GTA5 → CS and SYNTHIA → CS, respectively.

<Table 3> and <Table 4> show the results of comparison with other methods. The embodiment of the present invention outperforms all of the compared methods except ProDA. For ProDA, training with the added self-attention loss proposed in the embodiment yielded better performance. <Table 3> compares results with other methods on GTA5 → Cityscapes; the numbers in bold are the best score in each column. <Table 4> shows the comparison with other methods on SYNTHIA → Cityscapes; the numbers in bold are again the best score in each column. mIoU* and mIoU denote the mIoU over 13 classes and over 16 classes, respectively. In <Table 3> and <Table 4>, the yellow-green numbers refer to the prior art cited in the paper related to the embodiment and may be ignored.
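For readers unfamiliar with the metric, mean intersection-over-union (mIoU) and its restriction to a subset of classes (as with mIoU* over 13 of the 16 classes) can be sketched as follows. The flat label lists, the three-class toy example, and the function names are illustrative assumptions; which classes make up the 13-class subset is not stated in this text, so the subset indices below are hypothetical.

```python
def per_class_iou(pred, gt, num_classes):
    """IoU of each class over flat lists of predicted and true pixel labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        ious.append(inter / union if union else None)  # None: class absent
    return ious


def mean_iou(ious, classes=None):
    """Average IoU over all classes, or over a chosen subset of class indices."""
    selected = ious if classes is None else [ious[c] for c in classes]
    valid = [v for v in selected if v is not None]
    if not valid:
        raise ValueError("no class with a defined IoU")
    return sum(valid) / len(valid)


pred = [0, 0, 1, 1, 2]
gt = [0, 1, 1, 1, 2]
ious = per_class_iou(pred, gt, num_classes=3)
miou_all = mean_iou(ious)                     # analogous to mIoU over all classes
miou_subset = mean_iou(ious, classes=[0, 2])  # analogous to a subset such as mIoU*
```

Benchmark implementations usually accumulate the intersection and union counts over the whole validation set before dividing; the sketch does the same thing for a single short list of pixels.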

Meanwhile, in FIG. 6, each row corresponds to one class and visualizes how similar the other pixels are to the most confident pixel of that class. With pseudo-labels alone, similarity is also high for pixels of other classes, whereas with the method according to the embodiment of the present invention (Ours), similarity is high only for pixels of the same class. Redder colors indicate higher similarity, and bluer colors lower similarity. The images above the blue dotted line are attention visualizations, and the images below it are prediction visualizations. In the attention visualization, the images to the left of the red dotted line are results for GTA5 → Cityscapes, and those to the right are results for SYNTHIA → Cityscapes. From top to bottom, the rows of the attention visualization show 'pole', 'light', 'sign', 'vege.', and 'person'. A single image is used to show the differences between classes and methods clearly.

In conclusion, the embodiment of the present invention has described a method of transferring domain-invariant inter-pixel correlations from the source domain to the target domain by using a self-attention module pre-trained on the source domain. The method is very simple and does not require complex, heuristic algorithms (for example, hyperparameter tuning), yet it is very effective. The method can also support self-training by overcoming the defects of noisy pseudo-labels with the additional information provided by the SAM. Furthermore, it can easily be added on top of other methods to improve performance further. When combined with a recent SOTA method, the proposed method can set new SOTA scores on the two UDA benchmarks.

Taken together, semantic segmentation requires a model that predicts highly structured pixel-level output containing closely related spatial and local information. For example, sky pixels are usually located in the upper region of the image, car pixels lie above road pixels, and sidewalk and fence pixels are usually at the sides of the road. Such inter-pixel relationships are domain-invariant properties that are consistent across different domains. The embodiment of the present invention therefore proposes transferring this domain-invariant knowledge of inter-pixel correlations from the source domain to the target domain. Providing this additional knowledge (or information) helps to overcome and compensate for the defects of sparse and incorrect pseudo-labels.

To capture these inter-pixel correlations effectively and transfer them to the target domain, a new self-attention module (SAM) is presented. The module receives the prediction of the segmentation network and outputs a new self-attended prediction. It generates an attention map that captures the similarity between pixels and uses this map to compute a weighted sum of the given predictions. The proposed algorithm consists of two stages.

The first stage trains the self-attention module on the source domain only, because the source domain has ground truth (GT) labels that can correctly guide the learning of domain-invariant inter-pixel correlations. The second stage transfers the learned knowledge to the target domain by training the actual segmentation network with the trained self-attention module. In this stage the attention module is frozen, and the main segmentation network is the only part that is trained. The module takes the predictions of the segmentation network as input and outputs new self-attended predictions. The segmentation network is then trained to follow these self-attended predictions. In the embodiment of the present invention, this is called the 'self-attention loss'. Overall, the segmentation network is trained with both the self-attention loss and the self-training loss.
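
The second-stage objective described above, which combines a self-training loss with a self-attention loss, can be sketched as follows. The exact loss forms are not given in this passage, so everything here is an assumption for illustration: cross-entropy against pseudo-labels for self-training, a mean absolute difference for the self-attention loss, the `IGNORE` marker for unlabeled pixels, and the `weight` parameter are all hypothetical.

```python
import math

IGNORE = -1  # hypothetical marker for pixels that carry no pseudo-label


def self_training_loss(preds, pseudo_labels):
    """Mean cross-entropy over the pixels that have a pseudo-label."""
    losses = [
        -math.log(max(p[y], 1e-12))
        for p, y in zip(preds, pseudo_labels)
        if y != IGNORE
    ]
    return sum(losses) / len(losses) if losses else 0.0


def self_attention_loss(preds, attended):
    """Mean absolute difference between the network's predictions and the
    self-attended predictions, which are treated as a fixed target."""
    total = sum(abs(a - b) for p, q in zip(preds, attended) for a, b in zip(p, q))
    return total / (len(preds) * len(preds[0]))


def total_loss(preds, attended, pseudo_labels, weight=1.0):
    """Self-training loss plus the (assumed) weighted self-attention loss."""
    return self_training_loss(preds, pseudo_labels) + weight * self_attention_loss(
        preds, attended
    )
```

In this sketch `attended` plays the role of the frozen module's output: it is a constant target, so only the segmentation network's predictions `preds` would be adjusted to reduce the loss.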

In the embodiment of the present invention, extensive experiments were performed on two standard UDA benchmarks, GTA5→Cityscapes and SYNTHIA→Cityscapes. The experiments showed that a network trained with the proposed method outperforms a network trained with self-training alone. In particular, the method according to the embodiment showed significant improvement on rare classes that seldom appear in the dataset. The method also does not require intricate hand-designed algorithms; instead, it lets the module learn and decide which knowledge to transfer, so it is simple and needs less human intervention. Furthermore, the method can be applied to other state-of-the-art self-training UDA methods to improve their performance further. When the method was combined with ProDA, one of the recent SOTA methods, performance gains that set new SOTA scores were observed on both UDA benchmarks tested.

FIG. 7 is a flowchart illustrating the driving process of an unsupervised domain adaptation apparatus according to an embodiment of the present invention.

For convenience of explanation, referring to FIG. 7 together with FIG. 1, the unsupervised domain adaptation apparatus 90 according to an embodiment of the present invention uses the label information of a labeled source image to subdivide the domains included in the source image (e.g., regions containing the sky, sidewalks or fences, cars, and the like) by meaning, and stores the resulting analyzed feature information as source domain data (S700). Here, the source domain data may be organized so that the feature information of each domain is matched with its label information.
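
One possible reading of source domain data "in which feature information and label information are matched" is a mapping from each label to the features observed under it. The sketch below illustrates only that reading; the function name `build_source_domain_data`, the per-pixel feature vectors, and the dictionary layout are assumptions, not the storage format of the embodiment.

```python
from collections import defaultdict


def build_source_domain_data(features, labels):
    """Group feature vectors by their ground-truth label.

    features: list of per-pixel (or per-region) feature vectors.
    labels:   list of ground-truth labels, one per feature vector.
    Returns a dict mapping each label to the list of features carrying it.
    """
    grouped = defaultdict(list)
    for feature, label in zip(features, labels):
        grouped[label].append(feature)
    return dict(grouped)


# Example: two 'sky' features and one 'road' feature.
source_domain_data = build_source_domain_data(
    [[0.1], [0.2], [0.9]], ["sky", "sky", "road"]
)
```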

In addition, the unsupervised domain adaptation apparatus 90 uses a learning model trained on the (pre-)stored source domain data to generate (e.g., by prediction) pseudo-label information for a target image that has no label information. It then performs learning by adapting the domain of the target image to the (pre-)stored source domain data, compensating for the erroneous parts of the generated pseudo-label information on the basis of the inter-pixel correlations learned from the source domain data (S710).
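
Pseudo-label generation by prediction can be sketched as follows. The passage does not say how predictions are turned into labels, so this example assumes a common confidence-threshold scheme: a pixel receives its most likely class only if the model is confident enough, which is one way pseudo-labels end up sparse. The `threshold` value, the `IGNORE` marker, and the function name are hypothetical.

```python
IGNORE = -1  # hypothetical marker for pixels left without a pseudo-label


def make_pseudo_labels(preds, threshold=0.9):
    """Assign each pixel its most likely class if confident, else IGNORE.

    preds: list of per-pixel class-probability vectors from a model trained
    on the source domain.
    """
    labels = []
    for probs in preds:
        best = max(range(len(probs)), key=lambda c: probs[c])
        labels.append(best if probs[best] >= threshold else IGNORE)
    return labels
```

Pixels marked `IGNORE` contribute nothing to a self-training loss, which is why additional information about inter-pixel correlations is useful for them.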

To this end, the unsupervised domain adaptation apparatus 90 may include a self-attention module for learning the inter-pixel correlations. This module is trained only on the source domain, which has GT label information, that is, information with fixed and invariant properties such as the sky, sidewalks, and cars mentioned above, so that the inter-pixel correlations are learned well.

In addition to the above, the unsupervised domain adaptation apparatus 90 of FIG. 1 can perform various other operations. Since the details have been fully described earlier, that description applies here as well.

Meanwhile, although all the components constituting an embodiment of the present invention have been described as being combined into one or operating in combination, the present invention is not necessarily limited to such an embodiment. That is, within the scope of the object of the present invention, one or more of the components may be selectively combined to operate. In addition, although each of the components may be implemented as a single independent piece of hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or more pieces of hardware. The code and code segments constituting the computer program can be readily deduced by those skilled in the art to which the present invention pertains. Such a computer program may be stored in a non-transitory computer-readable medium and read and executed by a computer, thereby implementing an embodiment of the present invention.

Here, a non-transitory readable recording medium means a medium that stores data semi-permanently and can be read by a device, as opposed to a medium that stores data only for a short time, such as a register, cache, or memory. Specifically, the programs described above may be provided stored on a non-transitory readable recording medium such as a CD, DVD, hard disk, Blu-ray disc, USB storage device, memory card, or ROM.

Although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described. Various modifications can of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or outlook of the present invention.

100: communication interface unit 110: control unit
120: unsupervised domain adaptation unit 130: storage unit

Claims

1. A device for unsupervised domain adaptation in semantic segmentation exploiting inter-pixel correlations, the device comprising:
a storage unit which subdivides domains included in a source image by meaning by using label information of the source image, the source image having label information, and stores the resulting analyzed feature information as source domain data; and
a control unit which generates pseudo-label information of a target image having no label information by using a learning model trained on the stored source domain data, and performs learning by adapting the domain of the target image to the stored source domain data in a manner that compensates for an error part of the generated pseudo-label information based on a learning result of inter-pixel correlations learned using the source domain data,
wherein the control unit synchronizes data by linking the output of a segmentation unit, which subdivides the domain of the target image by meaning, with the output of a self-attention unit, which learns the inter-pixel correlations.

2. The device of claim 1,
wherein the control unit, when learning the inter-pixel correlations using the source domain data, learns the inter-pixel correlations only for source domain data having GT (ground truth) label information.

3. The device of claim 2,
wherein the control unit uses the GT labels of the source domain, taking information having a fixed and invariant attribute as the GT label information.

4. The device of claim 1,
wherein the control unit generates target domain data by matching the domains of the target image subdivided by meaning with the pseudo-label information in which the error part has been changed, and uses the generated target domain data when learning a new target image.

5. (Deleted)

6. A method of driving a device for unsupervised domain adaptation in semantic segmentation exploiting inter-pixel correlations, the method comprising:
subdividing, by a storage unit, domains included in a source image by meaning by using label information of the source image, the source image having label information, and storing the resulting analyzed feature information as source domain data; and
generating, by a control unit, pseudo-label information of a target image having no label information by using a learning model trained on the stored source domain data, and performing learning by adapting the domain of the target image to the stored source domain data in a manner that compensates for an error part of the generated pseudo-label information based on a learning result of inter-pixel correlations learned using the source domain data,
wherein the performing of the learning comprises synchronizing data by linking the output of a segmentation unit, which subdivides the domain of the target image by meaning, with the output of a self-attention unit, which learns the inter-pixel correlations.

7. The method of claim 6,
wherein the performing of the learning comprises,
when learning the inter-pixel correlations using the source domain data, learning the inter-pixel correlations only for source domain data having GT (ground truth) label information.

8. The method of claim 7,
wherein the performing of the learning comprises
using the GT labels of the source domain, taking information having a fixed and invariant attribute as the GT label information.

9. The method of claim 6,
wherein the performing of the learning comprises:
generating target domain data by matching the domains of the target image subdivided by meaning with the pseudo-label information in which the error part has been changed; and
using the generated target domain data when learning a new target image.

10. (Deleted)