KR102607748B1

KR102607748B1 - Apparatus and method for image analysis applying multi-task adaptation

Info

Publication number: KR102607748B1
Application number: KR1020220088899A
Authority: KR
Inventors: 최종원; 정하욱; 신준섭; 강영욱
Original assignee: 중앙대학교 산학협력단
Priority date: 2022-07-19
Filing date: 2022-07-19
Publication date: 2023-11-29

Abstract

An image analysis apparatus and method for extracting depth information and semantic segmentation information from unlabeled image data are disclosed. An image analysis apparatus according to an embodiment may include an input unit that receives input image data, and an analysis unit that generates a depth map and a semantic segmentation map from the input image data.

Description

Apparatus and method for image analysis applying multi-task adaptation

The present disclosure relates to an image analysis apparatus and method for extracting depth information and semantic segmentation information from unlabeled image data.

As interest in autonomous driving has grown, a variety of datasets covering different task types and capture environments have emerged. However, collecting real-world data for autonomous driving requires expensive capture vehicles in addition to high labeling costs. Moreover, because dataset size is limited, it is difficult to cover large-scale real-world autonomous driving environments.

To overcome these problems, research is being conducted on unsupervised domain adaptation (UDA), which uses a labeled domain to achieve satisfactory performance on an unlabeled domain for the same task.

Korean Patent Registration No. 10-2169243 (2020.10.23)

An object is to provide an image analysis apparatus and method for extracting depth information and semantic segmentation information from unlabeled image data.

According to one aspect, an image analysis apparatus may include an input unit that receives input image data, and an analysis unit that generates a depth map and a semantic segmentation map from the input image data.

The analysis unit may include: an encoder unit including an encoder that extracts feature information from the input image data; a bottleneck unit including a first bottleneck module that generates depth feature information based on the encoder's feature information, a second bottleneck module that generates semantic segmentation feature information based on the encoder's feature information, and a third bottleneck module that generates reconstruction feature information based on the encoder's feature information; an attention unit including a first attention module for calculating the correlation (task correlation) between the depth feature information and the semantic segmentation feature information received from the bottleneck unit, a second attention module for calculating the correlation between the depth feature information and the reconstruction feature information, and a third attention module for calculating the correlation between the semantic segmentation feature information and the reconstruction feature information; and a reconstruction unit including a depth map decoder that generates the depth map using a combination of at least two of the feature information received from the bottleneck unit and the task correlations received from the attention unit, a semantic segmentation map decoder that generates the semantic segmentation map, and reconstruction decoders for reconstructing each of the depth domain, the semantic segmentation domain, and the input image domain.

The depth map decoder may generate the depth map using the depth feature information received from the bottleneck unit, together with the correlation between the depth feature information and the semantic segmentation feature information and the correlation between the depth feature information and the reconstruction feature information, both received from the attention unit.

The semantic segmentation map decoder may generate the semantic segmentation map using the semantic segmentation feature information received from the bottleneck unit, together with the correlation between the semantic segmentation feature information and the depth feature information and the correlation between the semantic segmentation feature information and the reconstruction feature information, both received from the attention unit.

The analysis unit may be trained by supervised learning using labeled depth domain training data, labeled semantic segmentation domain training data, and unlabeled input image domain training data. Here, supervised learning on the unlabeled input image domain may be performed through a reconstruction task.

The depth map decoder may be trained by supervised learning based on the generated depth map and the labels of the depth domain training data. The semantic segmentation map decoder may be trained by supervised learning based on the generated semantic segmentation map and the labels of the semantic segmentation domain training data. Each of the reconstruction decoders for the depth domain, the semantic segmentation domain, and the input image domain may be trained by supervised learning based on the reconstructed data and, respectively, the depth domain training data, the semantic segmentation domain training data, and the input image domain training data. Here, the input image domain is unlabeled.

After the supervised learning, the analysis unit may be trained by unsupervised learning based on unlabeled mixed training data generated by combining the depth domain training data, the semantic segmentation domain training data, and the input image domain training data.

The mixed training data may be generated by combining depth mixed training data, produced by applying a depth mixing mask to the depth domain training data, with semantic segmentation mixed training data, produced by applying a semantic segmentation mixing mask to the semantic segmentation domain training data, and then filling the remaining region with the input image domain training data.

In a region where the depth mixed training data, produced by applying the depth mixing mask to the depth domain training data, and the semantic segmentation mixed training data, produced by applying the semantic segmentation mixing mask to the semantic segmentation domain training data, overlap, the data having the closer depth, among the depth of the depth mixed training data and the depth of the semantic segmentation mixed training data, may be selected and combined.
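The overlap rule described above can be illustrated with a minimal sketch. It assumes that a per-pixel depth value is available for both images, that a smaller value means closer to the camera, and that ties go to the depth-domain image; the function name `resolve_overlap` and the nested-list image layout are hypothetical and not taken from the patent.

```python
def resolve_overlap(mask_d, mask_s, depth_d, depth_s):
    """Return (mask_d, mask_s) adjusted so that no pixel is claimed by both.

    Where both masks are set, the pixel with the smaller (closer) depth
    is kept. Ties go to the depth-domain image (an assumption).
    """
    out_d, out_s = [], []
    for md_row, ms_row, dd_row, ds_row in zip(mask_d, mask_s, depth_d, depth_s):
        row_d, row_s = [], []
        for md, ms, dd, ds in zip(md_row, ms_row, dd_row, ds_row):
            if md and ms:
                keep_d = dd <= ds  # closer depth wins
                row_d.append(1 if keep_d else 0)
                row_s.append(0 if keep_d else 1)
            else:
                # No overlap at this pixel: keep the masks unchanged.
                row_d.append(1 if md else 0)
                row_s.append(1 if ms else 0)
        out_d.append(row_d)
        out_s.append(row_s)
    return out_d, out_s
```

After this step the two masks are disjoint, so the masked regions can be pasted in any order.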

The analysis unit may be trained by unsupervised learning based on the difference between (i) pseudo depth labels and pseudo semantic segmentation labels generated by inputting the depth domain training data, the semantic segmentation domain training data, and the input image domain training data into an exponential moving average (EMA) model configured as the same model as the analysis unit, and (ii) the depth map and the semantic segmentation map generated by inputting the mixed training data into the analysis unit.
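As a hedged illustration of how an EMA model can track the analysis unit, the sketch below applies the standard exponential moving average update to a dictionary of scalar parameters. The decay value 0.99, the dictionary representation, and the names `ema_update`, `teacher`, and `student` are assumptions made for illustration; the patent does not specify them.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new EMA (teacher) parameters after one student update.

    teacher, student: dicts mapping parameter names to float values.
    Each teacher value moves a fraction (1 - decay) toward the student.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }
```

With a decay close to 1, the EMA model changes slowly, which tends to make the pseudo labels it produces more stable than the raw predictions of the model being trained.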

According to one aspect, an image analysis method may include an input step of receiving input image data, and an analysis step of generating a depth map and a semantic segmentation map from the input image data.

In the analysis step, an encoder may be used to extract feature information from the input image data; bottleneck modules may be used to generate depth feature information, semantic segmentation feature information, and reconstruction feature information based on the encoder's feature information; attention modules may be used to calculate the correlation (task correlation) between the depth feature information and the semantic segmentation feature information, the correlation between the depth feature information and the reconstruction feature information, and the correlation between the semantic segmentation feature information and the reconstruction feature information; and decoders may be used to generate the depth map, generate the semantic segmentation map, and reconstruct each of the depth domain, the semantic segmentation domain, and the input image domain, using a combination of at least two of the feature information and the task correlations.

The decoder for generating the depth map may generate the depth map using the depth feature information, the correlation between the depth feature information and the semantic segmentation feature information, and the correlation between the depth feature information and the reconstruction feature information.

The decoder for generating the semantic segmentation map may generate the semantic segmentation map using the semantic segmentation feature information, together with the correlation between the semantic segmentation feature information and the depth feature information and the correlation between the semantic segmentation feature information and the reconstruction feature information, both received from the attention unit.

The analysis step may be trained by supervised learning using labeled depth domain training data, labeled semantic segmentation domain training data, and unlabeled input image domain training data.

The decoder for generating the depth map may be trained by supervised learning based on the generated depth map and the labels of the depth domain training data. The decoder for generating the semantic segmentation map may be trained by supervised learning based on the generated semantic segmentation map and the labels of the semantic segmentation domain training data. Each of the reconstruction decoders for the depth domain, the semantic segmentation domain, and the input image domain may be trained by supervised learning based on the reconstructed data and, respectively, the depth domain training data, the semantic segmentation domain training data, and the input image domain training data.

After the supervised learning, the analysis step may be trained by unsupervised learning based on unlabeled mixed training data generated by combining the depth domain training data, the semantic segmentation domain training data, and the input image domain training data.

The analysis step may be trained by unsupervised learning based on the difference between (i) pseudo depth labels and pseudo semantic segmentation labels generated by inputting the depth domain training data, the semantic segmentation domain training data, and the input image domain training data into an exponential moving average (EMA) model configured as the same model as the analysis model, and (ii) the depth map and the semantic segmentation map generated by inputting the mixed training data into the analysis unit.

According to an embodiment, depth information and semantic segmentation information can be extracted from unlabeled image data.

Figure 1 is a configuration diagram of an image analysis apparatus according to an embodiment.
Figure 2 is a configuration diagram of an analysis unit according to an embodiment.
Figure 3 is an exemplary diagram for explaining a training method of an image analysis apparatus according to an embodiment.
Figure 4 is an exemplary diagram for explaining a method of generating mixed training data according to an example.
Figure 5 is a flowchart illustrating an image analysis method according to an embodiment.
Figure 6 is a block diagram illustrating a computing environment including a computing device suitable for use in exemplary embodiments.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the accompanying drawings. In describing the present invention, detailed descriptions of related known functions or configurations will be omitted where they could unnecessarily obscure the gist of the present invention. In addition, the terms used below are defined in consideration of their functions in the present invention and may vary depending on the intention or custom of users or operators. Therefore, their definitions should be based on the content of this specification as a whole.

Hereinafter, embodiments of the image analysis apparatus and method will be described in detail with reference to the drawings.

Figure 1 is a configuration diagram of an image analysis apparatus according to an embodiment.

According to an embodiment, the image analysis apparatus 100 may include an input unit 110 that receives input image data, and an analysis unit 120 that generates a depth map and a semantic segmentation map from the input image data.

As an example, the image analysis apparatus 100 may be trained using three types of domains: two virtual source domains and one real target domain. For example, the two virtual source domains may be image data labeled for semantic segmentation (Ds) and image data labeled for depth (Dd), respectively, and the real target domain may be image data captured in a real environment without label information (Dt). Accordingly, the image analysis apparatus 100 may be trained to perform multi-task prediction on the unlabeled real target data using the data labeled for semantic segmentation and the data labeled for depth. For example, the image analysis apparatus 100 may generate a depth map and a semantic segmentation map through multi-task prediction.

Figure 2 is a configuration diagram of an analysis unit according to an embodiment.

According to an embodiment, the analysis unit 120 may include an encoder unit 121 including an encoder that extracts feature information from the input image data.

According to an embodiment, the analysis unit 120 may include a bottleneck unit 123 including a first bottleneck module (bottleneck 1) that generates depth feature information based on the encoder's feature information, a second bottleneck module (bottleneck 2) that generates semantic segmentation feature information based on the encoder's feature information, and a third bottleneck module (bottleneck 3) that generates reconstruction feature information based on the encoder's feature information.

According to an example, the bottleneck modules may generate task-specific features for depth estimation, semantic segmentation, and reconstruction.

As an example, the first bottleneck module may receive feature information from the encoder and generate depth feature information.

According to an example, the depth feature information output by the first bottleneck module may be sent to an initial depth decoder (init depth decoder), which may generate an initial depth map. As an example, the depth map decoder included in the reconstruction unit 127 may be trained using a loss function calculated from the depth map generated by the depth map decoder and the initial depth map generated by the initial depth decoder.

As an example, the second bottleneck module may receive feature information from the encoder and generate semantic segmentation feature information.

According to an example, the semantic segmentation feature information output by the second bottleneck module may be sent to an initial semantic segmentation map decoder (init semantic decoder), which may generate an initial semantic segmentation map. As an example, the semantic segmentation map decoder included in the reconstruction unit 127 may be trained using a loss function calculated from the semantic segmentation map generated by the semantic segmentation map decoder and the initial semantic segmentation map generated by the initial semantic segmentation map decoder.

As an example, the third bottleneck module may receive feature information from the encoder and generate reconstruction feature information.

According to an example, the reconstruction feature information output by the third bottleneck module may be sent to an initial depth domain reconstruction decoder (DD Init Reconstruction Decoder), an initial semantic segmentation domain reconstruction decoder (SD Init Reconstruction Decoder), and an initial input image domain reconstruction decoder (UD Init Reconstruction Decoder).

As an example, the configuration and architecture of the initial decoders may be the same as those of the decoders included in the reconstruction unit 127.

According to an embodiment, the analysis unit 120 may include an attention unit 125 including a first attention module (attention 1) for calculating the correlation (task correlation) between the depth feature information and the semantic segmentation feature information received from the bottleneck unit 123, a second attention module (attention 2) for calculating the correlation between the depth feature information and the reconstruction feature information, and a third attention module (attention 3) for calculating the correlation between the semantic segmentation feature information and the reconstruction feature information.

According to an example, the task-specific features from the bottleneck modules are provided to the attention modules, whose outputs may be used as inputs to the decoders of the reconstruction unit. As an example, an attention module may be built from a self-attention block in which two vectors are multiplied together.
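The text only states that two vectors are multiplied together, so the following is one possible reading rather than the disclosed design: the two task feature vectors are multiplied elementwise and squashed with a sigmoid to obtain a correlation weight per element. The sigmoid, the elementwise (rather than matrix) product, and the name `task_correlation` are all assumptions made for illustration.

```python
import math

def task_correlation(feat_a, feat_b):
    """Elementwise product of two task feature vectors, squashed to (0, 1).

    feat_a, feat_b: equal-length lists of floats, e.g. depth features and
    semantic segmentation features from two bottleneck modules.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("feature vectors must have the same length")
    return [1.0 / (1.0 + math.exp(-(a * b))) for a, b in zip(feat_a, feat_b)]
```

Elements where both features are strongly activated with the same sign yield weights near 1, while elements where either feature is zero yield a neutral weight of 0.5.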

According to an embodiment, the analysis unit 120 may include a reconstruction unit 127 including a depth map decoder that generates the depth map using a combination of at least two of the feature information received from the bottleneck unit 123 and the task correlations received from the attention unit 125, a semantic segmentation map (semantic map) decoder that generates the semantic segmentation map, and reconstruction decoders for reconstructing each of the depth domain, the semantic segmentation domain (semantic domain), and the input image domain.

According to an example, the five decoders included in the reconstruction unit 127 may consist of one depth map decoder, one segmentation decoder, and three domain-wise reconstruction decoders, one each for the input data Dd, Ds, and Dt. As an example, the depth map decoder and the semantic segmentation map decoder may be built from four dilated convolution layers whose dilation and padding series are set to [6, 12, 18, 24]. For input image data I, the outputs of the depth map decoder, the semantic segmentation map decoder, and the reconstruction decoders for Dd, Ds, and Dt may be denoted D(I), S(I), Rd(I), Rs(I), and Rt(I), respectively.
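Setting the padding equal to the dilation keeps the spatial size unchanged when the kernel is 3x3 and the stride is 1, which the short sketch below checks with the standard convolution output-size formula. The 3x3 kernel, the stride of 1, and the 64-pixel input size are assumptions for illustration; the text only gives the dilation and padding series.

```python
def conv_output_size(size, kernel=3, dilation=1, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

# Dilation and padding series from the text; the kernel size is assumed.
rates = [6, 12, 18, 24]
sizes = [conv_output_size(64, kernel=3, dilation=r, padding=r) for r in rates]
```

Because every layer preserves the input resolution under these assumptions, the outputs of the four layers can be combined elementwise without resizing.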

According to an embodiment, the depth map decoder may generate the depth map using the depth feature information received from the bottleneck unit, together with the correlation between the depth feature information and the semantic segmentation feature information and the correlation between the depth feature information and the reconstruction feature information, both received from the attention unit.

Referring to Figure 2, the depth map decoder may receive the depth feature information from the first bottleneck module of the bottleneck unit 123, and may receive from the attention unit 125 the correlation between the depth feature information and the semantic segmentation feature information and the correlation between the depth feature information and the reconstruction feature information. The depth map decoder may then generate the depth map using the received information.

According to an embodiment, the semantic segmentation map decoder may generate the semantic segmentation map using the semantic segmentation feature information received from the bottleneck unit, together with the correlation between the semantic segmentation feature information and the depth feature information and the correlation between the semantic segmentation feature information and the reconstruction feature information, both received from the attention unit.

Referring to Figure 2, the semantic segmentation map decoder may receive the semantic segmentation feature information from the second bottleneck module of the bottleneck unit 123, and may receive from the attention unit 125 the correlation between the semantic segmentation feature information and the depth feature information and the correlation between the semantic segmentation feature information and the reconstruction feature information. The semantic segmentation map decoder may then generate the semantic segmentation map using the received information.

According to an example, the training loss for the analysis unit 120 may be the sum of two training losses: a supervision loss and an adaptation loss. The supervision loss is designed to train the network using the original input images and their corresponding ground-truth labels, and the adaptation loss may be designed to perform unified training that reduces the domain gap using the mixed training data (TripleMix).

According to an embodiment, the analysis unit 120 may be trained by supervised learning using labeled depth domain training data, labeled semantic segmentation domain training data, and unlabeled input image domain training data.

For example, the analysis unit 120 may train the depth map decoder and the depth domain reconstruction decoder by supervised learning using the labeled depth domain training data (Dd).

As an example, the depth map decoder may be trained by supervised learning based on the generated depth map and the labels of the depth domain training data. For example, the depth map decoder may generate a depth map for the depth domain training data and may be trained by comparing the generated depth map with the labels of the depth domain training data.

As an example, the depth domain reconstruction decoder may be trained using the depth domain training data (Dd). For example, the depth domain reconstruction decoder may generate depth domain reconstruction data for the depth domain training data and may be trained by comparing the generated depth domain reconstruction data with the depth domain training data itself.

For example, the analysis unit 120 may train the semantic segmentation map decoder and the semantic segmentation domain reconstruction decoder using the labeled semantic segmentation domain training data.

As an example, the semantic segmentation map decoder may be trained by supervised learning based on the generated semantic segmentation map and the labels of the semantic segmentation domain training data. For example, the semantic segmentation map decoder may generate a semantic segmentation map for the semantic segmentation domain training data and may be trained by comparing the generated semantic segmentation map with the labels of the semantic segmentation domain training data.

As an example, the semantic segmentation domain reconstruction decoder may be trained by supervised learning based on the semantic segmentation domain training data. For example, the semantic segmentation domain reconstruction decoder may generate semantic segmentation reconstruction data for the semantic segmentation domain training data and may be trained by comparing the generated semantic segmentation reconstruction data with the semantic segmentation domain training data itself.

According to one example, each of the reconstruction decoders for the depth domain, the semantic segmentation domain, and the input image domain may be trained in a supervised manner based on the reconstructed data and on the depth domain training data, the semantic segmentation domain training data, and the input image domain training data, respectively. For example, each reconstruction decoder may generate reconstruction data corresponding to the domain data it receives, and may be trained by comparing the generated reconstruction data with that input domain data.

According to one embodiment, after the supervised learning, the analysis unit 120 may undergo unsupervised learning based on unlabeled mixed training data generated by combining the depth domain training data, the semantic segmentation domain training data, and the input image domain training data.

According to one embodiment, the mixed training data may be generated by combining depth mixed training data, produced by applying a depth mixing mask to the depth domain training data, with semantic segmentation mixed training data, produced by applying a semantic segmentation mixing mask to the semantic segmentation domain training data, and then filling the remaining region with the input image domain training data.

As an example, the mixed training data 315 may be generated by mixing three images Is, Id, and It, drawn from the training sets 311 of Ds, Dd, and Dt, respectively. To this end, masks for Is and Id that determine the regions to be extracted may be estimated, and the remaining region not covered by the masks for Is and Id may be filled with pixels of It. The masks for Is and Id are denoted Ms and Md, respectively.

Referring to FIG. 4, the mixed training data 420 may be generated by combining depth mixed training data generated by applying a depth domain mixing mask 411 to the depth domain training data, semantic segmentation mixed training data generated by applying a semantic segmentation mixing mask 415 to the semantic segmentation domain training data, and input image mixed training data generated by applying an input image mixing mask 413 to the input image domain training data.

As an example, to handle the long-tail problem, the number of pixels of each class in Is may be divided by the total number of pixels, as in the equation below.

[Equation 1]

Here, c ∈ {1, ..., C} is the class index over the C categories, n_c is the number of pixels labeled with the c-th class, and N is the total number of pixels.
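The per-class ratio described above can be sketched as follows. This is a minimal illustration that assumes the label map is a nested list of integer class indices in 1..C; the function name `class_frequencies` is hypothetical, and any further weighting the original equation may apply to the ratio is not reproduced here.

```python
from collections import Counter


def class_frequencies(label_map, num_classes):
    """Return n_c / N for every class index c in 1..num_classes."""
    flat = [c for row in label_map for c in row]
    counts = Counter(flat)          # n_c: pixels labeled with class c
    total = len(flat)               # N: total number of pixels
    return {c: counts.get(c, 0) / total for c in range(1, num_classes + 1)}


# Toy 2x3 label map with three classes.
labels = [[1, 1, 2],
          [1, 3, 3]]
freq = class_frequencies(labels, num_classes=3)
```

Rare classes receive small ratios, which is what lets a sampler favor them when choosing the regions of Is to extract.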

As an example, to estimate the initial mixing mask for depth (M_o^d), continuous depth values may be split into designated ranges to create per-pixel categories for depth estimation. For example, the thresholds may be fixed as [θ1, θ2, θ3, ..., θNd] to define depth categories with non-overlapping ranges. Here, Nd is the number of depth categories, and the thresholds may be chosen so that pixels are distributed uniformly across all categories. Then, half of the Nd depth categories may be sampled at random from a uniform distribution, and M_o^d may be set to 1 only at pixels belonging to the selected depth categories and to 0 elsewhere.
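The mask construction above can be sketched as follows, under stated assumptions: the depth map is a nested list of floats, equal-frequency thresholds are taken from the sorted depth values of the single image (the text does not say whether thresholds are computed per image or per dataset), and only the Nd - 1 interior thresholds are materialized. All function names are hypothetical.

```python
import random


def equal_frequency_thresholds(depth_map, num_bins):
    """Interior thresholds chosen so each depth category holds about the same pixel count."""
    values = sorted(v for row in depth_map for v in row)
    n = len(values)
    return [values[(k * n) // num_bins] for k in range(1, num_bins)]


def depth_category(value, thresholds):
    """Index of the non-overlapping depth range that contains value."""
    return sum(value >= t for t in thresholds)


def initial_depth_mask(depth_map, num_bins, rng):
    """Set the mask to 1 on pixels whose category is among a random half of the categories."""
    thresholds = equal_frequency_thresholds(depth_map, num_bins)
    chosen = set(rng.sample(range(num_bins), num_bins // 2))
    return [[1 if depth_category(v, thresholds) in chosen else 0 for v in row]
            for row in depth_map]


depth = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0]]
mask = initial_depth_mask(depth, num_bins=4, rng=random.Random(0))
```

Because every category holds the same number of pixels in this toy input, selecting half of the categories marks half of the pixels regardless of which categories are drawn.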

According to one embodiment, in a region where the depth mixed training data generated by applying the depth mixing mask to the depth domain training data overlaps the semantic segmentation mixed training data generated by applying the semantic segmentation mixing mask to the semantic segmentation domain training data, the data with the closer depth, between the depth of the depth mixed training data and the depth of the semantic segmentation mixed training data, may be selected and combined.

According to one example, the overlapping region between M_o^s and M_o^d must be removed before Id and Is are mixed. For example, a depth map predicted by the Exponential Moving Average (EMA) model 320 may be used to account for geometric consistency between Is and Id. Because a nearby object should occlude a distant region, the closer image pixel may be selected at each overlapping pixel of M_o^s and M_o^d. Accordingly, after M_o^s and M_o^d are copied, the refined mixing masks Ms and Md may be estimated for each overlapping pixel position x as in the equation below.

[Equation 2]

Here, GTd(x) denotes the ground-truth depth value of Id at position x.
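The overlap rule described above (the closer pixel wins) can be sketched as follows. This is an illustration only: it assumes the depth of Is comes from the EMA model's prediction and the depth of Id from GTd, and it breaks ties in favor of Id, which the text does not specify. The name `refine_masks` is hypothetical.

```python
def refine_masks(mask_s, mask_d, pred_depth_s, gt_depth_d):
    """Resolve overlapping mask pixels by keeping the image whose pixel is closer."""
    refined_s = [row[:] for row in mask_s]   # copy of the initial segmentation mask
    refined_d = [row[:] for row in mask_d]   # copy of the initial depth mask
    for y, row in enumerate(mask_s):
        for x in range(len(row)):
            if mask_s[y][x] == 1 and mask_d[y][x] == 1:
                if pred_depth_s[y][x] < gt_depth_d[y][x]:
                    refined_d[y][x] = 0      # Is is closer, so Id gives way
                else:
                    refined_s[y][x] = 0      # Id is closer (or tied), so Is gives way
    return refined_s, refined_d


mask_s = [[1, 1, 0]]
mask_d = [[1, 1, 1]]
pred_depth_s = [[2.0, 9.0, 5.0]]   # assumed EMA-predicted depth for Is
gt_depth_d = [[4.0, 3.0, 1.0]]     # ground-truth depth for Id
refined_s, refined_d = refine_masks(mask_s, mask_d, pred_depth_s, gt_depth_d)
```

After refinement the two masks are disjoint, so every pixel of the mixed image comes from exactly one source.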

As an example, the mixed training data may be generated using Ms and Md. The regions of Is and Id may be extracted according to Ms and Md, respectively, and, after Is and Id are mixed, the remaining region may be taken from It to fill the empty area. Accordingly, the mixed training data (TripleMix) image I+ may be generated as in the equation below.

[Equation 3]

Here, the product operator denotes pixel-wise multiplication.
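The mixing step can be sketched as follows. The sketch assumes single-channel images stored as nested lists and disjoint binary masks, and it fills every pixel covered by neither mask from It, as the text describes. The function name `triple_mix` is hypothetical, and the expression is one reading of the description rather than a reproduction of the original equation.

```python
def triple_mix(img_s, img_d, img_t, mask_s, mask_d):
    """Pixel-wise mix: Is where Ms is 1, Id where Md is 1, It everywhere else."""
    mixed = []
    for y in range(len(img_t)):
        row = []
        for x in range(len(img_t[0])):
            ms, md = mask_s[y][x], mask_d[y][x]
            row.append(ms * img_s[y][x] + md * img_d[y][x] + (1 - ms - md) * img_t[y][x])
        mixed.append(row)
    return mixed


img_s = [[10, 10, 10]]
img_d = [[20, 20, 20]]
img_t = [[30, 30, 30]]
mixed = triple_mix(img_s, img_d, img_t, mask_s=[[1, 0, 0]], mask_d=[[0, 1, 0]])
```

The term `(1 - ms - md)` is only a valid selector when the masks do not overlap, which is why the overlap removal precedes mixing.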

As an example, the pseudo labels of I+ may be obtained using the ground-truth maps and the predictions of the Exponential Moving Average (EMA) model. For example, with the ground-truth segmentation map and the ground-truth depth map denoted GT and GTd, respectively, the pseudo labels of I+ for semantic segmentation (S+) and depth estimation (D+) may be obtained as in the equation below.

[Equation 4]

For post-processing, additional prior knowledge may be used to correct the pseudo labels of I+. As an example, because the depth of a pixel labeled 'sky' must be infinite, the D+ value of any pixel labeled 'sky' in S+ may be set to infinity. Conversely, where the depth value of D+ at a position is infinite, the class of the corresponding pixel in S+ may be changed to 'sky'.
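The two-way sky rule can be sketched as follows. How an infinite depth is encoded is not stated in the text; this sketch assumes `math.inf`, and the names `apply_sky_prior` and `sky_class` are hypothetical.

```python
import math


def apply_sky_prior(seg, depth, sky_class):
    """Make 'sky' labels and infinite depths agree in both directions."""
    seg = [row[:] for row in seg]
    depth = [row[:] for row in depth]
    for y in range(len(seg)):
        for x in range(len(seg[0])):
            if seg[y][x] == sky_class:
                depth[y][x] = math.inf       # sky pixels are infinitely far away
            elif math.isinf(depth[y][x]):
                seg[y][x] = sky_class        # infinitely far pixels are relabeled as sky
    return seg, depth


seg_plus = [[0, 2, 1]]
depth_plus = [[12.0, 30.0, math.inf]]
new_seg, new_depth = apply_sky_prior(seg_plus, depth_plus, sky_class=2)
```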

According to one embodiment, the analysis unit may undergo unsupervised learning based on the difference between (i) the pseudo depth labels and pseudo semantic segmentation labels generated by inputting the domain training data, the semantic segmentation domain training data, and the input image domain training data into an Exponential Moving Average (EMA) model having the same model structure as the analysis unit, and (ii) the depth map and semantic segmentation map generated by inputting the mixed training data into the analysis unit.
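The text does not state how the EMA model is maintained. A common convention, shown here purely as an assumption, is to blend each EMA weight toward the corresponding weight of the trained model after every update step. The dictionary-of-floats representation, the name `ema_update`, and the decay value are all hypothetical.

```python
def ema_update(ema_weights, student_weights, decay=0.99):
    """Return new EMA weights: decay * old EMA + (1 - decay) * current student weight."""
    return {name: decay * ema_weights[name] + (1.0 - decay) * student_weights[name]
            for name in ema_weights}


ema = {"w": 1.0}
student = {"w": 0.0}
updated = ema_update(ema, student, decay=0.9)
```

A decay close to 1 makes the EMA model change slowly, which tends to give steadier pseudo labels than the rapidly changing trained model.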

As an example, before the training loss is estimated, the pixel-wise confidence of the pseudo labels of the mixed training data (TripleMix) image may first be predicted. The confidence for the mixed training data (TripleMix) image may be estimated by multiplying the inconsistency map of the reconstructed images by the uncertainty score of the pseudo labels.

According to one example, if the embedding features are well integrated in the backbone network, the outputs of the three reconstruction modules should be identical to the mixed training data (TripleMix) image, even though the modules were trained separately on different domains. On this basis, the inconsistency among the images reconstructed by the three reconstruction modules may be exploited. For example, based on the Euclidean distance between reconstructed images, a segmentation-task inconsistency map for regions without segmentation labels and a depth-task inconsistency map for regions without depth labels may be obtained, as expressed below.

[Equation 5]

Here, the clipping operation limits its input value to the range from 0 to 1.

According to one example, the maximum class probability may be used to determine the uncertainty score of the segmentation pseudo labels in Id and It. For example, the number of pixels whose maximum semantic segmentation probability exceeds a predefined threshold may be computed as in the equation below.

[Equation 6]

Here, X contains all possible positions in the image, and the δ function outputs 1 if the given condition is satisfied and 0 otherwise. ζs denotes a user-defined hyperparameter.
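The counting step can be sketched as follows. The sketch assumes each pixel carries a list of class probabilities and normalizes the count by the number of positions in X; that normalization is an assumption, since the text only describes counting. The name `confident_pixel_ratio` is hypothetical.

```python
def confident_pixel_ratio(prob_maps, zeta_s):
    """Fraction of pixels whose maximum class probability exceeds zeta_s."""
    total = 0
    confident = 0
    for row in prob_maps:
        for probs in row:
            total += 1
            if max(probs) > zeta_s:   # delta(...) = 1 when the condition holds
                confident += 1
    return confident / total


# 2x2 image, two classes per pixel.
probs = [[[0.9, 0.1], [0.6, 0.4]],
         [[0.2, 0.8], [0.5, 0.5]]]
ratio = confident_pixel_ratio(probs, zeta_s=0.7)
```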

As an example, the uncertainty score for depth estimation may be estimated from the variance under augmentation. Based on the intuition that an uncertain prediction will be unstable across augmented input images, the uncertainty score for the depth pseudo labels may be estimated as in the equation below.

[Equation 7]

Here, the σ function outputs the variance of its input vector, Ij is an augmented image of I with j ∈ {1, ..., J}, and ζd denotes a user-defined hyperparameter.
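The variance-under-augmentation idea can be sketched as follows. The text does not say how the variance is combined with ζd, so the thresholding below (a pixel counts as stable when its variance is below ζd) is an assumption made by analogy with the segmentation score. The sketch also assumes the J predictions are already aligned to the same pixel grid; all names are hypothetical.

```python
from statistics import pvariance


def depth_variance_map(predictions):
    """Per-pixel variance of the depth predicted for J augmented copies of one image."""
    height, width = len(predictions[0]), len(predictions[0][0])
    return [[pvariance([p[y][x] for p in predictions]) for x in range(width)]
            for y in range(height)]


def stable_pixel_ratio(predictions, zeta_d):
    """Fraction of pixels whose variance across augmentations stays below zeta_d."""
    flat = [v for row in depth_variance_map(predictions) for v in row]
    return sum(v < zeta_d for v in flat) / len(flat)


# Three augmented predictions for a 1x2 image.
preds = [[[1.0, 5.0]],
         [[1.0, 7.0]],
         [[1.0, 9.0]]]
var_map = depth_variance_map(preds)
ratio_d = stable_pixel_ratio(preds, zeta_d=1.0)
```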

According to one example, from the viewpoint of a labeled domain, the other two domains are unlabeled. Accordingly, the reliability of the pseudo labels in the two unlabeled domains may be computed. For example, Bs and Bd may be obtained as the confidence maps for the semantic segmentation and depth tasks, respectively, by multiplying the estimated inconsistency map by the uncertainty score, as in the equation below.

[Equation 8]

According to one example, the analysis unit may perform three tasks: semantic segmentation, depth estimation, and domain-specific reconstruction. As an example, the cross-entropy loss for semantic segmentation may be computed as in the equation below.

[Equation 9]

Here, S and its predicted counterpart denote the target and predicted segmentation maps, respectively, B is the weight map applied to the expectation, and X is the set of all positions in the input image.
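A weighted cross-entropy of this kind can be sketched as follows. The sketch assumes the target map holds integer class indices, the prediction holds per-pixel class probabilities, and the weighted terms are averaged over all positions in X; the averaging and the small clamp inside the logarithm are assumptions, and the name `weighted_cross_entropy` is hypothetical.

```python
import math


def weighted_cross_entropy(target, probs, weight):
    """Mean over pixels of -B(x) * log(predicted probability of the target class)."""
    total = 0.0
    count = 0
    for y in range(len(target)):
        for x in range(len(target[0])):
            p = probs[y][x][target[y][x]]
            total += -weight[y][x] * math.log(max(p, 1e-12))  # clamp avoids log(0)
            count += 1
    return total / count


target = [[0, 1]]
probs = [[[0.5, 0.5], [0.25, 0.75]]]
weight = [[1.0, 0.0]]   # the second pixel is treated as unreliable and ignored
loss = weighted_cross_entropy(target, probs, weight)
```

Setting a weight to 0 removes that pixel's contribution, which is how a confidence map can suppress unreliable pseudo labels.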

As an example, the depth estimation loss may be computed from the difference between the actual depth map and the predicted depth map, as in the equation below.

[Equation 10]

Here, D and its predicted counterpart are the target and predicted depth maps, respectively, and the norm applied is the inverted Huber norm.
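The inverted (reverse) Huber penalty is commonly defined as |r| for small residuals and (r² + c²) / (2c) for residuals larger than a cutoff c. The sketch below uses that standard definition; choosing c as 20% of the largest absolute residual and averaging over pixels are assumptions not taken from the text, and the function names are hypothetical.

```python
def berhu(residual, c):
    """Inverted Huber penalty: linear below the cutoff c, quadratic above it."""
    a = abs(residual)
    return a if a <= c else (a * a + c * c) / (2.0 * c)


def depth_loss(target, predicted, c_ratio=0.2):
    """Mean inverted Huber penalty over all pixels, with c tied to the largest residual."""
    residuals = [t - p for t_row, p_row in zip(target, predicted)
                 for t, p in zip(t_row, p_row)]
    c = c_ratio * max(abs(r) for r in residuals)
    if c == 0.0:
        return 0.0   # prediction matches the target everywhere
    return sum(berhu(r, c) for r in residuals) / len(residuals)


loss_d = depth_loss(target=[[2.0, 10.0]], predicted=[[1.5, 0.0]])
```

Compared with a plain L1 loss, the quadratic branch penalizes large depth errors more strongly while keeping L1 behavior for small ones.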

As an example, the reconstruction loss may be expressed using the Frobenius norm, as in the equation below.

[Equation 11]

Here, I and its reconstructed counterpart denote the target image and the reconstructed image, respectively, and the norm is the Frobenius norm.
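The Frobenius norm of the difference between two images is the square root of the sum of squared pixel differences. The sketch below computes exactly that for single-channel images; whether the original loss squares or normalizes this value is not recoverable from the text, and the name `frobenius_reconstruction_loss` is hypothetical.

```python
import math


def frobenius_reconstruction_loss(image, reconstructed):
    """Frobenius norm of (image - reconstructed) for two equally sized 2D arrays."""
    squared = sum((a - b) ** 2
                  for row_a, row_b in zip(image, reconstructed)
                  for a, b in zip(row_a, row_b))
    return math.sqrt(squared)


image = [[1.0, 2.0], [3.0, 4.0]]
recon = [[1.0, 2.0], [3.0, 1.0]]
loss_r = frobenius_reconstruction_loss(image, recon)
```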

As an example, the total training loss may be expressed in terms of the supervised loss L_su and the adaptation loss L_da, as in the equation below.

[Equation 12]

As an example, the total loss may be used to update the entire network architecture, including the backbone network.

For example, in supervised learning, the loss of each domain may be computed on the task for which that domain is labeled. To also account for the training of the initial task modules, the supervised loss may be defined as in the equation below.

[Equation 13]

Here, 1 denotes a map of the corresponding size whose entries are all 1.

As an example, for domain adaptation, L_da may be set so that the reliable pseudo labels of the mixed training data (TripleMix) image are learned according to the reliability of those pseudo labels. For example, the training loss for the mixed training data (TripleMix) image may be computed by treating the pseudo labels as the target maps, as in the equation below.

[Equation 14]

FIG. 5 is a flowchart illustrating an image analysis method according to an embodiment.

According to one embodiment, the image analysis device may receive input image data (510) and may generate a depth map and a semantic segmentation map from the input image data (520).

According to one embodiment, the image analysis device may extract feature information from the input image data using an encoder; generate depth feature information, semantic segmentation feature information, and reconstruction feature information based on the features of the encoder using a bottleneck module; calculate, using an attention module, the task correlation between the depth feature information and the semantic segmentation feature information, the correlation between the depth feature information and the reconstruction feature information, and the correlation between the semantic segmentation feature information and the reconstruction feature information; and, using a decoder, generate a depth map, generate a semantic segmentation map, and reconstruct each of the depth domain, the semantic segmentation domain, and the input image domain from a combination of at least two of the feature information and the task correlations.

In describing the embodiment of FIG. 5, descriptions that overlap with those given with reference to FIGS. 1 to 4 are omitted.

FIG. 6 is a block diagram illustrating a computing environment 10 that includes a computing device suitable for use in the exemplary embodiments. In the illustrated embodiment, each component may have functions and capabilities different from those described below, and components other than those described below may additionally be included.

The illustrated computing environment 10 includes a computing device 12. In one embodiment, the computing device 12 may be the image analysis device 100.

The computing device 12 includes at least one processor 14, a computer-readable storage medium 16, and a communication bus 18. The processor 14 may cause the computing device 12 to operate in accordance with the exemplary embodiments described above. For example, the processor 14 may execute one or more programs stored in the computer-readable storage medium 16. The one or more programs may include one or more computer-executable instructions which, when executed by the processor 14, may be configured to cause the computing device 12 to perform operations according to the exemplary embodiments.

The computer-readable storage medium 16 is configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. A program 20 stored in the computer-readable storage medium 16 includes a set of instructions executable by the processor 14. In one embodiment, the computer-readable storage medium 16 may be memory (volatile memory such as random access memory, non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, any other form of storage medium that can be accessed by the computing device 12 and can store the desired information, or a suitable combination thereof.

The communication bus 18 interconnects the various other components of the computing device 12, including the processor 14 and the computer-readable storage medium 16.

The computing device 12 may also include one or more input/output interfaces 22, which provide an interface for one or more input/output devices 24, and one or more network communication interfaces 26. The input/output interface 22 and the network communication interface 26 are connected to the communication bus 18. The input/output device 24 may be connected to the other components of the computing device 12 through the input/output interface 22. Exemplary input/output devices 24 may include input devices such as a pointing device (a mouse, a trackpad, or the like), a keyboard, a touch input device (a touchpad, a touchscreen, or the like), a voice or sound input device, various types of sensor devices, and/or imaging devices, and/or output devices such as a display device, a printer, a speaker, and/or a network card. An exemplary input/output device 24 may be included inside the computing device 12 as one of its components, or may be connected to the computing device 12 as a separate device distinct from it.

The present invention has been described above with a focus on its preferred embodiments. A person of ordinary skill in the art to which the present invention pertains will understand that the invention may be implemented in modified forms without departing from its essential characteristics. Accordingly, the scope of the present invention is not limited to the embodiments described above, and should be construed to include various embodiments within a scope equivalent to what is set forth in the claims.

100: image analysis device
110: input unit
120: analysis unit
121: encoder unit
123: bottleneck unit
125: attention unit
127: reconstruction unit

Claims

An image analysis device comprising:
an input unit that receives input image data; and
an analysis unit that generates a depth map and a semantic segmentation map from the input image data,
wherein the analysis unit comprises:
an encoder unit including an encoder that extracts feature information from the input image data;
a bottleneck unit including a first bottleneck module that generates depth feature information based on the feature information of the encoder, a second bottleneck module that generates semantic segmentation feature information based on the feature information of the encoder, and a third bottleneck module that generates reconstruction feature information based on the feature information of the encoder;
an attention unit including a first attention module that calculates a correlation (task correlation) between the depth feature information and the semantic segmentation feature information received from the bottleneck unit, a second attention module that calculates a correlation between the depth feature information and the reconstruction feature information, and a third attention module that calculates a correlation between the semantic segmentation feature information and the reconstruction feature information; and
a reconstruction unit including a depth map decoder that generates the depth map using a combination of at least two of the feature information received from the bottleneck unit and the task correlations received from the attention unit, a semantic segmentation map decoder that generates the semantic segmentation map, and reconstruction decoders for reconstructing each of a depth domain, a semantic segmentation domain, and an input image domain,
wherein the analysis unit undergoes supervised learning using labeled depth domain training data, labeled semantic segmentation domain training data, and unlabeled input image domain training data, and, after the supervised learning, undergoes unsupervised learning based on unlabeled mixed training data generated by combining the depth domain training data, the semantic segmentation domain training data, and the input image domain training data, and
wherein the mixed training data is generated by combining depth mixed training data, generated by applying to the depth domain training data a depth mixing mask that determines an extraction region, with semantic segmentation mixed training data, generated by applying to the semantic segmentation domain training data a semantic segmentation mixing mask that determines an extraction region, and then combining the input image domain training data into the remaining region that corresponds to neither the depth mixing mask nor the semantic segmentation mixing mask.

delete

The image analysis device according to claim 1,
wherein the depth map decoder generates the depth map using the depth feature information received from the bottleneck unit, together with the correlation between the depth feature information and the semantic segmentation feature information and the correlation between the depth feature information and the reconstruction feature information, both received from the attention unit.

The image analysis device according to claim 1,
wherein the semantic segmentation map decoder generates the semantic segmentation map using the semantic segmentation feature information received from the bottleneck unit, together with the correlation between the semantic segmentation feature information and the depth feature information and the correlation between the semantic segmentation feature information and the reconstruction feature information, both received from the attention unit.

delete

The image analysis device according to claim 1, wherein:
the depth map decoder is trained in a supervised manner based on the generated depth map and the labels of the depth domain training data;
the semantic segmentation map decoder is trained in a supervised manner based on the generated semantic segmentation map and the labels of the semantic segmentation domain training data; and
each of the reconstruction decoders for reconstructing the depth domain, the semantic segmentation domain, and the input image domain is trained in a supervised manner based on the reconstructed data and on the depth domain training data, the semantic segmentation domain training data, and the input image domain training data.

delete

The image analysis device according to claim 1,
wherein, in a region where the depth mixed training data generated by applying the depth mixing mask to the depth domain training data overlaps the semantic segmentation mixed training data generated by applying the semantic segmentation mixing mask to the semantic segmentation domain training data, the data with the closer depth, between the depth of the depth mixed training data and the depth of the semantic segmentation mixed training data, is selected and combined.

The image analysis device according to claim 1,
wherein the analysis unit undergoes unsupervised learning based on the difference between
pseudo depth labels and pseudo semantic segmentation labels generated by inputting the domain training data, the semantic segmentation domain training data, and the input image domain training data into an Exponential Moving Average (EMA) model having the same model structure as the analysis unit, and
a depth map and a semantic segmentation map generated by inputting the mixed training data into the analysis unit.

An image analysis method comprising:
an input step of receiving input image data; and
an analysis step of generating a depth map and a semantic segmentation map from the input image data,
wherein the analysis step comprises:
extracting feature information from the input image data using an encoder;
generating depth feature information, semantic segmentation feature information, and reconstruction feature information based on the features of the encoder using a bottleneck module;
calculating, using an attention module, a correlation (task correlation) between the depth feature information and the semantic segmentation feature information, a correlation between the depth feature information and the reconstruction feature information, and a correlation between the semantic segmentation feature information and the reconstruction feature information; and
using a decoder, generating a depth map, generating a semantic segmentation map, and reconstructing each of a depth domain, a semantic segmentation domain, and an input image domain from a combination of at least two of the feature information and the task correlations,
wherein, in the analysis step, supervised learning is performed using labeled depth domain training data, labeled semantic segmentation domain training data, and unlabeled input image domain training data, and, after the supervised learning, unsupervised learning is performed based on unlabeled mixed training data generated by combining the depth domain training data, the semantic segmentation domain training data, and the input image domain training data, and
wherein the mixed training data is generated by combining depth mixed training data, generated by applying to the depth domain training data a depth mixing mask that determines an extraction region, with semantic segmentation mixed training data, generated by applying to the semantic segmentation domain training data a semantic segmentation mixing mask that determines an extraction region, and then combining the input image domain training data into the remaining region that corresponds to neither the depth mixing mask nor the semantic segmentation mixing mask.

delete

The image analysis method according to claim 11,
wherein the decoder for generating the depth map generates the depth map using the depth feature information, the correlation between the depth feature information and the semantic segmentation feature information, and the correlation between the depth feature information and the reconstruction feature information.

The image analysis method according to claim 11,
wherein the decoder for generating the semantic segmentation map generates the semantic segmentation map using the semantic segmentation feature information, the correlation between the semantic segmentation feature information and the depth feature information received from the attention unit, and the correlation between the semantic segmentation feature information and the reconstruction feature information.

delete

The image analysis method according to claim 11, wherein:
the decoder for generating the depth map is trained in a supervised manner based on the generated depth map and the labels of the depth domain training data;
the decoder for generating the semantic segmentation map is trained in a supervised manner based on the generated semantic segmentation map and the labels of the semantic segmentation domain training data; and
each of the reconstruction decoders for reconstructing the depth domain, the semantic segmentation domain, and the input image domain is trained in a supervised manner based on the reconstructed data and on the depth domain training data, the semantic segmentation domain training data, and the input image domain training data.

delete

The image analysis method according to claim 11,
wherein, in a region where the depth mixed training data generated by applying the depth mixing mask to the depth domain training data overlaps the semantic segmentation mixed training data generated by applying the semantic segmentation mixing mask to the semantic segmentation domain training data, the data with the closer depth, between the depth of the depth mixed training data and the depth of the semantic segmentation mixed training data, is selected and combined.

The image analysis method according to claim 11,
wherein, in the analysis step, unsupervised learning is performed based on the difference between
pseudo depth labels and pseudo semantic segmentation labels generated by inputting the domain training data, the semantic segmentation domain training data, and the input image domain training data into an Exponential Moving Average (EMA) model having the same model structure as the analysis model, and
a depth map and a semantic segmentation map generated by inputting the mixed training data into the analysis model.