KR102659399B1

KR102659399B1 - Device and Method for Zero Shot Semantic Segmentation

Info

Publication number: KR102659399B1
Application number: KR1020210165598A
Authority: KR
Inventors: 함범섭; 백동현; 오영민
Original assignee: 연세대학교 산학협력단
Priority date: 2021-11-26
Filing date: 2021-11-26
Publication date: 2024-04-19
Also published as: WO2023096011A1; KR20230078134A

Abstract

A zero-shot semantic segmentation apparatus and method are disclosed. The disclosed apparatus includes: a visual encoder that receives an input image and outputs a visual feature map through a neural network operation; a semantic encoder that receives a feature vector for each class and outputs a prototype vector for each class through a neural network operation; and a semantic segmentation unit that assigns a class to each pixel of the visual feature map by comparing the class-specific prototype vectors with the per-pixel channel vectors of the visual feature map. The semantic segmentation unit assigns to a given pixel the class corresponding to the prototype vector most similar to that pixel's channel vector, the prototype vectors and the channel vectors are set to the same length, and the visual encoder and the semantic encoder are trained simultaneously while sharing at least one common loss. According to the disclosed apparatus and method, semantic segmentation of untrained classes is performed in a discriminative manner, so continual classifier retraining is not required, and the bias problem of classifying untrained classes as previously trained classes can be reduced.

Description

Device and Method for Zero Shot Semantic Segmentation

The present invention relates to a semantic segmentation apparatus and method, and more specifically to a zero-shot semantic segmentation apparatus and method capable of segmenting classes that were not learned during training.

Semantic segmentation divides an input image into regions, each corresponding to an identifiable class, and can be applied in fields such as autonomous driving, medical imaging, and image editing. Its goal is to label every pixel of the input image with the designated class of the object it belongs to, such as person, car, or bicycle.

Deep-learning-based semantic segmentation techniques that use artificial neural networks such as convolutional neural networks (CNNs) perform well, but training them requires a large amount of pixel-level training data in which the class of each object is labeled pixel by pixel so that the object region of each class is accurately delineated.

Zero-shot semantic segmentation, by contrast, is a technique that uses additional information to segment not only the classes learned during training but also classes that were not learned.

Conventional zero-shot semantic segmentation has relied on generative methods. A generative method performs semantic segmentation through multiple stages and achieves zero-shot segmentation by separately generating features for untrained classes. In the final stage, it generates features for the untrained classes and feeds them to a classifier to perform semantic segmentation.

Because such generative methods perform semantic segmentation without taking semantic features into account, they give rise to a bias problem, that is, the tendency to classify an untrained class as one of the trained classes.

In addition, conventional generative methods require the classifier to be retrained every time a class is added or removed, which makes them difficult to use in practice.

The present invention proposes a zero-shot semantic segmentation apparatus and method that perform semantic segmentation of untrained classes in a discriminative manner and therefore do not require continual classifier retraining.

The present invention also proposes a zero-shot semantic segmentation apparatus and method that can reduce the bias problem of classifying untrained classes as previously trained classes.

According to one aspect of the present invention, there is provided a semantic segmentation apparatus including: a visual encoder that receives an input image and outputs a visual feature map through a neural network operation; a semantic encoder that receives a feature vector for each class and outputs a prototype vector for each class through a neural network operation; and a semantic segmentation unit that assigns a class to each pixel of the visual feature map by comparing the class-specific prototype vectors with the per-pixel channel vectors of the visual feature map, wherein the semantic segmentation unit assigns to a given pixel the class corresponding to the prototype vector most similar to that pixel's channel vector, the prototype vectors and the channel vectors are set to the same length, and the visual encoder and the semantic encoder are trained simultaneously while sharing at least one common loss.

The loss shared by the visual encoder and the semantic encoder includes a prototype loss, which corresponds to the loss between the prototype vector of a specific class output by the semantic encoder and the central value of the channel vectors of that class in the visual feature map.

Based on the prototype loss, training proceeds so that the central value of the channel vectors of a specific class output by the visual encoder becomes equal to the prototype vector of that class output by the semantic encoder.

The loss shared by the visual encoder and the semantic encoder also includes a cross-entropy loss. Through the cross-entropy loss, the visual encoder is trained so that channel vectors of the same class lie relatively close together in the embedding space and channel vectors of different classes lie relatively far apart.

The semantic encoder is trained with a semantic loss so that the distances between the class-specific feature vectors input to the semantic encoder and the distances between the class-specific prototype vectors output by the semantic encoder become equal.
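The distance-matching idea behind the semantic loss can be illustrated with a minimal sketch. This text does not give the exact form of the loss, so the Euclidean distance, the mean-squared gap, and the function names `pairwise_distances` and `semantic_loss` are assumptions made only for illustration.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distance between every pair of rows: (K, D) -> (K, K)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def semantic_loss(word_vectors, prototypes):
    """Mean squared gap between input-space and output-space distances.

    This is an assumed form; it is zero exactly when the encoder
    preserves all pairwise distances between classes.
    """
    d_in = pairwise_distances(word_vectors)
    d_out = pairwise_distances(prototypes)
    return float(np.mean((d_in - d_out) ** 2))

# Hypothetical class feature vectors (K = 3 classes, D = 2)
word_vectors = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# A rotation preserves pairwise distances, so the loss vanishes.
rotation = np.array([[0.0, -1.0], [1.0, 0.0]])
loss_same = semantic_loss(word_vectors, word_vectors @ rotation)

# Collapsing every prototype to one point destroys the distance structure.
loss_collapsed = semantic_loss(word_vectors, np.zeros((3, 2)))
```

A loss of this kind only constrains the relative geometry of the prototypes, so it can be combined with other losses that fix where the prototypes sit in the embedding space.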

To compute the prototype loss, a first semantic segmentation map, generated by applying the class-specific feature vectors to the input image, is input to the semantic encoder.

The prototype loss using the first semantic segmentation map is computed as in the following equation.

In the above equation, L_center is the prototype loss, c is a class, S is the set of all classes, p is a pixel, R_c is the set of pixels of a specific class, v(p) is the channel vector at pixel position p in the visual feature map, μ(p) is the prototype vector output by the semantic encoder when the feature vector at pixel position p of the first semantic segmentation map is input, and d() is a function that outputs the distance between its two arguments.

Alternatively, to compute the prototype loss, a second semantic segmentation map is input to the semantic encoder. The second map is obtained by applying a feature vector to each class of the input image, downscaling the image to which the class-specific feature vectors have been applied, and then enlarging the downscaled image back to the original image size by linear interpolation.

The prototype loss using the second semantic segmentation map is computed as in the following equation.

In the above equation, L_bar is the prototype loss, c is a class, S is the set of all classes, p is a pixel, R_c is the set of pixels of a specific class, v(p) is the channel vector at pixel position p in the visual feature map, the remaining symbol (not reproduced in this text) is the prototype vector output by the semantic encoder when the feature vector at pixel position p of the second semantic segmentation map is input, and d() is a function that outputs the distance between its two arguments.
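The equation bodies for the two prototype losses are not reproduced in this text. Based only on the symbol definitions given for them, one form they could take is sketched below. The normalization by the total pixel count and the symbol \tilde{\mu}(p) for the prototype obtained from the second semantic segmentation map are assumptions, not the patent's stated formulas.

```latex
\mathcal{L}_{center} = \frac{1}{\sum_{c \in S} |R_c|} \sum_{c \in S} \sum_{p \in R_c} d\big(v(p), \mu(p)\big)

\mathcal{L}_{bar} = \frac{1}{\sum_{c \in S} |R_c|} \sum_{c \in S} \sum_{p \in R_c} d\big(v(p), \tilde{\mu}(p)\big)
```

Under this reading, the two losses differ only in which semantic segmentation map supplies the per-pixel prototype.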

According to another aspect of the present invention, there is provided a semantic segmentation method including: (a) receiving an input image and outputting a visual feature map from a visual encoder through a neural network operation; (b) receiving a feature vector for each class and outputting a prototype vector for each class from a semantic encoder through a neural network operation; and (c) assigning a class to each pixel of the visual feature map by comparing the class-specific prototype vectors with the per-pixel channel vectors of the visual feature map, wherein step (c) assigns to a given pixel the class corresponding to the prototype vector most similar to that pixel's channel vector, the prototype vectors and the channel vectors are set to the same length, and the visual encoder and the semantic encoder are trained simultaneously while sharing at least one common loss.

Therefore, according to the present invention, semantic segmentation of untrained classes is performed in a discriminative manner, so continual classifier retraining is not required, and the bias problem of classifying untrained classes as previously trained classes can be reduced.

FIG. 1 is a block diagram showing the overall structure of a zero-shot semantic segmentation apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the embedding space of channel vectors output by the visual encoder according to an embodiment of the present invention.
FIG. 3 is a diagram showing the principle of performing semantic segmentation on an untrained class according to an embodiment of the present invention.
FIG. 4 is a diagram showing the training structure of the visual encoder and the semantic encoder according to an embodiment of the present invention.
FIG. 5 is a diagram showing the principle of generating the first semantic segmentation map used for training the semantic segmentation apparatus according to an embodiment of the present invention.
FIG. 6 is a diagram showing the principle of generating the second semantic segmentation map used for training the semantic segmentation apparatus according to an embodiment of the present invention.
FIG. 7 is a diagram conceptually illustrating the prototype loss according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating the semantic loss according to an embodiment of the present invention.
FIG. 9 is a flowchart showing a training method of the zero-shot semantic segmentation apparatus according to an embodiment of the present invention.

In order to fully understand the present invention, its operational advantages, and the objectives achieved by practicing it, reference should be made to the accompanying drawings illustrating preferred embodiments of the present invention and to the contents described in those drawings.

Hereinafter, the present invention is described in detail by explaining preferred embodiments with reference to the accompanying drawings. The present invention may, however, be implemented in many different forms and is not limited to the described embodiments. To describe the present invention clearly, parts not relevant to the description are omitted, and like reference numerals in the drawings denote like members.

Throughout the specification, when a part is said to "include" a certain element, this means that it may further include other elements rather than excluding them, unless specifically stated to the contrary. In addition, terms such as "...unit", "...er", "module", and "block" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.

FIG. 1 is a block diagram showing the overall structure of a zero-shot semantic segmentation apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the zero-shot semantic segmentation apparatus according to an embodiment of the present invention includes a visual encoder 100, a semantic encoder 200, and a semantic segmentation unit 300.

An image 50 to be semantically segmented is input to the visual encoder 100. The visual encoder 100 outputs a visual feature map 150 for the input image 50 through a neural network operation, for example that of a CNN. When the input image 50 has a size of H₁ × W₁, the visual feature map 150 output by the visual encoder 100 may have a size of H₂ × W₂ × C. Here, H denotes height and W denotes width. The height and width of the visual feature map may be the same as those of the input image or may be set differently. C denotes the length of the channel vector held by each pixel of the visual feature map. The visual feature map 150 thus has one channel vector per pixel, and the visual encoder 100 is trained to output similar channel vectors for pixels belonging to the same object class in the input image 50.

FIG. 2 is a diagram showing an example of the embedding space of channel vectors output by the visual encoder according to an embodiment of the present invention.

In FIG. 2, the blue points are the channel vectors of pixels representing a car, and the green points are the channel vectors of pixels representing a bicycle. As shown in FIG. 2, the visual encoder 100 is trained so that the channel vectors of car pixels lie adjacent to one another in the embedding space, and likewise so that the channel vectors of bicycle pixels lie adjacent to one another.

Meanwhile, the feature vectors of the classes to be segmented are input to the semantic encoder 200. For example, when the classes to be segmented are car and bicycle, the feature vector of the car and the feature vector of the bicycle are input to the semantic encoder 200. Having received these feature vectors, the semantic encoder 200 outputs, through a neural network operation, a prototype vector 250 corresponding to the feature vector of each class. For example, when a feature vector for a car and a feature vector for a bicycle are input, the semantic encoder 200 outputs a prototype vector for the car and a prototype vector for the bicycle. According to an embodiment of the present invention, the semantic encoder 200 may be a fully connected (FC) neural network, but is not limited thereto.

Here, the feature vectors input to the semantic encoder 200 can be obtained from a commercially available database. For example, a database such as Wikipedia provides a feature vector for each class, and such commercially obtainable feature vectors are input to the semantic encoder 200.

According to an embodiment of the present invention, the length of the prototype vectors output by the semantic encoder 200 and the length of the per-pixel channel vectors output by the visual encoder 100 are set to be the same.

The semantic segmentation unit 300 performs semantic segmentation using the class-specific prototype vectors output by the semantic encoder 200 and the visual feature map output by the visual encoder 100. It compares each channel vector of the visual feature map with the class-specific prototype vectors and assigns one class to each pixel of the visual feature map. Specifically, the semantic segmentation unit 300 computes the similarity between the channel vector of a given pixel and each of the class-specific prototype vectors output by the semantic encoder 200. It then identifies the prototype vector with the highest similarity to that pixel's channel vector and assigns the class of that prototype vector to the pixel. Once this per-pixel class assignment has been carried out for all pixels, semantic segmentation over the classes input to the semantic encoder 200 is complete. For example, if the classes input to the semantic encoder 200 are bicycle, car, and background, each pixel is assigned one of bicycle, car, or background.

In short, the semantic segmentation unit 300 performs semantic segmentation by finding, for each pixel, the class-specific prototype vector most similar to that pixel's channel vector and assigning the class of that prototype vector to the pixel.
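The per-pixel assignment described above amounts to a nearest-prototype search. The following is a minimal NumPy sketch, not the patented implementation: the similarity measure is not fixed in this text, so negative Euclidean distance is assumed, and the function name `assign_classes` is hypothetical.

```python
import numpy as np

def assign_classes(feature_map, prototypes):
    """Label each pixel with the index of its most similar prototype.

    feature_map: (H, W, C) array of per-pixel channel vectors.
    prototypes:  (K, C) array, one prototype vector per class.
    Similarity is assumed to be negative Euclidean distance.
    """
    h, w, c = feature_map.shape
    pixels = feature_map.reshape(-1, c)                  # (H*W, C)
    # Squared distance from every pixel to every prototype: (H*W, K)
    diff = pixels[:, None, :] - prototypes[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    return np.argmin(dist2, axis=1).reshape(h, w)

# Toy example: 2x2 feature map with C = 2 and two classes (0 and 1)
prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])
feature_map = np.array([[[0.1, 0.0], [0.9, 1.1]],
                        [[0.8, 0.9], [0.2, -0.1]]])
labels = assign_classes(feature_map, prototypes)
```

Because the channel vectors and prototype vectors share the same length C, the comparison needs no extra projection layer.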

The description given with reference to FIG. 1 covers the operation of the semantic segmentation apparatus at inference time, after training of the visual encoder 100 and the semantic encoder 200 has been completed.

The semantic segmentation apparatus shown in FIG. 1 also enables semantic segmentation of classes that were not learned in advance.

Assume that the classes previously learned by the visual encoder 100 and the semantic encoder 200 are bicycle, car, and background. If a user of the neural network decides that persons also need to be segmented, the user can input a feature vector for person to the semantic encoder 200, which makes zero-shot semantic segmentation possible for this class even though it was not learned in advance.

In this case, the user inputs the feature vectors for bicycle, car, and person together to the semantic encoder 200. The semantic encoder 200 outputs, in the manner it has already learned, a prototype vector for the bicycle, a prototype vector for the car, and a prototype vector for the person.

FIG. 3 is a diagram showing the principle of performing semantic segmentation on an untrained class according to an embodiment of the present invention.

Referring to FIG. 3, the input image 50 is input to the visual encoder 100. The feature vectors for car and bicycle, which are previously learned classes, and the feature vector for person, which is a class not learned in advance, are input to the semantic encoder 200.

The semantic segmentation unit 300 performs semantic segmentation using the prototype vectors for the bicycle, the car, and the person output by the semantic encoder 200.

As shown in FIG. 3, pixels whose channel vectors are similar to the person prototype vector in the embedding space are classified as the person class, pixels whose channel vectors are similar to the car prototype vector are classified as the car class, and pixels whose channel vectors are similar to the bicycle prototype vector are classified as the bicycle class.
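The zero-shot step described above can be sketched as follows: an extra class is handled by passing one more feature vector through the already-trained semantic encoder, with no weights changed. Everything concrete here is a hypothetical stand-in: the single linear layer `semantic_encoder`, the identity `weight`, the three-dimensional class vectors, and the Euclidean nearest-prototype rule in `segment`.

```python
import numpy as np

def semantic_encoder(word_vectors, weight):
    """Stand-in for a trained FC semantic encoder: one linear layer."""
    return word_vectors @ weight

def segment(feature_map, prototypes):
    """Nearest-prototype labeling over an (H, W, C) feature map."""
    diff = feature_map[:, :, None, :] - prototypes[None, None, :, :]
    return (diff ** 2).sum(-1).argmin(-1)

weight = np.eye(3)  # hypothetical "trained" weights, kept fixed below
word_vecs = {"car": [1.0, 0.0, 0.0],
             "bicycle": [0.0, 1.0, 0.0],
             "person": [0.0, 0.0, 1.0]}

# 1x3 feature map whose pixels lie near car, bicycle, and person
feature_map = np.array([[[0.9, 0.1, 0.0],
                         [0.1, 0.8, 0.1],
                         [0.0, 0.2, 0.9]]])

seen = ["car", "bicycle"]
all_classes = seen + ["person"]  # unseen class supplied at inference only

protos_seen = semantic_encoder(np.array([word_vecs[c] for c in seen]), weight)
protos_all = semantic_encoder(np.array([word_vecs[c] for c in all_classes]), weight)

labels_seen = [seen[i] for i in segment(feature_map, protos_seen)[0]]
labels_all = [all_classes[i] for i in segment(feature_map, protos_all)[0]]
```

In this toy setup the third pixel is forced into a seen class when only car and bicycle prototypes are available, and is labeled person once the person prototype is added.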

FIG. 4 is a diagram showing the training structure of the visual encoder and the semantic encoder according to an embodiment of the present invention.

As described above, the visual encoder 100 and the semantic encoder 200 are artificial neural networks, and their weights are set through training.

One of the main features of the training structure of the present invention is that the visual encoder 100 and the semantic encoder 200 are trained while sharing the same loss. This training structure is what makes it possible to segment the feature map output by the visual encoder using the prototype vectors output by the semantic encoder 200.

Referring to FIG. 4, the visual encoder 100 and the semantic encoder 200 according to an embodiment of the present invention are trained together while sharing a prototype loss and a cross-entropy loss.

The prototype loss is the loss between the prototype vector of a specific class output by the semantic encoder 200 and the central value of the channel vectors of that class output by the visual encoder 100. That is, the prototype loss drives the visual encoder 100 and the semantic encoder 200 to be trained so that the central value of the channel vectors of a specific class and the prototype vector of that class become equal.

For example, through the prototype loss, the visual encoder 100 and the semantic encoder 200 are trained so that the central value of the channel vectors representing a car in the visual feature map equals the car prototype vector output by the semantic encoder 200.
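As a rough illustration of this class-center view of the prototype loss, the sketch below compares each class's central channel vector with its prototype. The use of the mean as the central value, Euclidean distance as the distance function, and the name `prototype_loss` are all assumptions made for illustration.

```python
import numpy as np

def prototype_loss(feature_map, label_map, prototypes):
    """Average distance between each class center and its prototype.

    feature_map: (H, W, C); label_map: (H, W) integer class ids;
    prototypes: (K, C). The class center is assumed to be the mean
    channel vector, and the distance is assumed to be Euclidean.
    """
    losses = []
    for c in np.unique(label_map):
        center = feature_map[label_map == c].mean(axis=0)  # (C,)
        losses.append(np.linalg.norm(center - prototypes[c]))
    return float(np.mean(losses))

feature_map = np.array([[[0.0, 0.0], [2.0, 0.0]],
                        [[1.0, 1.0], [1.0, 3.0]]])
label_map = np.array([[0, 0],
                      [1, 1]])
prototypes = np.array([[1.0, 0.0], [1.0, 2.0]])

# Class centers coincide with the prototypes here, so the loss is zero.
loss_aligned = prototype_loss(feature_map, label_map, prototypes)
# Shifting every prototype by (3, 4) moves each one 5 units from its center.
loss_shifted = prototype_loss(feature_map, label_map,
                              prototypes + np.array([3.0, 4.0]))
```

In joint training, the gradient of such a loss would flow into both encoders, pulling class centers and prototypes toward each other.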

The cross-entropy loss is used to train the encoders so that channel vectors and prototype vectors of the same class move closer together and channel vectors and prototype vectors of different classes move farther apart.

Let S be the set of all classes, p a pixel coordinate, R the visual feature map, c a class, and R_c the set of channel vectors of pixels belonging to a specific class in the visual feature map. The cross-entropy loss can then be computed as in Equation 1 below.

In Equation 1, ω_c is the weight vector of a specific class c, ω_j is the weight vector of each class j, and v(p) is the channel vector at pixel position p. The weight vectors can be set through training. Channel vectors of a given class are trained to have a large inner product with the weight vector of that class and a small inner product with the weight vectors of other classes. As a result, channel vectors and prototype vectors of different classes move apart in the embedding space, while those of the same class move closer together.
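The body of Equation 1 is not reproduced in this text. A standard softmax cross-entropy consistent with the symbols defined for it would look like the following sketch. The averaging over all labeled pixels and the symbol \mathcal{L}_{ce} are assumptions, not the patent's stated formula.

```latex
\mathcal{L}_{ce} = -\frac{1}{\sum_{c \in S} |R_c|} \sum_{c \in S} \sum_{p \in R_c}
\log \frac{\exp\big(\omega_c^{\top} v(p)\big)}{\sum_{j \in S} \exp\big(\omega_j^{\top} v(p)\big)}
```

In this form, a large inner product between v(p) and ω_c and small inner products with every other ω_j drive the loss toward zero, matching the behavior described for Equation 1.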

Meanwhile, according to a preferred embodiment of the present invention, a separate semantic segmentation map can be generated and used for training with the prototype loss. As described with reference to FIG. 1, at inference time the commercially obtainable class feature vectors are input to the semantic encoder 200. At training time, however, it is preferable to train with a semantic segmentation map rather than with the class feature vectors themselves. The semantic segmentation map is input to the semantic encoder 200, and the visual encoder 100 and the semantic encoder 200 are trained with the prototype loss and the cross-entropy loss.

FIG. 5 is a diagram showing the principle of generating the first semantic segmentation map used for training the semantic segmentation device according to an embodiment of the present invention.

FIG. 5(a) shows an input image, and FIG. 5(b) shows the principle of generating the first semantic segmentation map from the input image.

As shown in FIG. 5(a), the input image is a photograph of a table and a person. The input image of FIG. 5(a) is divided into three classes: table, person, and background.

Referring to FIG. 5(b), the class regions in the input image are identified first. Because the input image is a training image whose correct classes are already known, its class regions can be identified.

Once the class regions have been identified in the input image, a prepared feature vector is applied to each class region. That is, the feature vector for the table, the feature vector for the person, and the feature vector for the background are applied. Each feature vector is obtained from a commercially available database.

By applying the feature vector corresponding to the class of each class region to the input image in this way, the first semantic segmentation map 500 is generated, and the first semantic segmentation map 500 is input to the semantic encoder.
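The construction of the first semantic segmentation map can be sketched as a lookup that places the class feature vector at every pixel of a ground-truth label map. The function name build_semantic_map, the array layout, and the 2-dimensional toy feature vectors are assumptions for illustration; real feature vectors would come from the database mentioned above.

```python
import numpy as np

def build_semantic_map(label_map, class_vectors):
    """Place the class feature vector at every pixel of its class region.

    label_map:     (H, W) integer class index per pixel (ground truth)
    class_vectors: (C, K) one feature vector per class
    returns:       (H, W, K) map of feature vectors
    """
    return class_vectors[label_map]  # integer-array indexing does the per-pixel lookup

# Toy 2x3 image with classes 0 = table, 1 = person, 2 = background.
label_map = np.array([[0, 0, 1],
                      [2, 2, 1]])
class_vectors = np.array([[1.0, 0.0],   # table (made-up vector)
                          [0.0, 1.0],   # person (made-up vector)
                          [0.5, 0.5]])  # background (made-up vector)

first_map = build_semantic_map(label_map, class_vectors)
```

Every pixel inside a class region carries exactly the same vector, so the map changes abruptly at region boundaries.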

The semantic encoder 200 receives the first semantic segmentation map 500 and outputs a prototype vector for the person, a prototype vector for the table, and a prototype vector for the background.

When the first semantic segmentation map 500 is used, the first prototype loss (L_center) can be calculated as in Equation 2 below.

In Equation 2 above, c is a class, S is the total set of classes, p is a pixel, Rc is the set of pixels of a specific class, v(p) is the channel vector at pixel position p in the visual feature map, and μ(p) is the prototype vector at pixel position p in the first semantic segmentation map. In addition, d(a,b) is a function representing the distance between a and b, and training with Equation 2 drives the sum of v(p) and the sum of μ(p) to become equal.

Ultimately, through the first prototype loss of Equation 2, training proceeds so that the central value of the channel vectors of a specific class becomes equal to the prototype vector of that class.
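A minimal sketch of a prototype loss of this kind is shown below. Equation 2 is not reproduced in this text, so the choice of Euclidean distance for d, the averaging over pixels, and the name prototype_loss are assumptions for illustration only.

```python
import numpy as np

def prototype_loss(v, mu):
    """Average distance d(v(p), mu(p)) over labeled pixels (assumed form).

    v:  (N, D) channel vectors from the visual feature map
    mu: (N, D) prototype vectors at the same pixel positions
    """
    return np.linalg.norm(v - mu, axis=1).mean()  # d = Euclidean distance (assumption)

# Two pixels of one class sharing the prototype (2, 0).
mu = np.array([[2.0, 0.0], [2.0, 0.0]])
v_far = np.array([[1.0, 0.0], [3.0, 0.0]])   # each pixel is 1 unit from the prototype
v_on = mu.copy()                             # channel vectors coincide with the prototype

loss_far = prototype_loss(v_far, mu)   # 1.0
loss_on = prototype_loss(v_on, mu)     # 0.0
```

The loss reaches zero only when the channel vectors of a class coincide with the prototype vector of that class, which is the direction in which both encoders are pushed during training.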

Through this simultaneous training of the visual encoder 100 and the semantic encoder 200, the feature map output from the visual encoder 100 can be semantically segmented using the prototype vectors of the semantic encoder 200 as references.
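The segmentation step itself, assigning each pixel the class of its most similar prototype vector, can be sketched as a nearest-prototype lookup. Using Euclidean distance as the similarity measure and the name assign_classes are assumptions for illustration; the text only requires that the most similar prototype be chosen.

```python
import numpy as np

def assign_classes(feature_map, prototypes):
    """Label each pixel with the index of its nearest prototype vector.

    feature_map: (H, W, D) channel vector per pixel
    prototypes:  (C, D)    one prototype vector per class (same length D)
    returns:     (H, W)    predicted class index per pixel
    """
    diff = feature_map[:, :, None, :] - prototypes[None, None, :, :]  # (H, W, C, D)
    dist = np.linalg.norm(diff, axis=-1)                              # (H, W, C)
    return dist.argmin(axis=-1)

prototypes = np.array([[0.0, 0.0], [4.0, 4.0]])
feature_map = np.array([[[0.5, 0.2], [3.8, 4.1]],
                        [[3.0, 3.5], [1.0, 0.0]]])

pred = assign_classes(feature_map, prototypes)
```

Because classification only compares against prototype vectors, adding a class that was not seen in training amounts to adding one more row to prototypes, with no classifier retraining.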

However, a bias problem can arise when training with the first semantic segmentation map. Because the visual feature map output by the visual encoder 100 is produced by a convolutional neural network (CNN), its pixel values are continuous. The values output from the semantic encoder 200, by contrast, are discrete, as shown in FIG. 5(b). This mismatch can hinder training and cause the bias problem.

Accordingly, the present invention also proposes a structure that trains with a different form of prototype loss (the second prototype loss) using a second semantic segmentation map.

FIG. 6 is a diagram showing the principle of generating the second semantic segmentation map used for training the semantic segmentation device according to an embodiment of the present invention.

In FIG. 6, the input image is the same as in FIG. 5, and the input image is divided into class regions in the same way.

Once the class regions have been identified in the input image, a reduced image 600 is generated by shrinking the input image. The feature vector for each class is then applied to the reduced image. Whereas the first semantic segmentation map applies the class feature vectors to the image at its original size, the second semantic segmentation map applies them to the reduced image 600.

After the class feature vectors have been applied to the reduced image 600, the resulting image is enlarged by linear interpolation to generate the second semantic segmentation map 610. Because the image is enlarged by linear interpolation, the feature vectors in the boundary regions vary continuously.

Accordingly, the prototype vectors output from the semantic encoder can also take continuous values in the boundary regions. When training is performed with the second semantic segmentation map 610, the prototype vectors are therefore continuous in the boundary regions, like the feature map of the visual encoder, which mitigates the bias problem that arises during training.
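The reduce, apply, and enlarge procedure can be sketched as follows. The strided subsampling used to shrink the label map, the corner-aligned bilinear interpolation, and the names bilinear_resize and build_smoothed_map are assumptions for illustration; the text specifies only that the image is reduced and then enlarged by linear interpolation.

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Bilinear enlargement of an (h, w, K) array on a corner-aligned grid."""
    h, w, _ = x.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bottom = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def build_smoothed_map(label_map, class_vectors, factor):
    """Reduce the labeled image, apply class vectors, enlarge back."""
    h, w = label_map.shape
    small = label_map[::factor, ::factor]       # reduced image (strided sampling; assumption)
    small_vectors = class_vectors[small]        # apply class feature vectors
    return bilinear_resize(small_vectors, h, w) # enlarge to the original size

# 4x4 image: left half is class 0, right half is class 1.
label_map = np.array([[0, 0, 1, 1]] * 4)
class_vectors = np.array([[0.0, 0.0], [1.0, 1.0]])  # made-up feature vectors

second_map = build_smoothed_map(label_map, class_vectors, factor=2)
```

In this toy case each row of the enlarged map moves from the class 0 vector to the class 1 vector through intermediate values (0, 1/3, 2/3, 1 in each component) instead of jumping at the boundary, which is the continuity the second semantic segmentation map is meant to provide.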

Equation 3 below shows how the second prototype loss (L_bar) is calculated using the second semantic segmentation map 610.

Compared with the first prototype loss of Equation 2, the second prototype loss differs in that a different prototype vector is used in place of μ(p): the prototype vector output when the second semantic segmentation map is input to the semantic encoder.

FIG. 7 is a diagram for conceptually explaining the prototype loss according to an embodiment of the present invention.

Referring to FIG. 7, a feature vector for a car and a feature vector for a bicycle are input to the semantic encoder 200, and the semantic encoder 200 outputs a prototype vector 700 for the car and a prototype vector 710 for the bicycle.

Meanwhile, an input image is input to the visual encoder 100, and the visual encoder outputs a channel vector for each pixel. These channel vectors can be represented in the embedding space 720. The channel vectors representing the car and the channel vectors representing the bicycle are each projected into the embedding space 720, and the prototype loss is computed so that the central value of the channel vectors representing the car becomes equal to the prototype vector for the car. The prototype loss for the bicycle is computed in the same way.

In short, the embedding space 720 shown in FIG. 7 can be regarded as a joint embedding space into which the prototype vectors of the semantic encoder 200 and the channel vectors of the visual encoder 100 are projected together.

Meanwhile, referring again to FIG. 4, the semantic encoder 200 is additionally trained with a semantic loss. The semantic loss penalizes the difference between the distances between the input feature vectors of the classes and the distances between the prototype vectors of the classes output from the semantic encoder. Using the semantic loss, the semantic encoder 200 is trained so that the distances between the input class feature vectors become equal to the distances between the output class prototype vectors.

For example, dogs and cats are different classes but are similar in shape, so the distance between their feature vectors will be relatively small. Dogs and people, however, are less similar in shape, so the distance between their feature vectors will be relatively large. Training with the semantic loss ensures that these distances between feature vectors are also reflected in the prototype vectors.

FIG. 8 is a diagram for explaining the semantic loss according to an embodiment of the present invention.

Referring to FIG. 8, the embedding space 800 of the feature vectors and the embedding space 810 of the prototype vectors are shown. The classes shown are table, cat, dog, and person.

The feature vectors of the classes, obtained from a commercial database, are projected into the feature-vector embedding space 800, and the prototype vectors output when those feature vectors are input to the semantic encoder 200 are projected into the prototype-vector embedding space 810.

The difference between the distance between two classes in the feature-vector embedding space 800 and the distance between the same two classes in the prototype-vector embedding space 810 can be defined as the semantic loss. Specifically, the semantic loss can be defined as in Equation 4 below.

In Equation 4 above, i and j denote classes, S denotes the set of classes, r_ij denotes the distance between class i and class j in the feature-vector embedding space, and its counterpart denotes the distance between class i and class j in the prototype-vector embedding space.
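A loss built from these pairwise class distances can be sketched as below. Equation 4 is not reproduced in this text, so the mean absolute difference between the two distance matrices, the Euclidean distance, and the names pairwise_distances and semantic_loss are assumptions chosen only to illustrate the idea of matching distances across the two embedding spaces.

```python
import numpy as np

def pairwise_distances(x):
    """(C, K) vectors -> (C, C) matrix of Euclidean distances between classes."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def semantic_loss(features, prototypes):
    """Mean absolute gap between input-space and prototype-space distances (assumed form)."""
    r = pairwise_distances(features)          # r_ij in the feature-vector space
    r_proto = pairwise_distances(prototypes)  # counterpart in the prototype space
    return np.abs(r - r_proto).mean()

features = np.array([[0.0, 0.0], [3.0, 4.0]])          # class distance 5
protos_same = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 5.0]])  # class distance 5
protos_near = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # class distance 1

loss_same = semantic_loss(features, protos_same)   # 0.0
loss_near = semantic_loss(features, protos_near)   # 2.0
```

The two embedding spaces may have different dimensions, as in the toy example, because only the distances between classes are compared.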

FIG. 9 is a flowchart showing a training method of the zero-shot semantic segmentation device according to an embodiment of the present invention. The flowchart of FIG. 9 shows the case in which training is performed using the second semantic segmentation map.

Referring to FIG. 9, an input image is input to the visual encoder 100 to generate a visual feature map (step 900).

After the input image has been divided into class regions, the input image is reduced (step 902).

The feature vector corresponding to each class is applied to the reduced input image (step 904).

The reduced image to which the feature vectors have been applied is enlarged by linear interpolation back to its original size to generate the second semantic segmentation map (step 906).

The second semantic segmentation map is input to the semantic encoder 200, which outputs a prototype vector for each class (step 908).

The second prototype loss is calculated using the per-class channel vectors of the visual feature map and the per-class prototype vectors output from the semantic encoder 200 (step 910). The second prototype loss can be calculated as in Equation 3.

Meanwhile, the cross-entropy loss is calculated using the channel vectors of the visual feature map (step 912). The cross-entropy loss can be calculated as in Equation 1. As described above, the cross-entropy loss is used to train channel vectors of the same class to move closer together in the embedding space and channel vectors of different classes to move farther apart.

The semantic loss is calculated using the distances between the class feature vectors input to the semantic encoder and the distances between the class prototype vectors output from the semantic encoder (step 914). The semantic loss can be calculated as in Equation 4.

The visual encoder 100 and the semantic encoder 200 are trained simultaneously using the second prototype loss and the cross-entropy loss (step 916). That is, the weights of the visual encoder 100 and the semantic encoder 200 are updated at the same time using the second prototype loss and the cross-entropy loss. The weights of the semantic encoder 200 are updated with the semantic loss reflected as well.

The method according to the present invention can be implemented as a computer program stored on a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible.

Therefore, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

Claims

A semantic segmentation device comprising:
a visual encoder that receives an input image and outputs a visual feature map through a neural network operation;
a semantic encoder that receives a feature vector for each class and outputs a prototype vector for each class through a neural network operation; and
a semantic segmentation unit that compares the prototype vector for each class with the channel vector of each pixel of the visual feature map to designate a class for each pixel of the visual feature map,
wherein the semantic segmentation unit designates the class corresponding to the prototype vector most similar to the channel vector of a specific pixel as the class of that pixel,
the prototype vector and the channel vector are set to the same length, and the visual encoder and the semantic encoder are trained simultaneously by sharing at least one common loss,
the loss shared by the visual encoder and the semantic encoder includes a prototype loss, the prototype loss corresponding to the loss between the prototype vector of a specific class output from the semantic encoder and the central value of the channel vectors of that class in the visual feature map, and
in order to calculate the prototype loss, a first semantic segmentation map generated by applying the feature vector for each class to the input image is input to the semantic encoder.

delete

According to claim 1,
A semantic segmentation device, characterized in that, based on the prototype loss, training is performed so that the central value of the channel vectors of a specific class output from the visual encoder becomes equal to the prototype vector of that class output from the semantic encoder.

According to claim 1,
The loss shared by the visual encoder and the semantic encoder includes a cross-entropy loss,
A semantic segmentation device, characterized in that, by the cross-entropy loss, the visual encoder is trained so that channel vectors of the same class are located relatively close together in the embedding space and channel vectors of different classes are located relatively far apart in the embedding space.

According to claim 1,
A semantic segmentation device, characterized in that the semantic encoder is trained using a semantic loss so that the distances between the class feature vectors input to the semantic encoder and the distances between the class prototype vectors output from the semantic encoder become equal.

delete

According to claim 1,
A semantic segmentation device, characterized in that the prototype loss is calculated according to the following equation.

In the above equation, L_center is the prototype loss, c is a class, S is the total set of classes, p is a pixel, Rc is the set of pixels of a specific class, v(p) is the channel vector at pixel position p in the visual feature map, μ(p) is the prototype vector output from the semantic encoder when the feature vector at pixel position p in the first semantic segmentation map is input, and d() is a function that outputs the distance between its two arguments.

According to claim 3,
A semantic segmentation device, characterized in that, in order to calculate the prototype loss, a second semantic segmentation map is input to the semantic encoder, the second semantic segmentation map being obtained by applying the feature vector for each class to the input image, reducing the image to which the class feature vectors have been applied, and enlarging the reduced image to the original image size by linear interpolation.

According to claim 8,
A semantic segmentation device, characterized in that the prototype loss is calculated according to the following equation.

In the above equation, L_bar is the prototype loss, c is a class, S is the total set of classes, p is a pixel, Rc is the set of pixels of a specific class, v(p) is the channel vector at pixel position p in the visual feature map, the remaining term is the prototype vector output from the semantic encoder when the feature vector at pixel position p in the second semantic segmentation map is input, and d() is a function that outputs the distance between its two arguments.

A semantic segmentation method comprising:
step (a) of receiving an input image and outputting a visual feature map through a neural network operation of a visual encoder;
step (b) of receiving a feature vector for each class and outputting a prototype vector for each class through a neural network operation of a semantic encoder; and
step (c) of designating a class for each pixel of the visual feature map by comparing the prototype vector for each class with the channel vector of each pixel of the visual feature map,
wherein, in step (c), the class corresponding to the prototype vector most similar to the channel vector of a specific pixel is designated as the class of that pixel,
the prototype vector and the channel vector are set to the same length, and the visual encoder and the semantic encoder are trained simultaneously by sharing at least one common loss,
the loss shared by the visual encoder and the semantic encoder includes a prototype loss, the prototype loss corresponding to the loss between the prototype vector of a specific class output from the semantic encoder and the central value of the channel vectors of that class in the visual feature map, and
in order to calculate the prototype loss, a first semantic segmentation map generated by applying the feature vector for each class to the input image is input to the semantic encoder.

delete

According to claim 10,
A semantic segmentation method, characterized in that, based on the prototype loss, training is performed so that the central value of the channel vectors of a specific class output from the visual encoder becomes equal to the prototype vector of that class output from the semantic encoder.

According to claim 10,
The loss shared by the visual encoder and the semantic encoder includes a cross-entropy loss,
A semantic segmentation method, characterized in that, by the cross-entropy loss, the visual encoder is trained so that channel vectors of the same class are located relatively close together in the embedding space and channel vectors of different classes are located relatively far apart in the embedding space.

According to claim 10,
A semantic segmentation method, characterized in that the semantic encoder is trained using a semantic loss so that the distances between the class feature vectors input to the semantic encoder and the distances between the class prototype vectors output from the semantic encoder become equal.

delete

According to claim 10,
A semantic segmentation method, characterized in that the prototype loss is calculated according to the following equation.

In the above equation, L_center is the prototype loss, c is a class, S is the total set of classes, p is a pixel, Rc is the set of pixels of a specific class, v(p) is the channel vector at pixel position p in the visual feature map, μ(p) is the prototype vector output from the semantic encoder when the feature vector at pixel position p in the first semantic segmentation map is input, and d() is a function that outputs the distance between its two arguments.

According to claim 12,
A semantic segmentation method, characterized in that, in order to calculate the prototype loss, a second semantic segmentation map is input to the semantic encoder, the second semantic segmentation map being obtained by applying the feature vector for each class to the input image, reducing the image to which the class feature vectors have been applied, and enlarging the reduced image to the original image size by linear interpolation.

According to claim 17,
A semantic segmentation method, characterized in that the prototype loss is calculated according to the following equation.

In the above equation, L_bar is the prototype loss, c is a class, S is the total set of classes, p is a pixel, Rc is the set of pixels of a specific class, v(p) is the channel vector at pixel position p in the visual feature map, the remaining term is the prototype vector output from the semantic encoder when the feature vector at pixel position p in the second semantic segmentation map is input, and d() is a function that outputs the distance between its two arguments.