KR20220064857A

KR20220064857A - Segmentation method and segmentation device

Info

Publication number: KR20220064857A
Application number: KR1020200180688A
Authority: KR
Inventors: 김지혜; 예 레이몬드; 알렉스 슈빙; 팅 위안
Original assignee: 삼성전자주식회사; 더 보드 오브 트러스티즈 오브 더 유니버시티 오브 일리노이
Priority date: 2020-11-12
Filing date: 2020-12-22
Publication date: 2022-05-19

Abstract

Disclosed are a segmentation method and a segmentation device. The segmentation method includes: receiving image frames including a current frame and at least one adjacent frame adjacent to the current frame; calculating a feature map that aggregates the image frames, based on temporal information between the current frame and the adjacent frame; extracting, from the feature map, a feature of a region of interest corresponding to each instance included in the current frame; predicting an object class corresponding to the region of interest, based on the feature of the region of interest; and segmenting the instances by iteratively correcting an amodal mask predicted for the object class, based on the feature of the region of interest.

Description

Segmentation method and segmentation device

The following embodiments relate to a segmentation method and a segmentation apparatus.

Human perception can naturally infer an object even when the object is not completely visible and only part of it can be seen because it is occluded by other objects. For example, when changing lanes on a highway, a driver can grasp the situation by unconsciously completing the occluded part of a following vehicle after seeing only part of it in the rearview mirror. Amodal segmentation, which segments not only the objects visible in a single image but also partially occluded objects, can improve both the understanding of objects from a scene understanding perspective and the prediction accuracy for the next frame.

A segmentation method according to an embodiment includes: receiving image frames including a current frame and at least one adjacent frame adjacent to the current frame; calculating a feature map that aggregates the image frames, based on temporal information between the current frame and the adjacent frame; extracting, from the feature map, a feature of a region of interest (ROI) corresponding to each of the instances included in the current frame; predicting an object class corresponding to the region of interest, based on the feature of the region of interest; and segmenting the instances by iteratively correcting an amodal mask predicted for the object class, based on the feature of the region of interest.

The segmenting of the instances may include: predicting an amodal mask corresponding to the object class based on the feature of the region of interest; and segmenting the instances by repeating an operation of applying the predicted amodal mask to the feature of the region of interest.

The predicting of the amodal mask may include predicting the amodal mask corresponding to a target instance, which corresponds to the object class, by propagating the feature of the region of interest from a visible area of the target instance to an occluded area of the target instance.

The predicting of the amodal mask may include: spatially propagating the feature of the region of interest by passing the feature corresponding to the visible area through convolution layers so that the receptive field extends into the occluded area; expanding the spatial dimension of the feature of the region of interest by deconvolution layers; and predicting the amodal mask corresponding to the target instance in the expanded spatial dimension.

The segmenting of the instances may include: iteratively predicting the amodal mask by spatially propagating the feature of the region of interest from a visible area of a target instance corresponding to the object class to an occluded area of the target instance; predicting a modal mask corresponding to the visible area, based on the feature of the region of interest; predicting an occlusion mask corresponding to the occluded area, based on the feature of the region of interest; and segmenting the instances based on a combination of the amodal mask, the modal mask, and the occlusion mask.

The segmenting of the instances based on the combination of the amodal mask, the modal mask, and the occlusion mask may include: calculating a first confidence corresponding to a per-pixel probability of the modal mask; calculating a second confidence corresponding to a per-pixel probability of the occlusion mask; weighting the amodal mask by a confidence map based on at least one of the first confidence and the second confidence; and segmenting the instances by the weighted amodal mask.

The segmenting of the instances may include: predicting an initial attention mask corresponding to the object class, based on the feature of the region of interest; extracting an initial mask corresponding to the object class from the initial attention mask; generating the amodal mask by repeatedly applying the feature of the region of interest to the initial mask; and segmenting the instances by the amodal mask.

The generating of the amodal mask may include: performing first masking by applying the initial mask to the feature of the region of interest; predicting an attention mask corresponding to the object class, based on the first-masked feature; performing second masking by applying the feature of the region of interest to the attention mask; and generating the amodal mask based on the second-masked feature.

The extracting of the feature of the region of interest may include extracting the feature of the region of interest from the feature map by a region proposal network (RPN).

The extracting of the feature of the region of interest may include: selecting, from among the instances, one instance that includes an occluded area in the current frame; and extracting, by the region proposal network, the feature of the region of interest corresponding to the selected instance from the feature map.

The extracting of the feature of the region of interest corresponding to the selected instance may include: filtering bounding boxes corresponding to each of the instances by using a cascade structure based on a non-maximum suppression (NMS) technique, wherein the bounding boxes include locations of the instances and objectness scores corresponding to the instances; and extracting the feature of the region of interest corresponding to the selected instance from the feature map, based on the filtered bounding boxes.

The predicting of the object class may include: calculating scores of all object classes included in the region of interest by bounding boxes corresponding to each of the instances, based on the feature of the region of interest; and predicting the object class based on the scores of all the object classes.

The calculating of the feature map may include: extracting features corresponding to the image frames; spatially aligning the features with the current frame by warping between the current frame and the adjacent frame; and calculating the feature map by aggregating the aligned features.

The spatially aligning of the features with the current frame may include: estimating an optical flow representing pixel-level motion between the current frame and the adjacent frame; and spatially aligning the features with the current frame through the warping, based on the optical flow.

According to an embodiment, a segmentation apparatus includes: a communication interface configured to receive image frames including a current frame and at least one adjacent frame adjacent to the current frame; and a processor configured to calculate a feature map that aggregates the image frames based on temporal information between the current frame and the adjacent frame, extract, from the feature map, a feature of a region of interest corresponding to each of the instances included in the current frame, predict an object class corresponding to the region of interest based on the feature of the region of interest, and segment the instances by iteratively correcting an amodal mask predicted for the object class based on the feature of the region of interest.

The processor may predict an amodal mask corresponding to the object class based on the feature of the region of interest, and may segment the instances by repeating an operation of applying the predicted amodal mask to the feature of the region of interest.

The processor may predict the amodal mask corresponding to a target instance, which corresponds to the object class, by propagating the feature of the region of interest from a visible area of the target instance to an occluded area of the target instance.

The processor may iteratively predict the amodal mask by spatially propagating the feature of the region of interest from a visible area of a target instance corresponding to the object class to an occluded area of the target instance, predict a modal mask corresponding to the visible area based on the feature of the region of interest, predict an occlusion mask corresponding to the occluded area based on the feature of the region of interest, and segment the instances based on a combination of the amodal mask, the modal mask, and the occlusion mask.

The processor may predict an initial attention mask corresponding to the object class based on the feature of the region of interest, extract an initial mask corresponding to the object class from the initial attention mask, generate the amodal mask by repeatedly applying the feature of the region of interest to the initial mask, and segment the instances by the amodal mask.

FIG. 1 is a flowchart illustrating a segmentation method according to an embodiment.
FIG. 2 is a diagram illustrating a framework of a segmentation apparatus according to an embodiment.
FIG. 3 is a diagram for explaining warping according to an embodiment.
FIG. 4 is a diagram for explaining a process of calculating a feature map according to an embodiment.
FIG. 5 is a diagram illustrating Intersection over Union (IoU) histograms for a bounding box for amodal segmentation and a bounding box for modal segmentation according to an embodiment.
FIG. 6 is a diagram for explaining a method of spatially propagating a feature of a region of interest by expanding a receptive field according to an embodiment.
FIG. 7 is a diagram for explaining a method of iteratively correcting an amodal mask predicted for an object class according to an embodiment.
FIG. 8 is a block diagram of a segmentation apparatus according to an embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustration only and may be modified and implemented in various forms. Accordingly, an actual implementation is not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, and substitutes that fall within the technical idea described through the embodiments.

Although terms such as "first" and "second" may be used to describe various components, these terms should be interpreted only as distinguishing one component from another. For example, a first component may be termed a second component, and similarly, a second component may be termed a first component.

When a component is referred to as being "connected to" another component, it may be directly connected or coupled to the other component, but it should be understood that another component may be present in between.

A singular expression includes the plural expression unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to indicate that a described feature, number, step, operation, component, part, or combination thereof exists, and should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in this specification.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. In the description with reference to the accompanying drawings, the same components are assigned the same reference numerals regardless of the drawing, and redundant descriptions thereof are omitted.

FIG. 1 is a flowchart illustrating a segmentation method according to an embodiment. Referring to FIG. 1, a process is shown in which a segmentation apparatus according to an embodiment segments the instances included in a current frame through operations 110 to 150.

In operation 110, the segmentation apparatus receives image frames including a current frame and at least one adjacent frame adjacent to the current frame. The at least one adjacent frame may be a previous frame adjacent to the current frame or a next frame adjacent to the current frame. For example, when the current frame is the image frame at time t, the adjacent frame may be the image frame at time t-1 or t-2, or the image frame at time t+1. The image frames may include various objects such as, for example, a face, a building, a car, or a person, but are not limited thereto. The image frames may be received from outside the segmentation apparatus, or may be captured by the segmentation apparatus through an image sensor or the like.

In operation 120, the segmentation apparatus calculates a feature map that aggregates the image frames, based on temporal information between the current frame and the adjacent frame. The temporal information may be understood as the difference between images that arises from the time difference between the current frame and the adjacent frame. The temporal information may be based on, for example, an optical flow, but is not limited thereto. The feature map may correspond to, for example, the aggregated feature described later. To better extract the temporal information, the segmentation apparatus may obtain it by using various forms of deep nets into which various flow-based methods are integrated. For example, an optical flow may be used as an input to an additional deep net for video action recognition.

In operation 120, the segmentation apparatus may extract features corresponding to the image frames. The segmentation apparatus may extract the features by a neural network or an encoder trained to extract features from image frames. The segmentation apparatus may spatially align the features with the current frame by warping between the current frame and the adjacent frame. The segmentation apparatus may incorporate the optical flow by warping an image or a feature so that it is aligned with the current frame of interest. The method by which the segmentation apparatus performs the warping is described in more detail with reference to FIG. 3 below. The segmentation apparatus may calculate the feature map by aggregating the features that have been spatially aligned with the current frame by the warping. The process by which the segmentation apparatus calculates the feature map is described in more detail with reference to FIG. 4 below.
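The alignment and aggregation described for operation 120 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: it assumes a single feature channel stored as nested lists, a per-pixel flow `(dx, dy)` pointing from each current-frame position to the matching adjacent-frame position, bilinear sampling with border clamping, and plain averaging as the aggregation. The names `bilinear_sample`, `warp_to_current`, and `aggregate` are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]                 # one feature channel, indexed [y][x]
Flow = List[List[Tuple[float, float]]]   # (dx, dy) from current to adjacent frame


def bilinear_sample(feat: Grid, x: float, y: float) -> float:
    """Sample feat at a real-valued (x, y), clamping coordinates to the border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp_to_current(adjacent_feat: Grid, flow: Flow) -> Grid:
    """Backward-warp adjacent-frame features onto the current frame's grid."""
    h, w = len(adjacent_feat), len(adjacent_feat[0])
    return [
        [bilinear_sample(adjacent_feat, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(w)]
        for y in range(h)
    ]


def aggregate(current_feat: Grid, warped_feats: List[Grid]) -> Grid:
    """Average the current features with the aligned adjacent features (assumed rule)."""
    feats = [current_feat] + warped_feats
    h, w = len(current_feat), len(current_feat[0])
    return [[sum(f[y][x] for f in feats) / len(feats) for x in range(w)]
            for y in range(h)]


# Toy example: the object moved one pixel to the right between t-1 and t.
prev_feat = [[0.0, 1.0, 0.0, 0.0]]
curr_feat = [[0.0, 0.0, 1.0, 0.0]]
flow_t_to_prev = [[(-1.0, 0.0)] * 4]
aligned = warp_to_current(prev_feat, flow_t_to_prev)
feature_map = aggregate(curr_feat, [aligned])
```

In the toy example the warped previous-frame feature lands on the same position as the current-frame feature, so averaging reinforces the response instead of blurring it across two positions.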

In operation 130, the segmentation apparatus extracts, from the feature map, a feature of a region of interest (ROI) corresponding to each of the instances included in the current frame. The segmentation apparatus may extract the feature of the region of interest from the feature map by, for example, a region proposal network (RPN). The region proposal network may receive image(s) of various sizes and output rectangular region proposals. The region proposal network may retrieve a set of candidate bounding boxes corresponding to the region proposals and objectness scores corresponding to those region proposals. Each object output by the region proposal network may have an objectness score. The region proposal network may be configured as, for example, a fully convolutional network (FCN). The region proposal network may share the same feature map by simultaneously obtaining a region bound (or bounding box) and an objectness score (or ROI pooling) at each position in the grid of the input data. To generate region proposals, the region proposal network may slide a small network over the convolutional feature map output by the last convolution layer that shares computation. The small network may take as input an n x n spatial window of the convolutional feature map. Each sliding window may be mapped to a lower-dimensional feature through an intermediate layer.

In operation 130, the segmentation apparatus may select, from among the instances included in the current frame, one instance that includes an occluded area in the current frame. The segmentation apparatus may extract, by the region proposal network, the feature of the region of interest corresponding to the selected instance from the feature map. The segmentation apparatus may filter the bounding boxes corresponding to each of the instances by using, for example, a cascade structure based on a non-maximum suppression (NMS) technique. Here, the bounding boxes may include at least one of identification information indicating what the corresponding instance is, a location, and an objectness score corresponding to the instances. The non-maximum suppression technique may, for example, first sort all proposals by intersection over union (IoU) value, then compare the overlap between the proposal with the highest ROI score and each of the other proposals, and remove any proposal whose overlap is greater than or equal to a specific threshold. By repeating this process, multiple results that are output redundantly for the same target can be deleted. Here, IoU may correspond to the area of overlap divided by the area of union. The segmentation apparatus may extract the feature of the region of interest corresponding to the selected instance from the feature map, based on the filtered bounding boxes. In other words, the segmentation apparatus may extract the feature of the region of interest corresponding to the selected instance from the feature map by using the bounding box that finally remains after the filtering.
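The IoU definition and the suppression loop described above can be sketched as follows. This is the commonly used single-stage greedy variant, which orders proposals by their scores; it is not the cascade structure of the embodiment, and the box format `(x1, y1, x2, y2)`, the function names, and the default threshold of 0.5 are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Area of overlap divided by area of union."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)          # highest-scoring remaining proposal
        keep.append(best)
        # Remove proposals whose overlap with it is at or above the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this example the first two boxes overlap with an IoU of 81/119 (about 0.68), so the lower-scoring one is suppressed, while the third box does not overlap either of them and is kept.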

In operation 140, the segmentation apparatus predicts an object class corresponding to the region of interest, based on the feature of the region of interest extracted in operation 130. The segmentation apparatus may calculate scores of all object classes included in the region of interest by the bounding boxes corresponding to each of the instances, based on the feature of the region of interest. The segmentation apparatus may predict the object class based on the scores of all the object classes. Here, the scores of the object classes may be class information corresponding to object-level features, that is, values indicating the probability that the corresponding object belongs to the corresponding class. The segmentation apparatus may predict the object class having the highest score among the scores of all the object classes as the object class corresponding to the region of interest.
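Selecting the highest-scoring class in operation 140 can be illustrated with a short sketch. It assumes that the classifier produces one raw score (logit) per class and that a softmax turns those scores into probabilities; the text only states that the scores indicate class probabilities, so the softmax and the names `softmax` and `predict_class` are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits: Sequence[float],
                  class_names: Sequence[str]) -> Tuple[str, float]:
    """Return the class with the highest probability and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Hypothetical scores for one region of interest.
label, prob = predict_class([0.1, 2.0, -1.0], ["person", "car", "building"])
```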

In operation 150, the segmentation apparatus segments the instances by iteratively correcting an amodal mask predicted for the object class, based on the feature of the region of interest extracted in operation 130. Here, the segmentation may correspond to, for example, instance segmentation based on Mask R-CNN (Regions with CNN). Instance segmentation may be understood as distinguishing and segmenting each of the instances of every kind that correspond to the region of interest.

In operation 150, the segmentation apparatus may predict an amodal mask corresponding to the object class, based on the feature of the region of interest. Here, the "amodal mask" may correspond to a segmentation mask of an instance that includes both the visible area and the occluded area. The amodal mask may correspond to, for example, a binary mask in which "1" is marked at pixel positions belonging to the instance and "0" is marked at pixel positions not belonging to the instance.
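The binary-mask convention above can be made concrete with a toy example. The sketch assumes ground-truth style masks stored as nested lists of 0 and 1, and builds the amodal mask as the pixel-wise union of a visible (modal) part and an occluded part; the helper name `union_mask` and the 3 x 4 masks are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask, indexed [y][x]


def union_mask(visible: Mask, occluded: Mask) -> Mask:
    """Mark 1 wherever the instance is present, whether visible or occluded."""
    return [[1 if (v or o) else 0 for v, o in zip(v_row, o_row)]
            for v_row, o_row in zip(visible, occluded)]


# A 2 x 3 object whose rightmost column is hidden behind another object.
visible = [
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
occluded = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
]
amodal = union_mask(visible, occluded)
```

Here `amodal` covers the full 2 x 3 extent of the object, whereas `visible` alone covers only the two columns that can actually be seen.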

The segmentation apparatus may predict the amodal mask corresponding to a target instance, which corresponds to the object class, by propagating the feature of the region of interest from a visible area of the target instance to an occluded area.

For example, the segmentation apparatus may spatially propagate the feature of the region of interest by passing the feature corresponding to the visible area through convolution layers so that the receptive field extends into the occluded area. The method by which the segmentation apparatus spatially propagates the feature of the region of interest is described in detail with reference to FIG. 6 below. The segmentation apparatus may expand the spatial dimension of the feature of the region of interest by deconvolution layers to, for example, 28 x 28, as indicated in FIG. 7 below. The segmentation apparatus may predict the amodal mask corresponding to the target instance in the expanded spatial dimension. The segmentation apparatus may segment the instances by repeating an operation of applying the predicted amodal mask to the feature of the region of interest. The process of segmenting the instances by repeatedly predicting the amodal mask and applying the predicted amodal mask to the feature of the region of interest is described in more detail with reference to FIG. 7 below.

Alternatively, according to an embodiment, the segmentation apparatus may segment the instances in operation 150 through the following process.

The segmentation apparatus may iteratively predict the a-modal mask by spatially propagating the feature of the ROI from the visible region of the target instance corresponding to the object class to the occluded region of the target instance. The segmentation apparatus may predict a modal mask corresponding to the visible region based on the feature of the ROI. The segmentation apparatus may predict an occlusion mask corresponding to the occluded region based on the feature of the ROI. The segmentation apparatus may segment the instances based on a combination of the a-modal mask, the modal mask, and the occlusion mask. The segmentation apparatus may calculate, for example, a first confidence corresponding to the per-pixel probability of the modal mask. The segmentation apparatus may calculate a second confidence corresponding to the per-pixel probability of the occlusion mask. The segmentation apparatus may weight the a-modal mask by a confidence map based on at least one of the first confidence and the second confidence. The first confidence and the second confidence used for the weighting may be utilized, for example, in the following two forms.

The segmentation apparatus may weight the a-modal mask by multiplying the a-modal mask result by at least one of the first confidence and the second confidence.

Alternatively, the segmentation apparatus may weight the a-modal mask by using at least one of the first confidence and the second confidence as an additional input to the "iterative mask head having a large receptive field and self-attention" that is used to predict the a-modal mask.

The segmentation apparatus may segment the instances using the a-modal mask weighted in either of the above ways.
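The first (multiplicative) form of weighting can be sketched as follows. This is only an illustration under stated assumptions: the description says the confidence map is based on at least one of the two confidences but does not say how they are combined, so the element-wise maximum used in `confidence_map`, the 0.25 threshold, and all function and variable names are hypothetical.

```python
def confidence_map(modal_prob, occlusion_prob):
    """Per-pixel confidence derived from the modal and occlusion masks.

    Assumption: the two confidences are combined with an element-wise
    maximum; the description does not fix the combination rule.
    """
    return [
        [max(m, o) for m, o in zip(m_row, o_row)]
        for m_row, o_row in zip(modal_prob, occlusion_prob)
    ]


def weight_amodal(amodal_prob, confidence):
    """Weight the a-modal mask by multiplying it with the confidence map."""
    return [
        [a * c for a, c in zip(a_row, c_row)]
        for a_row, c_row in zip(amodal_prob, confidence)
    ]


# Hypothetical 2 x 2 per-pixel probabilities.
amodal_prob = [[0.9, 0.8], [0.6, 0.2]]
modal_prob = [[0.9, 0.1], [0.5, 0.0]]
occlusion_prob = [[0.0, 0.5], [0.25, 0.0]]

conf = confidence_map(modal_prob, occlusion_prob)
weighted = weight_amodal(amodal_prob, conf)

# Assumed threshold of 0.25 to turn the weighted mask into a binary one.
segmentation = [[1 if p >= 0.25 else 0 for p in row] for row in weighted]
print(segmentation)
```

Pixels that neither the modal mask nor the occlusion mask supports receive a low weight and drop out of the final segmentation, even when the raw a-modal probability is non-zero.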

FIG. 2 is a diagram illustrating a framework of a segmentation apparatus according to an embodiment. Referring to FIG. 2, the framework of a segmentation apparatus is shown that predicts an a-modal mask (260), a modal mask (270), and an occlusion mask (280) from input image frames (205), and segments instances based on a combination of the a-modal mask (260), the modal mask (270), and the occlusion mask (280).

The segmentation apparatus is an apparatus that segments video objects at the instance level, and may delineate an object and its occluded portion in video data.

For example, when a sequence of image frames (205) is input, the segmentation apparatus may predict an a-modal mask (260) for every object (or every instance) in the current frame. Here, each a-modal mask (260) is the a-modal mask for one instance in the current frame, and that instance belongs to the set of instances detected in the current frame. Hereinafter, the terms 'object' and 'instance' may be used interchangeably.

The segmentation apparatus may extract features for all of the input image frames. The segmentation apparatus may extract these features using, for example, a backbone network such as ResNet.

The segmentation apparatus may estimate (210) an optical flow and, through flow warping (220), spatially align the feature corresponding to the previous frame with the feature corresponding to the current frame.

The segmentation apparatus may perform aggregation (230) on the aligned features to calculate a feature map (240). The segmentation apparatus may aggregate features across frames by, for example, 2D convolution and/or 3D convolution.

The segmentation apparatus may calculate the feature map (240) by performing, for example, spatial and temporal aggregation. In this case, the feature map (240) may include temporal information between the current frame and the adjacent frame.

Through the above process, the segmentation apparatus may utilize temporal information across the image frames. By integrating temporal information into the backbone used for a-modal prediction, the segmentation apparatus may infer the occluded region based on the current frame and the adjacent frame.

The segmentation apparatus may perform detection on the feature map (240) and, from the regions of interest (245) in the feature map (240) that correspond to the respective instances, extract the feature (250) of the ROI corresponding to a target instance. The segmentation apparatus may extract the feature (250) of the ROI from the feature map using, for example, a region proposal network (RPN). In this case, the feature (250) of the ROI may correspond to object-level features for each instance.

Given a candidate bounding box corresponding to the region of interest (245), the segmentation apparatus may extract the feature map for that candidate bounding box using ROI-Align. The features of the bounding box are then processed by a box head and/or a mask head, which regress a bounding box and a segmentation mask, respectively. The box head may consist of fully connected layers. Using the box head, the segmentation apparatus may calculate, for example, a classification prediction and the size and location of the bounding box. The mask head may consist of, for example, a stack of convolutional layers that generates a 28 x 28 class-specific mask.

The segmentation apparatus may use a cascaded box head based on a soft non-maximum suppression (Soft-NMS) technique so that the bounding boxes used for segmenting the a-modal mask do not duplicate one another.

The bounding box corresponding to a mask for a-modal instance segmentation (the 'a-modal mask') may overlap other boxes far more than the bounding box corresponding to a mask for modal instance segmentation (the 'modal mask'). The segmentation apparatus may keep the bounding boxes from overlapping by propagating the feature of the ROI more widely. To better propagate the information of each instance into the occluded region, the segmentation apparatus may use a mask head with a large receptive field and self-attention.

Based on the feature (250) of the ROI, the segmentation apparatus may predict the object class (255) corresponding to that ROI. The box head may handle overlapping bounding boxes by using soft thresholding during non-maximum suppression.

Given the feature (250) of the ROI, the segmentation apparatus may predict the a-modal mask (260) for each object using the iterative mask head described in Table 1 below. In this case, the segmentation apparatus may use an iterative mask head having a large receptive field and self-attention. As a result, the feature of the ROI can propagate to any position during mask prediction. The segmentation apparatus may iteratively predict the a-modal mask (260) by spatially propagating the feature of the ROI from the visible region of the target instance corresponding to the object class (255) to the occluded region of the target instance.

Based on the feature (250) of the ROI, the segmentation apparatus may predict the modal mask (270) corresponding to the visible region. Based on the feature (250) of the ROI, the segmentation apparatus may also predict the occlusion mask (280) corresponding to the occluded region.

The segmentation apparatus may segment the instances based on a combination of the a-modal mask (260), the modal mask (270), and the occlusion mask (280). The segmentation apparatus may calculate, for example, a first confidence (or objectness score) corresponding to the per-pixel probability of the modal mask (270). The segmentation apparatus may calculate a second confidence (or objectness score) corresponding to the per-pixel probability of the occlusion mask (280). The segmentation apparatus may weight the a-modal mask (260) by a confidence map based on at least one of the first confidence and the second confidence. The segmentation apparatus may segment the instances using the weighted a-modal mask.

The segmentation apparatus according to an embodiment may train the modal mask (270) and the occlusion mask (280) jointly with the a-modal mask (260). The segmentation apparatus may train each of the masks using, for example, a standard Mask-RCNN mask head. The segmentation apparatus may sum the negative log-likelihoods, for example, as in Equation 1 below.

Here, the three terms of Equation 1 may correspond to the distributions of the a-modal mask (260), the modal mask (270), and the occlusion mask (280), respectively. In addition, '*' may denote the ground-truth mask corresponding to each mask.

The pixel positions in each mask may be modeled as independent and as following a Bernoulli distribution. The overall loss function is summed over the training set and may include the classification loss and the box regression loss of Mask-RCNN. The segmentation apparatus may improve accuracy by applying label smoothing to the bounding box classification for the a-modal mask (260).
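A summed negative log-likelihood under the per-pixel Bernoulli assumption can be sketched as follows. This is an illustrative reading and not a reproduction of Equation 1; the function names, the dictionary keys, and the clamping constant `eps` are assumptions, and the classification and box regression terms are omitted.

```python
import math


def bernoulli_nll(pred, target, eps=1e-7):
    """Negative log-likelihood of a binary mask under independent
    per-pixel Bernoulli probabilities."""
    total = 0.0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total


def joint_mask_loss(preds, targets):
    """Sum the NLL over the a-modal, modal, and occlusion masks."""
    return sum(
        bernoulli_nll(preds[name], targets[name])
        for name in ("amodal", "modal", "occlusion")
    )


# Hypothetical 1 x 2 masks: predicted probabilities and ground truth.
preds = {
    "amodal": [[0.5, 0.5]],
    "modal": [[0.5, 0.5]],
    "occlusion": [[0.5, 0.5]],
}
targets = {
    "amodal": [[1, 1]],
    "modal": [[1, 0]],
    "occlusion": [[0, 1]],
}
loss = joint_mask_loss(preds, targets)
print(loss)
```

With every probability at 0.5, each of the six pixels contributes log 2, so the joint loss is 6 log 2 whatever the ground truth is; the loss falls only as predictions move toward the ground-truth masks.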

FIG. 3 is a diagram for explaining warping according to an embodiment. FIG. 3(a) shows a result in which, due to motion, the features of two image frames are not aligned with the current frame. FIG. 3(b) shows a result in which the features of the two image frames are spatially aligned with the current frame by performing flow alignment. In FIGS. 3(a) and 3(b), the portions hatched in black across the two image frames may correspond to the same object or the same instance.

To capture temporal information for a-modal mask prediction, the segmentation apparatus may use a temporal backbone that aggregates features over time. The temporal backbone may extract features from an input of multiple frames.

The segmentation apparatus may, for example, align the features using an optical flow that represents pixel-level motion between the current frame and an adjacent frame, and aggregate all the features in the coordinates of the current frame. The segmentation apparatus may estimate the optical flow and, based on the estimated optical flow, spatially align the extracted features with the current frame through warping.

FIG. 3(a) shows a situation in which the object shown in black moves from its position in the previous frame toward the lower right in the current frame. Intuitively, when the mask is predicted, the features of the object should be located at the object's position in the current frame. However, when the object moves, the features of the object may not be aggregated accurately over time, as in FIG. 3(a).

As shown in FIG. 3(b), the segmentation apparatus may warp the previous frame to the current frame using an optical flow that aligns the object in the previous frame with the current frame, so that the features of the object are accurately aggregated into the current frame even when the object moves.

The process by which the segmentation apparatus spatially aligns the features with the current frame ('flow alignment') is as follows.

For example, when the current frame and the previous frame are input, the segmentation apparatus may estimate an optical flow representing the pixel-level motion between the two frames. Given a pixel at a certain position in the current frame, there may be a corresponding position of the same pixel in the previous frame. In this case, the optical flow may correspond to the relative offset between the positions of the two pixels, for example, as in Equation 2 below.
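As a hedged sketch of the relation described above, write $p$ for the position of a pixel in the current frame and $q$ for the position of the same pixel in the previous frame. Both symbols, the flow symbol $\mathcal{F}$, and the sign convention are assumptions made for illustration. The optical flow at $p$ is then the relative offset between the two positions:

```latex
\[
  \mathcal{F}(p) = q - p
  \quad\Longleftrightarrow\quad
  q = p + \mathcal{F}(p)
\]
```

Under this convention, a pixel at $p = (4, 7)$ whose previous position was $q = (2.5, 6)$ has flow $\mathcal{F}(p) = (-1.5, -1)$.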

The segmentation apparatus may use the optical flow to spatially align the features of each frame through warping. The segmentation apparatus may extract the features of the current frame and the previous frame using, for example, a ResNet backbone. Here, H, W, and C may denote the height, the width, and the number of feature channels, respectively. The segmentation apparatus may warp the features using a bilinear kernel so that sampling is differentiable on all channels. Because the optical flow is a continuous estimate that does not necessarily coincide with positions on the integer grid, the segmentation apparatus may aggregate the features through bilinear interpolation over the four nearest points.

The segmentation apparatus may obtain the warped feature through, for example, Equation 3 below. Equation 3 may correspond to an equation that performs bilinear interpolation.
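Flow warping with bilinear interpolation can be sketched in plain Python as follows. This is a single-channel illustration under stated assumptions, not the exact form of Equation 3: the flow at each current-frame pixel is taken to be the offset to the matching position in the previous frame, samples that fall outside the grid contribute zero, and the names `sample_bilinear` and `warp` are hypothetical.

```python
import math


def sample_bilinear(feat, x, y):
    """Bilinearly sample a 2D grid feat[y][x] at a real-valued position."""
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)  # top-left integer neighbor
    dx, dy = x - x0, y - y0
    neighbors = (
        (x0, y0, (1 - dx) * (1 - dy)),
        (x0 + 1, y0, dx * (1 - dy)),
        (x0, y0 + 1, (1 - dx) * dy),
        (x0 + 1, y0 + 1, dx * dy),
    )
    value = 0.0
    for qx, qy, weight in neighbors:
        if 0 <= qx < w and 0 <= qy < h:  # out-of-grid points contribute zero
            value += weight * feat[qy][qx]
    return value


def warp(prev_feat, flow):
    """Warp previous-frame features into current-frame coordinates.

    flow[y][x] = (u, v) is the assumed offset from pixel (x, y) in the
    current frame to its position in the previous frame.
    """
    h, w = len(prev_feat), len(prev_feat[0])
    return [
        [
            sample_bilinear(prev_feat, x + flow[y][x][0], y + flow[y][x][1])
            for x in range(w)
        ]
        for y in range(h)
    ]


prev_feat = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
flow = [[(0.5, 0.0)] * 3 for _ in range(2)]  # uniform half-pixel shift in x
warped = warp(prev_feat, flow)
print(warped)
```

A multi-channel feature of shape H x W x C would apply the same sampling weights to every channel, which is what keeps the operation differentiable with respect to both the features and the flow.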

Here, the first feature in Equation 3 denotes the feature at time (t-1) (the 'warped feature'), and its value may be evaluated at a point with floating-point or integer coordinates. The second feature denotes the feature at time t (the 'reference feature'), and its value may be evaluated at a point with integer coordinates.

In this case, because the sampling positions are points with floating-point or integer coordinates, the segmentation apparatus cannot always read the corresponding value directly from the feature before warping. This is because the feature values before warping exist only on the integer grid. Accordingly, the segmentation apparatus may calculate the value of the warped feature by bilinear interpolation over the four points of the pre-warping feature that are adjacent to the sampling position.

The neighborhood used in Equation 3 may be the set of the four points on the integer grid that are closest to the sampling position (top-left, top-right, bottom-left, and bottom-right). Here, the integer point at the upper left of the sampling position is obtained by rounding each coordinate down. The segmentation apparatus may warp the spatial location of each feature according to the motion given by the optical flow.

For example, assume that the sampling position is (2.3, 3.5). In this case, the neighborhood corresponds to the integer points surrounding that position, that is, the set consisting of (2, 3), (3, 3), (2, 4), and (3, 4). Accordingly, when estimating the feature value at (2.3, 3.5), the segmentation apparatus may compute a weighted sum of the feature values at these four points. The weighting term in Equation 3 may be the weight used for the bilinear interpolation.
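The worked example above can be checked numerically. The sketch below computes the four integer neighbors of (2.3, 3.5) and their standard bilinear weights; the function name `bilinear_neighbors` is hypothetical, and the weights follow the usual bilinear formula rather than a form quoted from Equation 3.

```python
import math


def bilinear_neighbors(x, y):
    """Return the four nearest integer grid points and their bilinear weights."""
    x0, y0 = math.floor(x), math.floor(y)  # top-left integer point
    dx, dy = x - x0, y - y0                # fractional offsets in [0, 1)
    return {
        (x0, y0): (1 - dx) * (1 - dy),
        (x0 + 1, y0): dx * (1 - dy),
        (x0, y0 + 1): (1 - dx) * dy,
        (x0 + 1, y0 + 1): dx * dy,
    }


weights = bilinear_neighbors(2.3, 3.5)
for point, weight in sorted(weights.items()):
    print(point, round(weight, 2))
```

The points (2, 3) and (2, 4) each receive a weight of about 0.35 and the points (3, 3) and (3, 4) about 0.15, so the weights sum to 1 and the nearer column dominates the interpolated value.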

FIG. 4 is a diagram for explaining a process of calculating a feature map according to an embodiment. FIG. 4 shows a process in which the segmentation apparatus according to an embodiment aggregates features spatially and temporally through a Conv3D layer, a Conv2D layer, and a skip connection.

When an object moves, the features of the object may not be aggregated accurately over time. As described above with reference to FIG. 3(b), the segmentation apparatus may warp the previous frame to the current frame using an optical flow that aligns the object in the previous frame with the current frame, and may thereby calculate a feature map in which the features are accurately aggregated even when the object moves.

The segmentation apparatus may generate the feature map by concatenating the warped feature with the feature of the current frame along the time dimension and aggregating the result. The feature map may be calculated by, for example, Equations 4 and 5 below.
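The concatenate-then-aggregate step can be illustrated with a deliberately simplified stand-in. Instead of the learned Conv3D and Conv2D layers of FIG. 4, the sketch below uses a fixed per-pixel weighted sum over the time axis, followed by a skip connection from the current-frame feature. The weights and the function name `aggregate` are hypothetical, and the sketch does not reproduce Equations 4 and 5.

```python
def aggregate(warped_prev, current, w_prev=0.5, w_cur=0.5):
    """Stack two aligned single-channel maps along time and fuse them.

    Stand-in for the learned aggregation: a weighted sum over the time
    axis plus a skip connection from the current-frame feature.
    """
    stacked = [warped_prev, current]  # shape: (time=2, H, W)
    weights = [w_prev, w_cur]         # hypothetical temporal kernel
    h, w = len(current), len(current[0])
    fused = [
        [sum(weights[t] * stacked[t][y][x] for t in range(2)) for x in range(w)]
        for y in range(h)
    ]
    # Skip connection: add the current-frame feature back to the fused result.
    return [[fused[y][x] + current[y][x] for x in range(w)] for y in range(h)]


feature_map = aggregate([[2.0, 4.0]], [[0.0, 2.0]])
print(feature_map)
```

Because the previous-frame feature has already been warped into current-frame coordinates, the fusion can operate position by position without mixing features of different objects.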

Once the feature map is calculated, the segmentation apparatus may share the feature map with the box head and the mask heads to predict a bounding box and the corresponding a-modal mask.

FIG. 5 is a diagram illustrating intersection over union (IoU) histograms for bounding boxes for a-modal segmentation and bounding boxes for modal segmentation according to an embodiment. FIG. 5(a) is a graph showing the IoU of the bounding boxes for the modal mask and for the a-modal mask on the SAILVOS training set. FIG. 5(b) is a graph showing the IoU of the bounding boxes for the modal mask and for the a-modal mask on the COCO-A training set.

When an image contains an occluded region, predicting the bounding box for the a-modal mask is difficult because the size and shape of the object must be inferred despite the occlusion. In addition, when an image frame contains heavily occluded regions, the segmentation apparatus may erroneously remove suitable bounding box candidates through non-maximum suppression.

The segmentation apparatus may successively refine an initially inaccurate a-modal mask prediction by filtering the bounding boxes (bounding box candidates) corresponding to each of the instances using a cascade structure based on a non-maximum suppression technique. Unlike in the modal setting, the ground-truth bounding box of an a-modal mask may overlap more frequently with bounding boxes that represent other objects.

Given the feature map, the segmentation apparatus may successively refine the feature of the ROI corresponding to each of the instances included in the current frame over a series of L stages of the cascade structure.

More specifically, at each stage the segmentation apparatus may apply ROI Align to the set of object candidates for that stage to spatially crop the feature of the ROI related to each object. Here, ROI Align may be understood as follows: because moving bounding boxes onto adjacent pixels for object detection can distort the location information of the input image, ROI Align uses linear interpolation or the like to match the sizes of the bounding boxes corresponding to the regions of interest (ROIs) while preserving the location information.

The segmentation apparatus may predict a new bounding box from the feature of each ROI through a box regressor. In this case, the bounding box may include an objectness score and a location corresponding to the ROI. The segmentation apparatus may filter the set of predicted bounding boxes and objectness scores based on Soft-NMS to prepare the set of object candidates used in the next stage.
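The Soft-NMS filtering mentioned above can be sketched as follows. This is a minimal illustration of the Gaussian variant of Soft-NMS, which decays the scores of overlapping boxes instead of discarding them; the description does not state which decay function or parameters are used, so `sigma`, `score_thresh`, the box format, and the function names are assumptions.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay scores of overlapping boxes rather than delete them."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        box, score = remaining.pop(best)
        kept.append((box, score))
        decayed = []
        for other, s in remaining:
            s *= math.exp(-(iou(box, other) ** 2) / sigma)
            if s > score_thresh:  # drop only boxes whose score has collapsed
                decayed.append((other, s))
        remaining = decayed
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
for box, score in kept:
    print(box, round(score, 3))
```

The heavily overlapping second box survives with a reduced score rather than being removed, which matters in the a-modal setting, where the boxes of distinct objects legitimately overlap.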

For example, given the two inputs of the first stage of the cascade structure, the segmentation apparatus may evaluate Equations 6 to 8 below.

The segmentation apparatus may use the object candidate set produced by the cascade to refer to the final set of detected objects.

To obtain the starting point of the cascade, the segmentation apparatus may estimate an initial object candidate set using a region proposal network (RPN). Given the object candidate set, the segmentation apparatus may predict the final set of a-modal segmentations.

FIG. 6 is a diagram for explaining a method of spatially propagating the feature of an ROI by expanding the receptive field, according to an embodiment. FIG. 6 shows a diagram (610) illustrating a situation in which a small receptive field (615) captures little information about a person in an occluded region, and a diagram (630) illustrating a situation in which an expanded receptive field (635) captures information about a large background area in addition to information about the person.

The receptive field (615) in diagram (610) is small and can therefore capture only local information corresponding to a part of the image. When the receptive field (615) covers an occluded region, it can capture little information about the person.

In contrast, the receptive field (635) in diagram (630) is large and can capture information about a large area of the image. In other words, when the receptive field (635) is large, features may be aggregated over a large background area in addition to, for example, the person corresponding to the object of interest or the target instance. The segmentation apparatus may aggregate the features of a wider area through the expanded receptive field (635) while using self-attention to focus on the feature of the ROI corresponding to the instance, so that the a-modal mask is predicted from the relevant features.

To predict the a-modal mask, the segmentation apparatus may spatially propagate features from the visible part of the object of interest (for example, a person) to the occluded region while ignoring the occluder, such as the book shown in FIG. 6. The segmentation apparatus may cause background features to be gradually ignored through the self-attention of the iterative mask head. The iterative mask head may predict the a-modal mask corresponding to the object class predicted based on the feature of the ROI.

The segmentation apparatus may predict the a-modal mask by focusing on the target instance included in the image through an iterative mask head that operates, for example, as in Module 1 described in Table 1 below.

Once the feature map and an object candidate are obtained, the segmentation apparatus may output the corresponding a-modal mask. To this end, the segmentation apparatus may first crop the feature of the ROI using the ROI Align described above. In this case, the feature of the ROI may correspond to an object-level feature.

The segmentation device may pass the feature of the region of interest through, for example, nine convolution layers so that the receptive field covers the entire input, which has a spatial size of, for example, 14 x 14. By passing the features corresponding to the visible region through the convolution layers, the segmentation device extends the receptive field to the occluded region and thereby spatially propagates the feature of the region of interest.
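As a rough check of the receptive-field statement, the sketch below computes how wide a window a stack of identical convolution layers can see. The 3 x 3 kernel size and the stride of 1 are assumptions made for illustration; the description gives only the layer count (nine) and the 14 x 14 input size.

```python
def stacked_receptive_field(num_layers, kernel_size=3, stride=1):
    """Width, in input pixels, of the receptive field of `num_layers`
    identical convolution layers (kernel size and stride are assumed)."""
    rf = 1      # a single input pixel sees only itself
    jump = 1    # distance in input pixels between neighboring outputs
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


rf = stacked_receptive_field(9)   # nine layers, as in the description
wider_than_roi = rf >= 14         # compare against the 14 x 14 input
```

Under these assumed settings the window is wider than the 14-pixel input, so an output position near the center of the region of interest can draw on the whole input, including visible features on the far side of an occluded region. Positions near the border see less of the opposite side.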

Using deconvolution layers, the segmentation device may convert the spatial dimension of the feature of the region of interest from a spatial size of 14 x 14 to an expanded spatial size of 28 x 28.
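The 14 x 14 to 28 x 28 expansion can be checked with the standard output-size formula for a transposed convolution (deconvolution). The 2 x 2 kernel and the stride of 2 below are assumptions chosen because they double the spatial size; the description does not state the deconvolution hyperparameters.

```python
def deconv_output_size(size, kernel_size=2, stride=2, padding=0,
                       output_padding=0):
    """Output length of a transposed convolution along one spatial axis
    (dilation 1). Hyperparameter defaults are illustrative assumptions."""
    return (size - 1) * stride - 2 * padding + kernel_size + output_padding


expanded = deconv_output_size(14)   # side length after upsampling
```

Other settings, such as a 4 x 4 kernel with stride 2 and padding 1, give the same doubling, so the formula alone does not pin down which configuration an implementation uses.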

FIG. 7 is a diagram for explaining a method of iteratively correcting an a-modal mask predicted for an object class according to an embodiment. FIG. 7 shows a process in which the segmentation device according to an embodiment iteratively corrects the predicted a-modal mask based on the feature 710 of the region of interest, which has the expanded spatial size of 28 x 28. The segmentation device may correct the predicted a-modal mask through, for example, L2 = 3 iterations.

Based on the feature 710 of the region of interest, the segmentation device may predict an initial attention mask 720 corresponding to the object class. Here, the initial attention mask 720 may be an a-modal attention mask corresponding to all channels of the feature of the region of interest.

The segmentation device may extract an initial mask 730 corresponding to the object class from the initial attention mask 720.

The segmentation device may generate the a-modal mask by repeatedly applying the feature 710 of the region of interest to the initial mask 730.

More specifically, the segmentation device may perform first masking by applying the initial mask 730 to the feature 710 of the region of interest, thereby generating a first masked feature 740.

Based on the first masked feature 740, the segmentation device may predict an attention mask 760 corresponding to the object class. The segmentation device may generate an attention mask 750 by applying convolution and sigmoid operations to the first masked feature 740. The segmentation device may extract a mask 760 corresponding to the object class from the attention mask 750.

The segmentation device may perform second masking by applying the mask 760 to the feature 710 of the region of interest, thereby generating a second masked feature 770. Based on the second masked feature 770, the segmentation device may generate the a-modal mask 780. The segmentation device may segment the instances using the a-modal mask 780.

The process described above may be expressed as the operation of Module 2 in Table 2 below.

For example, suppose that, at each level of the cascade structure, the segmentation device has obtained the feature of the region of interest and bounding boxes including the scores (s) and sizes (b) of all classes included in the region of interest.

The segmentation device may predict the object class based on the scores of all classes included in the region of interest, obtained from the bounding boxes corresponding to each of the instances. The segmentation device may predict the object class based on the score (s) of the object. Here, s[k] is the score (s) of class k.
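A minimal sketch of the class prediction step follows. It assumes that the predicted class is simply the index k with the highest score s[k]; the text says only that the class is predicted from the scores, so this selection rule and the function name `predict_class` are illustrative assumptions.

```python
def predict_class(scores):
    """Return the class index k with the largest score s[k].

    Assumption: highest-score selection. The description states only
    that the object class is predicted based on the class scores.
    """
    if not scores:
        raise ValueError("scores must contain at least one class")
    return max(range(len(scores)), key=lambda k: scores[k])


s = [0.1, 0.7, 0.2]          # hypothetical scores for three classes
k_hat = predict_class(s)     # index of the highest-scoring class
```

An implementation might additionally exclude a background class or apply a score threshold before selecting; neither is specified in the description.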

Using the features 710, 740, and 770 of the region of interest, the segmentation device may predict the a-modal attention masks 720 and 750 for all object classes. The segmentation device may element-wise multiply the attention channel for the predicted class with the features 710, 740, and 770 of the region of interest. Through the element-wise multiplication, the features of the background included in the image frame are gradually pushed toward 0, and the features of the object corresponding to the target instance may be emphasized.

Finally, the segmentation device may generate the a-modal mask 780 through a convolution layer.
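The iterative correction described with reference to FIG. 7 can be sketched in plain Python as follows. This is a simplified illustration, not the patented implementation: it uses a per-pixel 1 x 1 convolution in place of the actual convolution layers, shares one set of weights across all steps, and all function names are hypothetical. Features are stored as channels x height x width nested lists.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1x1(features, weights, bias):
    """Per-pixel linear map from C feature channels to K class channels."""
    h, w = len(features[0]), len(features[0][0])
    return [[[bias[k] + sum(weights[k][c] * features[c][y][x]
                            for c in range(len(features)))
              for x in range(w)] for y in range(h)]
            for k in range(len(weights))]


def attention_for_class(features, weights, bias, k):
    """Predict attention for all classes, then keep the channel of class k."""
    logits = conv1x1(features, weights, bias)[k]
    return [[sigmoid(v) for v in row] for row in logits]


def apply_mask(features, mask):
    """Element-wise multiply every feature channel by the mask."""
    return [[[f * m for f, m in zip(f_row, m_row)]
             for f_row, m_row in zip(channel, mask)]
            for channel in features]


def iterative_mask_head(features, weights, bias, k, attention_steps=2):
    masked = features
    for _ in range(attention_steps):
        mask = attention_for_class(masked, weights, bias, k)
        # As in FIG. 7, each mask is applied to the original ROI feature.
        masked = apply_mask(features, mask)
    # Final prediction from the last masked feature (weights reused here).
    return attention_for_class(masked, weights, bias, k)


# Toy 1 x 2 region: left pixel is "object", right pixel is "background".
roi = [[[2.0, 0.1]], [[1.0, 0.1]]]
amodal = iterative_mask_head(roi, weights=[[1.0, 1.0]], bias=[-1.0], k=0)
```

With two attention steps followed by the final prediction, the sketch makes three mask predictions in total, which is one possible reading of the L2 = 3 iterations mentioned for FIG. 7. In the toy input the background pixel's masked features shrink at each step while the object pixel keeps a high mask value.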

FIG. 8 is a block diagram of a segmentation device according to an embodiment. Referring to FIG. 8, the segmentation device 800 according to an embodiment may include a communication interface 810, a processor 830, a memory 850, and a display 870. The communication interface 810, the processor 830, the memory 850, and the display 870 may be connected to one another through a communication bus 805.

The communication interface 810 receives image frames including a current frame and at least one adjacent frame adjacent to the current frame. The at least one adjacent frame may be a previous frame adjacent to the current frame or a next frame adjacent to the current frame.

The processor 830 calculates a feature map aggregating the image frames based on temporal information between the current frame and the adjacent frame received through the communication interface 810. The processor 830 extracts, from the feature map, a feature of a region of interest corresponding to each of the instances included in the current frame. The processor 830 predicts an object class corresponding to the region of interest based on the feature of the region of interest. The processor 830 segments the instances by iteratively correcting the a-modal mask predicted for the object class based on the feature of the region of interest.

In addition, the processor 830 may perform at least one of the methods described above with reference to FIGS. 1 to 7, or an algorithm corresponding to the at least one method. The processor 830 may be a hardware-implemented segmentation device having circuitry with a physical structure for executing desired operations. For example, the desired operations may include code or instructions included in a program. For example, the hardware-implemented segmentation device 800 may include a microprocessor, a central processing unit (CPU), a graphics processing unit (GPU), a processor core, a multi-core processor, a multiprocessor, an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), a neural processing unit (NPU), and the like.

The processor 830 may execute a program and control the segmentation device 800. The program code executed by the processor 830 may be stored in the memory 850.

The memory 850 may store the image frames received through the communication interface 810. The memory 850 may store the a-modal mask generated by the processor 830 and/or the result of the processor 830 segmenting the instances. The memory 850 may also store the feature map calculated by the processor 830, the feature of the region of interest, and/or the object class predicted by the processor 830.

In this way, the memory 850 may store various information generated during the processing performed by the processor 830 described above. The memory 850 may also store various other data and programs. The memory 850 may include a volatile memory or a non-volatile memory. The memory 850 may include a mass storage medium, such as a hard disk, to store various data.

Depending on the embodiment, the segmentation device 800 may display the target image generated by the processor 830 through the display 870.

The segmentation device 800 according to an embodiment may correspond to devices in various fields, such as an advanced driver assistance system (ADAS), a head-up display (HUD) device, a 3D digital information display (DID), a navigation device, a neuromorphic device, a 3D mobile device, a smartphone, a smart TV, a smart vehicle, an Internet of Things (IoT) device, a medical device, and a measurement device. Here, the 3D mobile device may be understood to include, for example, a display device for displaying augmented reality (AR), virtual reality (VR), and/or mixed reality (MR), a head mounted display (HMD), and a face mounted display (FMD).

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware components and software components. For example, the devices, methods, and components described in the embodiments may be implemented using a general-purpose computer or a special-purpose computer, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and software applications running on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored in a computer-readable recording medium.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded in a computer-readable medium. The computer-readable medium may store program instructions, data files, data structures, and the like, alone or in combination, and the program instructions recorded on the medium may be specially designed and configured for the embodiments or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on them. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of a system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the following claims.

Claims

1. A segmentation method comprising:
receiving image frames including a current frame and at least one adjacent frame adjacent to the current frame;
calculating a feature map aggregating the image frames based on temporal information between the current frame and the adjacent frame;
extracting, from the feature map, a feature of a region of interest (ROI) corresponding to each of the instances included in the current frame;
predicting an object class corresponding to the region of interest based on the feature of the region of interest; and
segmenting the instances by iteratively correcting an a-modal mask predicted for the object class based on the feature of the region of interest.

2. The method of claim 1,
wherein segmenting the instances comprises:
predicting an a-modal mask corresponding to the object class based on the feature of the region of interest; and
segmenting the instances by repeating an operation of applying the predicted a-modal mask to the feature of the region of interest.

3. The method of claim 2,
wherein predicting the a-modal mask comprises:
predicting the a-modal mask corresponding to a target instance by propagating the feature of the region of interest from a visible region of the target instance corresponding to the object class to an occluded region.

4. The method of claim 3,
wherein predicting the a-modal mask comprises:
spatially propagating the feature of the region of interest by passing a feature corresponding to the visible region through convolution layers to extend a receptive field to the occluded region;
expanding the spatial dimension of the feature of the region of interest by deconvolution layers; and
predicting the a-modal mask corresponding to the target instance in the expanded spatial dimension.

5. The method of claim 1,
wherein segmenting the instances comprises:
iteratively predicting the a-modal mask by spatially propagating the feature of the region of interest from a visible region of a target instance corresponding to the object class to an occluded region of the target instance;
predicting a modal mask corresponding to the visible region based on the feature of the region of interest;
predicting an occlusion mask corresponding to the occluded region based on the feature of the region of interest; and
segmenting the instances based on a combination of the a-modal mask, the modal mask, and the occlusion mask.

6. The method of claim 5,
wherein segmenting the instances based on a combination of the a-modal mask, the modal mask, and the occlusion mask comprises:
calculating a first confidence corresponding to a pixel-wise probability of the modal mask;
calculating a second confidence corresponding to a pixel-wise probability of the occlusion mask;
weighting the a-modal mask by a confidence map based on at least one of the first confidence and the second confidence; and
segmenting the instances by the weighted a-modal mask.

7. The method of claim 1,
wherein segmenting the instances comprises:
predicting an initial attention mask corresponding to the object class based on the feature of the region of interest;
extracting an initial mask corresponding to the object class from the initial attention mask;
generating the a-modal mask by repeatedly applying the feature of the region of interest to the initial mask; and
segmenting the instances by the a-modal mask.

8. The method of claim 7,
wherein generating the a-modal mask comprises:
performing first masking by applying the initial mask to the feature of the region of interest;
predicting an attention mask corresponding to the object class based on the first masked feature;
performing second masking by applying the feature of the region of interest to the attention mask; and
generating the a-modal mask based on the second masked feature.

9. The method of claim 1,
wherein extracting the feature of the region of interest comprises:
extracting the feature of the region of interest from the feature map by a region proposal network (RPN).

10. The method of claim 1,
wherein extracting the feature of the region of interest comprises:
selecting, from among the instances, any one instance including an occluded region in the current frame; and
extracting a feature of a region of interest corresponding to the selected instance from the feature map by a region proposal network.

11. The method of claim 10,
wherein extracting the feature of the region of interest corresponding to the selected instance comprises:
filtering bounding boxes corresponding to each of the instances using a cascade structure based on a non-maximum suppression (NMS) technique, the bounding boxes including locations of the instances and objectness scores corresponding to the instances; and
extracting the feature of the region of interest corresponding to the selected instance from the feature map based on the filtered bounding boxes.

12. The method of claim 1,
wherein predicting the object class comprises:
calculating, based on the feature of the region of interest, scores of all object classes included in the region of interest by bounding boxes corresponding to each of the instances; and
predicting the object class based on the scores of all the object classes.

13. The method of claim 1,
wherein calculating the feature map comprises:
extracting features corresponding to the image frames;
spatially aligning the features with the current frame by warping between the current frame and the adjacent frame; and
calculating the feature map by aggregating the aligned features.

14. The method of claim 13,
wherein spatially aligning the features with the current frame comprises:
estimating an optical flow representing pixel-level motion between the current frame and the adjacent frame; and
spatially aligning the features with the current frame through the warping based on the optical flow.

15. A computer program stored in a computer-readable recording medium, in combination with hardware, to execute the method of any one of claims 1 to 14.

16. A segmentation device comprising:
a communication interface configured to receive image frames including a current frame and at least one adjacent frame adjacent to the current frame; and
a processor configured to calculate a feature map aggregating the image frames based on temporal information between the current frame and the adjacent frame, extract, from the feature map, a feature of a region of interest corresponding to each of the instances included in the current frame, predict an object class corresponding to the region of interest based on the feature of the region of interest, and segment the instances by iteratively correcting an a-modal mask predicted for the object class based on the feature of the region of interest.

17. The segmentation device of claim 16,
wherein the processor is configured to
segment the instances by predicting an a-modal mask corresponding to the object class based on the feature of the region of interest and repeating an operation of applying the predicted a-modal mask to the feature of the region of interest.

18. The segmentation device of claim 17,
wherein the processor is configured to
predict the a-modal mask corresponding to a target instance by propagating the feature of the region of interest from a visible region of the target instance corresponding to the object class to an occluded region.

19. The segmentation device of claim 16,
wherein the processor is configured to
iteratively predict the a-modal mask by spatially propagating the feature of the region of interest from a visible region of a target instance corresponding to the object class to an occluded region of the target instance, predict a modal mask corresponding to the visible region based on the feature of the region of interest, predict an occlusion mask corresponding to the occluded region based on the feature of the region of interest, and segment the instances based on a combination of the a-modal mask, the modal mask, and the occlusion mask.

20. The segmentation device of claim 16,
wherein the processor is configured to
predict an initial attention mask corresponding to the object class based on the feature of the region of interest, extract an initial mask corresponding to the object class from the initial attention mask, generate the a-modal mask by repeatedly applying the feature of the region of interest to the initial mask, and segment the instances by the a-modal mask.