KR102169243B1 - Semantic segmentation method of 3D reconstructed model using incremental fusion of 2D semantic predictions

Info

Publication number: KR102169243B1
Application number: KR1020180170999A
Authority: KR
Inventors: 이승용; 전준호; 정진웅; 김준건
Original assignee: 포항공과대학교 산학협력단
Priority date: 2018-12-27
Filing date: 2018-12-27
Publication date: 2020-10-23
Also published as: KR20200080970A

Abstract

The present invention relates to a method for semantic segmentation of a 3D reconstructed model through incremental fusion of 2D semantic segmentation information. From a continuous color and depth image stream from a consumer-grade depth camera, the method performs 3D reconstruction and, at the same time, incremental semantic segmentation of the reconstructed model. The method comprises: (a) for multiple 2D input images, performing deep-learning-based per-pixel semantic segmentation using the color image (RGB) of each input image and the corresponding depth image (Depth) to obtain, for each pixel, probability information over object classes; (b) updating a voxel grid with the probability information obtained for each pixel by raycasting; (c) extracting a mesh model from the voxel grid by the Marching Cubes algorithm; and (d) performing semantic segmentation of the 3D reconstructed model by selecting, for each vertex of the mesh model, the class with the highest probability.

Description

Semantic segmentation method of 3D reconstructed model using incremental fusion of 2D semantic predictions

The present invention relates to a method of fusing the semantic segmentation information of multiple 2D images when performing 3D semantic segmentation of a 3D model reconstructed from the color and depth image stream of an RGBD camera.

3D reconstruction means acquiring 3D position and color information about an object or environment of interest using various scanning devices such as laser scanners and structured-light depth cameras.

With the advent of consumer-grade depth cameras such as the Kinect and the development of various algorithms, the technology has matured from a level where only small targets (e.g., a person) could be reconstructed offline to one where real-time reconstruction is possible without expensive scanning equipment. Starting with KinectFusion, which popularized real-time 3D reconstruction, large-scale reconstruction techniques such as Voxel Hashing and BundleFusion, which lifted the limit on the size of the reconstructed space, appeared one after another.

The typical pipeline for 3D reconstruction with a depth camera is as follows. First, the pose of the camera (or of a rotating object) is computed for every frame. The pose is usually obtained with a variant of ICP (Iterative Closest Point) that uses a point-to-plane error, and the inputs to this algorithm are the current depth image and the depth obtained by raycasting the model.

Unlike expensive scanners, consumer-grade depth cameras produce depth values with considerable noise, and averaging in a TSDF (Truncated Signed Distance Function) representation is used to address this. TSDF values are computed from the images arriving continuously from the depth camera and stored in a pre-allocated voxel grid; if a voxel already holds a TSDF value, the new value is computed as a weighted sum with the existing one. After scanning is complete, the Marching Cubes algorithm is applied to the voxel grid holding the TSDF values to extract the final mesh of the model.
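The weighted TSDF update described above can be sketched as a running weighted average per voxel. This is only an illustration under stated assumptions, not the implementation used in the patent: the helper name `fuse_tsdf`, the truncation distance of 0.05, and the per-observation weight of 1.0 are all hypothetical.

```python
def fuse_tsdf(stored_tsdf, stored_weight, observed_sdf,
              truncation=0.05, obs_weight=1.0):
    """Merge one signed-distance observation into a voxel (hypothetical helper).

    stored_tsdf, stored_weight: value and accumulated weight already in the voxel
    observed_sdf: signed distance measured for this voxel in the current frame
    truncation, obs_weight: assumed values, chosen for illustration only
    """
    # Truncate the observation to [-truncation, +truncation].
    clipped = max(-truncation, min(truncation, observed_sdf))
    new_weight = stored_weight + obs_weight
    # Weighted sum of the existing value and the new observation.
    new_tsdf = (stored_tsdf * stored_weight + clipped * obs_weight) / new_weight
    return new_tsdf, new_weight
```

An empty voxel can be represented with a stored weight of 0, in which case the first observation simply replaces the stored value.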

Several previous studies have addressed classifying objects in 3D reconstructed models.

In "A search-classify approach for cluttered indoor scene understanding" by Liangliang Nan, Ke Xie, and Andrei Sharf [Reference 1], a model in point-cloud form is over-segmented into patches, and a classification likelihood is computed for regions built by accumulating these patches. Using this likelihood, the current partial model is replaced with the most similar clean model. However, because the approach relies on features computed on the point cloud, its accuracy is low and it applies only to a rather limited set of classes.

Unlike [Reference 1], "ScanNet: Richly-annotated 3D reconstructions of indoor scenes" by Dai A., Chang A. X., Savva M., Halber M., Funkhouser T., and Niessner M. [Reference 2] uses voxelized data of the entire scene as input to a 3D CNN (Convolutional Neural Network) and obtains a label for each voxel as output.

"An interactive approach to semantic modeling of indoor scenes with an RGBD camera" by Tianjia Shao, Weiwei Xu, Kun Zhou, Jingdong Wang, Dongping Li, and Baining Guo [Reference 3] applies CRF-based semantic segmentation to RGBD images and classifies the segmented objects with a random regression forest. It also uses SIFT (Scale-Invariant Feature Transform), RANSAC (RANdom SAmple Consensus), and semantic segmentation labels for scene registration. However, this technique is difficult to combine with KinectFusion-style large-scale 3D reconstruction algorithms.

[Reference 1] Liangliang Nan, Ke Xie, and Andrei Sharf. "A search-classify approach for cluttered indoor scene understanding," ACM Trans. on Graph., 31(6):137:1-137:10, 2012.
[Reference 2] Dai A., Chang A. X., Savva M., Halber M., Funkhouser T., Niessner M. "ScanNet: Richly-annotated 3D reconstructions of indoor scenes," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[Reference 3] Tianjia Shao, Weiwei Xu, Kun Zhou, Jingdong Wang, Dongping Li, and Baining Guo. "An interactive approach to semantic modeling of indoor scenes with an RGBD camera," ACM Trans. on Graph., 31(6):136:1-11, 2012.
[Reference 4] Seong-Jin Park, Ki-Sang Hong, Seungyong Lee. "RDFNet: RGB-D Multi-Level Residual Feature Fusion for Indoor Semantic Segmentation," The IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4980-4989.
[Reference 5] Dai A., Niessner M., Zollhofer M., Izadi S., Theobalt C. "BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration," ACM Transactions on Graphics (TOG) 36, 3 (2017), 24.
[Reference 6] Lorensen W. E., Cline H. E. "Marching cubes: A high resolution 3D surface construction algorithm," In ACM Transactions on Graphics (TOG) (1987), vol. 21, ACM, pp. 163-169.

An object of the present invention is to provide a method for incrementally fusing 2D semantic segmentation information so that, from a continuous color and depth image stream from a consumer-grade depth camera, 3D reconstruction and incremental semantic segmentation of the reconstructed model can be performed at the same time.

To achieve the above object, the method for semantic segmentation of a 3D reconstructed model through incremental fusion of 2D semantic segmentation information according to the present invention comprises: (a) for multiple 2D input images, performing deep-learning-based per-pixel semantic segmentation using the color image (RGB) of each input image and the corresponding depth image (Depth) to obtain, for each pixel, probability information over object classes; (b) updating a voxel grid with the probability information obtained for each pixel by raycasting; (c) extracting a mesh model from the voxel grid by the Marching Cubes algorithm; and (d) performing semantic segmentation of the 3D reconstructed model by selecting, for each vertex of the mesh model, the class with the highest probability.

In the above method, step (b) is characterized in that a weight determined according to the object's distance from the camera and the foreground/background boundaries is applied to each pixel's per-class probability information, which is then merged by weighted averaging into the per-class probability information stored in the voxel corresponding to that pixel.

In the above method, the 20 probabilities of each vertex of the mesh model in step (c) are determined through bilinear interpolation.

In the above method, the pixel probability information, the weights, and the vertex probabilities are each vectors whose dimension equals the number of object classes.

In the above method, the class probability of each voxel integrated up to the t-th frame in step (b) is calculated by the equation given as Equation 1 in the detailed description, which combines the class probability and confidence weight of the voxel integrated up to the (t-1)-th frame with the class probability and confidence weight of pixel p in the t-th frame. The confidence weight is calculated by the equation given as Equation 2, from a depth-based accuracy weight and a foreground/background boundary misalignment weight.

In the above method, the foreground/background boundary misalignment weight is calculated by the equation given as Equation 3, from the depth value of the current pixel, the minimum and maximum depth values within a window centered on that pixel, and two positive constants.

[Equations 1 to 3 and their symbols: not reproduced in this text]

To achieve the above object, a computer-readable recording medium according to the present invention records a program for executing, on a computer, the above method for semantic segmentation of a 3D reconstructed model through incremental fusion of 2D semantic segmentation information.

To achieve the above object, a computer program according to the present invention is stored in a medium in order to execute the above method on a computer.

According to the present invention, just as geometry is completed incrementally in 3D reconstruction with an RGBD camera, semantic segmentation probabilities from 2D images are used to obtain the final 3D semantic segmentation of the 3D model. Consequently, given sufficient computing power, the method can readily be adapted to run in real time. It also avoids problems faced by other existing methods (such as loss of resolution in large-scale reconstructed models due to memory limits), so that semantic segmentation results can be obtained for large-scale reconstructed models while fine geometric detail is preserved.

Such semantic segmentation results can be used in augmented reality, virtual reality, and other settings that require varied interaction with users. As a simple example, when rearranging indoor furnishings, a user can use the segmentation result to separate a desired object (for example, a chair or a desk) from the original mesh and move it freely.

FIG. 1 is a diagram explaining the overall flow of the method for semantic segmentation of a 3D reconstructed model through incremental fusion of 2D semantic segmentation information according to the present invention.
FIG. 2 shows the weight maps used for fusing semantic segmentation information.
FIG. 3 shows the prediction accuracy of a recent deep-learning-based semantic segmentation technique as a function of distance.
FIG. 4 shows a visualization of the probabilities of the main classes and the final semantic segmentation result.
FIG. 5 shows semantic segmentation results according to the present invention for an indoor environment reconstruction.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

The overall flow of the method for semantic segmentation of a 3D reconstructed model through incremental fusion of 2D semantic segmentation information according to the present invention is shown in FIG. 1.

First, deep-learning-based semantic segmentation is performed on the color and depth input images to obtain, for each pixel, probability information over object classes. If 20 classes are to be distinguished, each pixel carries the probabilities of those 20 classes. This information is stored and fused, by raycasting, in the voxels corresponding to each pixel. During fusion, the weight of each 2D semantic segmentation result is determined adaptively according to the object's distance from the camera and the foreground/background boundaries.

Once all frames of the image stream have been processed in this way, a mesh model is extracted using the Marching Cubes algorithm (see [Reference 6]). The 20 probabilities of each mesh vertex are determined through bilinear interpolation during the Marching Cubes process.

All of the probabilities determined so far come from semantic segmentation performed on 2D images, each of which sees only part of an object's shape and color.

To determine the class of each vertex of the final mesh, the class with the highest probability is simply selected.
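The final labelling step amounts to an argmax over each vertex's class probabilities. A minimal sketch, assuming the vertex probabilities are available as a list of per-vertex lists (the function name `label_vertices` and this data layout are illustrative, not taken from the patent):

```python
def label_vertices(vertex_probs):
    """Return, for each vertex, the index of its most probable class.

    vertex_probs: one list of class probabilities per mesh vertex
    (an assumed layout; the text uses 20 classes as its example).
    """
    labels = []
    for probs in vertex_probs:
        # Index of the largest probability; ties go to the lowest index.
        best = max(range(len(probs)), key=lambda c: probs[c])
        labels.append(best)
    return labels
```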

In the fusion process, the update of Equation 1 is performed for the class information of each voxel.

[Equation 1: not reproduced in this text]

Here, the first pair of symbols denotes the class probability and confidence weight of a voxel integrated up to the (t-1)-th frame, and the second pair denotes the class probability and confidence weight of pixel p in the t-th frame. If there are 20 classes to classify, the class probability is a 20-dimensional vector.
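Because the equation image is missing here, the exact form of Equation 1 cannot be quoted. The sketch below assumes the usual running weighted average, which matches the summary's description of step (b) as a weighted-average update of the voxel's stored probabilities. The function name `fuse_class_probs` and its argument names are hypothetical.

```python
def fuse_class_probs(voxel_probs, voxel_weight, pixel_probs, pixel_weight):
    """Merge one pixel's class probabilities into a voxel by weighted average.

    voxel_probs, voxel_weight: probabilities and confidence accumulated so far
    pixel_probs, pixel_weight: probabilities and confidence from the current frame
    The running-average form is an assumption, not the patent's printed equation.
    """
    if len(voxel_probs) != len(pixel_probs):
        raise ValueError("probability vectors must have one entry per class")
    total = voxel_weight + pixel_weight
    if total <= 0.0:
        # Nothing trustworthy to merge; leave the voxel unchanged.
        return list(voxel_probs), voxel_weight
    fused = [
        (voxel_weight * pv + pixel_weight * pp) / total
        for pv, pp in zip(voxel_probs, pixel_probs)
    ]
    return fused, total
```

With this form, if both inputs sum to 1 the fused vector also sums to 1, so no renormalisation is needed after each frame.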

If the class probabilities of the pixels of every frame are fused with the same weight, inaccurate results can still occur. To mitigate this, the present invention uses the adaptive weight of Equation 2.

[Equation 2: not reproduced in this text]

Equation 2 combines two terms: the depth-based accuracy weight and the foreground/background boundary misalignment weight. The corresponding weight maps are shown in FIG. 2.

In general, a CNN (convolutional neural network) has a fixed receptive field size, so the accuracy of semantic segmentation varies with the size of an object in the image. The depth-based accuracy weight reflects this.

To compute the depth-based accuracy weight, the performance of the 2D semantic segmentation results of RDFNet (see [Reference 4]), the deep-learning method used in the present invention, must be measured. First, the depth and color images of the training set are fed to RDFNet to estimate semantic segmentation results. Each pixel of a segmentation result stores the estimated class (one of 20 classes), and comparing it against the ground truth gives the average performance of the segmentation as a function of depth. Plotted, this is the blue solid line in FIG. 3. This graph is defined over discrete depth values, so a fourth-order polynomial is fitted to obtain a denoised weight curve over continuous depth values. The result is shown as the yellow solid line in FIG. 2. The depth-based accuracy weight is determined by this fitted polynomial.
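The fitting step can be sketched as an ordinary least-squares fit of a fourth-order polynomial to (depth, accuracy) samples. The code below is a dependency-free illustration under assumptions: the function names are hypothetical, the normal-equation solver is a stand-in for whatever fitting routine was actually used, and a library routine such as `numpy.polyfit` would normally replace it.

```python
def fit_polynomial(xs, ys, degree=4):
    """Least-squares polynomial fit; returns c with y ~ c[0] + c[1]*x + ..."""
    n = degree + 1
    # Normal equations (A^T A) c = A^T y for the Vandermonde matrix A.
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    aug = [row + [rhs] for row, rhs in zip(ata, aty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (aug[i][n] - known) / aug[i][i]
    return coeffs


def depth_weight(coeffs, depth):
    """Evaluate the fitted polynomial at a depth using Horner's rule."""
    value = 0.0
    for c in reversed(coeffs):
        value = value * depth + c
    return value
```

Here `xs` would hold the discrete depth values and `ys` the measured average accuracy at each depth; `depth_weight` then gives a smooth weight for any continuous depth.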

The semantic segmentation result depends mainly on the color image, with the depth image used as a supplement. However, even when an RGBD camera is well calibrated, some misalignment between the color and depth images remains. This misalignment is most severe between the foreground (objects) and the background (mainly walls and floors), and it leads to incorrect labels in voxels near the boundaries. To alleviate this problem, the present invention detects edges in the depth image and generates a weight based on them.

Edges are detected and the weight is determined as follows. A 7x7 window is placed on each pixel; if the deviation within the window is larger than a predetermined constant, the pixel is detected as an edge, and the weight is determined by Equation 3.

[Equation 3: not reproduced in this text]

Here, the symbols denote, respectively, the depth value of the current pixel and the minimum and maximum depth values within the window centered on that pixel; the two remaining symbols are positive constants.
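The edge test can be sketched as follows. Since Equation 3 is not reproduced, this example stops at the edge mask and does not attempt the weight itself. It assumes that "deviation" means the spread between the maximum and minimum depth in the window; the function name and the default threshold of 0.1 are hypothetical.

```python
def depth_edge_mask(depth, window=7, threshold=0.1):
    """Mark pixels whose window max-min depth spread exceeds a threshold.

    depth: 2D list of depth values (rows of equal length)
    window: side length of the square window (7 in the text)
    threshold: assumed constant; the text does not give its value
    """
    rows, cols = len(depth), len(depth[0])
    half = window // 2
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Clamp the window to the image borders.
            r0, r1 = max(0, r - half), min(rows, r + half + 1)
            c0, c1 = max(0, c - half), min(cols, c + half + 1)
            values = [depth[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            d_min, d_max = min(values), max(values)
            mask[r][c] = (d_max - d_min) > threshold
    return mask
```

The per-window minimum and maximum computed here are the same quantities the text lists as inputs to Equation 3.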

When fusion is finished, each voxel holds 20 probabilities fused through Equation 1. Since the scene is currently represented as a voxel grid, the Marching Cubes algorithm is needed to convert it into the mesh form commonly used in graphics. Ordinarily a voxel holds a TSDF value or a color value; in the present invention there are 20 probabilities, so the probability values are bilinearly interpolated during the Marching Cubes process to determine the 20 probability values of each mesh vertex.
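In Marching Cubes, each mesh vertex lies on a cube edge between two voxel corners whose TSDF values have opposite signs. The sketch below interpolates the class probabilities linearly along one such edge, using the same interpolation factor as the vertex position. This is a simplification chosen for illustration: the text speaks of bilinear interpolation without giving details, and the function name is hypothetical.

```python
def interpolate_vertex_probs(tsdf_a, tsdf_b, probs_a, probs_b):
    """Class probabilities at the zero crossing between two voxel corners.

    tsdf_a, tsdf_b: TSDF values at the two ends of a cube edge
    probs_a, probs_b: class probability vectors stored at those corners
    """
    if tsdf_a * tsdf_b > 0:
        raise ValueError("the surface does not cross this edge")
    if tsdf_a == tsdf_b:
        t = 0.5  # both corners lie on the surface; take the midpoint
    else:
        # Fraction of the way from corner a to corner b where the TSDF is zero.
        t = tsdf_a / (tsdf_a - tsdf_b)
    return [(1.0 - t) * pa + t * pb for pa, pb in zip(probs_a, probs_b)]
```

Because the result is a convex combination of the two corner vectors, it remains a valid probability vector whenever both inputs are.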

Once a mesh carrying the 20 class probabilities has been extracted through the preceding process, probability maps for the main classes can be obtained as on the left of FIG. 5, and the final semantic segmentation result is obtained by selecting, for each vertex, the class with the highest probability (see the right of FIG. 5).

FIG. 5 shows semantic segmentation results according to the present invention for an indoor environment reconstruction.

Meanwhile, the embodiments of the present invention described above may be recorded on a medium used in general-purpose computers, including personal computers. Such media include magnetic recording media (e.g., ROM, floppy disks, hard disks), optical media (e.g., CD-ROM, DVD), and electrical recording media (e.g., flash memory, memory sticks).

The present invention has been described above with a focus on its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that it can be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered illustrative rather than limiting. The scope of the present invention is set out in the claims rather than in the foregoing description, and all differences within an equivalent scope should be construed as being included in the present invention.

Claims

1. A method for semantic segmentation of a 3D reconstructed model through incremental fusion of 2D semantic segmentation information, comprising:
(a) for multiple 2D input images, performing deep-learning-based per-pixel semantic segmentation using the color image (RGB) of each input image and the corresponding depth image (Depth) to obtain, for each pixel, probability information over object classes;
(b) updating a voxel grid with the probability information obtained for each pixel by raycasting;
(c) extracting a mesh model from the voxel grid by the Marching Cubes algorithm; and
(d) performing semantic segmentation of the 3D reconstructed model by selecting, for each vertex of the mesh model, the class with the highest probability,
wherein the probability of each vertex of the mesh model in step (c) is determined through bilinear interpolation.

2. The method of claim 1, wherein in step (b) a weight determined according to the object's distance from the camera and the foreground/background boundaries is applied to each pixel's per-class probability information, which is merged by weighted averaging into the per-class probability information stored in the voxel corresponding to that pixel.

3. (Deleted)

4. The method of claim 1, wherein the pixel probability information, the weights, and the vertex probabilities are each vectors whose dimension equals the number of object classes.

5. The method of claim 2, wherein the class probability of each voxel integrated up to the t-th frame in step (b) is calculated by an equation [not reproduced in this text] that combines the class probability and confidence weight of the voxel integrated up to the (t-1)-th frame with the class probability and confidence weight of pixel p in the t-th frame, and
the confidence weight is calculated by an equation [not reproduced in this text] from a depth-based accuracy weight and a foreground/background boundary misalignment weight.

6. The method of claim 5, wherein the foreground/background boundary misalignment weight is calculated by an equation [not reproduced in this text] from the depth value of the current pixel, the minimum and maximum depth values within a window centered on that pixel, and two positive constants.

7. A computer-readable recording medium recording a program for executing, on a computer, the method of any one of claims 1, 2, and 4 to 6.

8. A computer program stored in a medium to execute, on a computer, the method of any one of claims 1, 2, and 4 to 6.