KR101970488B1

KR101970488B1 - RGB-D Multi-layer Residual Feature Fusion Network for Indoor Semantic Segmentation

Info

Publication number: KR101970488B1
Application number: KR1020170182798A
Authority: KR
Inventors: 박성진; 홍기상; 이승용
Original assignee: 포항공과대학교 산학협력단
Priority date: 2017-12-28
Filing date: 2017-12-28
Publication date: 2019-04-19

Abstract

The present invention relates to a network apparatus that performs deep learning-based pixel-wise semantic segmentation using a single indoor color image and its corresponding depth image. According to the present invention, the network apparatus comprises: a first residual network module composed of a plurality of stages, which run from an upper stage to a lower stage and each of which is a convolutional neural network (CNN), to extract stage-wise feature information from the color image; a second residual network module composed of a plurality of stages, which run from an upper stage to a lower stage and each of which is a CNN, to extract stage-wise feature information from the depth image; and a multimodal feature fusion network (MMFNet) module that fuses the feature information extracted at each stage of the first and second residual network modules. At each stage, for each of the color and depth feature information extracted at the corresponding stage of the first and second residual network modules, the MMFNet module connects in sequence: a convolution block that reduces the dimension of the feature information to mitigate a rapid increase in parameters; two residual convolution units that perform nonlinear transformations for modality fusion; and a convolution block that adaptively combines the feature information of the different modalities and adjusts the scale of the feature values for summation. The scaled color feature information and the scaled depth feature information are then combined by summation. The present invention effectively extracts and simultaneously combines feature information of various dimensions, so that the feature information can be learned efficiently.

Description

Technical Field

The present invention relates to color-depth (RGB-D) images, and more particularly to an RGB-D multi-layer residual feature fusion network for indoor semantic segmentation.

The present invention relates to a network apparatus that performs deep learning-based pixel-wise semantic segmentation using a single indoor color image (RGB) and its corresponding depth image (Depth).

Pixel-wise semantic segmentation of an image is a technique that can underlie many image processing techniques, and it can provide particularly important image information in fields such as autonomous driving.

A state-of-the-art technique for semantic segmentation of general color images is RefineNet (see FIG. 1). Building on Kaiming He's "Deep Residual Learning for Image Recognition" (Residual Network), which was used in existing image classification techniques, RefineNet designs a network module that extracts features of various dimensions from an image and then combines them step by step to obtain sharper and more accurate semantic segmentation.

In FIG. 1, (a) is the network structure for semantic segmentation, (b) is a detailed diagram of the RefineNet block, and (c) to (e) are detailed diagrams of the sub-modules of RefineNet. In FIG. 1, Conv denotes convolution, ReLU denotes the Rectified Linear Unit activation function, and Pool denotes a pooling operation.

Indoor images, however, generally have high complexity, which makes good recognition performance difficult to achieve. On the other hand, depth images are easy to obtain from devices such as Microsoft Kinect, and using them together with color images can improve accuracy.

Referring to FIG. 2, existing deep learning-based techniques that use a color image and a depth image together are as follows. In FIG. 2, (a) is a method that simply concatenates the color image and the depth image as the input of a network module; (b) is a method that extracts features from each image and then simply combines them at the end of the network module; (c) diagrams the structure of J. Wang's "Learning common and specific features for rgb-d semantic segmentation with deconvolutional networks"; and (d) diagrams the structure of C. Hazirbas's "Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture". Here, 'C', 'T', and '+' denote concatenation, transformation, and element-wise summation, respectively.
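The difference between the 'C' and '+' operators of FIG. 2 can be illustrated with a minimal sketch. The feature vectors and function names below are hypothetical and chosen only for illustration; they stand for the channel values of one pixel in each modality.

```python
# Two hypothetical per-pixel feature vectors (values are made up for illustration).
rgb_feat = [0.2, 0.5, 0.1]
depth_feat = [0.7, 0.0, 0.3]


def concatenate(a, b):
    """'C' in FIG. 2: stack the channels, so the channel count grows."""
    return a + b


def elementwise_sum(a, b):
    """'+' in FIG. 2: add channel by channel, so both inputs need the same length."""
    if len(a) != len(b):
        raise ValueError("element-wise summation needs equal channel counts")
    return [x + y for x, y in zip(a, b)]


concat = concatenate(rgb_feat, depth_feat)
summed = elementwise_sum(rgb_feat, depth_feat)
print(len(concat), len(summed))
```

Concatenation doubles the channel count that later layers must process, whereas summation keeps it fixed but requires the two inputs to have matching shapes and comparable scales.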

J. Wang's network module first extracts features to some extent from each of the color image and the depth image. At a specific layer, these are divided into features common to both images and features specific to each image, passed through the network module again, and then combined to obtain the final segmentation result. C. Hazirbas's network extracts the features of each image in the encoder part at the front of the network module while sequentially adding the features together, and passes the resulting feature map through a deconvolution network module to obtain the final segmentation result.

With these networks, it is difficult to make full use of the feature information of the multiple layers that can be extracted from each image through deep learning, and they have the drawback that the point at which the features are combined must be decided by hand. As a result, training a network module that can effectively extract the features of each image can also be difficult.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770-778, 2016.

An object of the present invention is to provide a multi-layer residual feature fusion network apparatus for color-depth images for indoor semantic segmentation, and a method of semantically segmenting indoor color-depth images using that apparatus, which, in recognizing an image through semantic segmentation from an indoor color image and a depth image, effectively extract and simultaneously combine feature information of various dimensions, so that the network can be trained efficiently and more accurate semantic segmentation results can be obtained.

To achieve the above object, a multi-layer residual feature fusion network apparatus for color-depth images for indoor semantic segmentation according to the present invention is a network apparatus that performs deep learning-based pixel-wise semantic segmentation using a single indoor color image (RGB) and its corresponding depth image (Depth), the apparatus comprising: a first residual network module composed of a plurality of stages, which run from an upper stage to a lower stage and each of which is a convolutional neural network (CNN), to extract stage-wise feature information from the color image; a second residual network module composed of a plurality of stages, which run from an upper stage to a lower stage and each of which is a CNN, to extract stage-wise feature information from the depth image; and a multi-modal feature fusion network (MMFNet) module that fuses the feature information extracted at each stage of the first residual network module and the second residual network module. At each stage, for each of the color feature information and the depth feature information extracted at the corresponding stage of the first and second residual network modules, the MMFNet module connects in sequence: a convolution block that reduces the dimension of the feature information to mitigate a rapid increase in parameters; two residual convolution units that perform nonlinear transformations for modality fusion; and a convolution block that adaptively combines the feature information of the different modalities and adjusts the scale of the feature values for summation. The scaled color feature information and depth feature information are then combined by summation.
The multi-layer residual feature fusion network apparatus for color-depth images for indoor semantic segmentation further comprises a residual pooling block that incorporates contextual information into the feature information in which the color feature information and the depth feature information are combined.

The multi-layer residual feature fusion network apparatus for color-depth images for indoor semantic segmentation further comprises a RefineNet that, at each stage, takes as input the refined feature information of the lower stage and the feature information fused at the corresponding stage of the multi-modal feature fusion network module, and refines it.

To achieve the above object, a method of semantically segmenting indoor color-depth images using the multi-layer residual feature fusion network apparatus according to the present invention is a method of performing deep learning-based pixel-wise semantic segmentation using a single indoor color image (RGB) and its corresponding depth image (Depth), the method comprising: (a) extracting stage-wise feature information from the color image, taken as input, by a first residual network module composed of a plurality of stages that run from an upper stage to a lower stage and each of which is a convolutional neural network (CNN); (b) extracting stage-wise feature information from the depth image, taken as input, by a second residual network module composed of a plurality of stages that run from an upper stage to a lower stage and each of which is a CNN; and (c) fusing, by a multi-modal feature fusion network module, the feature information extracted at each stage of the first residual network module and the second residual network module. In step (c), at each stage and for each of the color feature information and the depth feature information extracted at the corresponding stage of the first and second residual network modules, the multi-modal feature fusion network module reduces the dimension of the feature information to mitigate a rapid increase in parameters; performs nonlinear transformations for modality fusion; and adaptively combines the feature information of the different modalities and adjusts the scale of the feature values for summation. The scaled color feature information and depth feature information are then combined by summation.

In the method of semantically segmenting indoor color-depth images using the multi-layer residual feature fusion network apparatus, step (c) further comprises incorporating contextual information into the feature information in which the color feature information and the depth feature information are combined.

The method of semantically segmenting indoor color-depth images using the multi-layer residual feature fusion network apparatus further comprises: (d) refining, by a RefineNet at each stage, the refined feature information of the lower stage and the feature information fused at the corresponding stage of the multi-modal feature fusion network, both taken as input.

To achieve the above object, a computer-readable recording medium according to the present invention records a program for executing on a computer the method of semantically segmenting indoor color-depth images using the multi-layer residual feature fusion network apparatus.

To achieve the above object, a computer program according to the present invention is stored in a medium in order to execute on a computer the method of semantically segmenting indoor color-depth images using the multi-layer residual feature fusion network apparatus.

According to the present invention, for indoor semantic segmentation, the network not only extracts, fuses, and refines the features of color-depth images in sequence, but also allows its parameters to be learned automatically during training in a direction in which the whole process works effectively.

FIG. 1 is a structural diagram showing the structure of RefineNet for semantic segmentation of color images.
FIG. 2 shows the structures of existing deep learning-based techniques that use a color image and a depth image together.
FIG. 3 shows the overall structure of the multi-layer residual feature fusion network apparatus for color-depth images for indoor semantic segmentation.
FIG. 4 shows a general CNN structure with two layers (a) and a residual network structure in which a skip connection is added across the two layers (b).
FIG. 5 is a structural diagram detailing the part corresponding to MMFNet in FIG. 3.
FIG. 6 shows qualitative results of semantic segmentation.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

The present invention can be divided into a part that extracts stage-wise feature information from the color (RGB) image and the depth image (Step 1), a part that effectively fuses the extracted feature information at each stage (Step 2), and a part that progressively combines the stage-wise fused feature information again to refine it into sharper feature information (Step 3).

The semantic segmentation result for the indoor image is obtained from the final feature map of the network constructed in this way.

The overall structure of the multi-layer residual feature fusion network apparatus for color-depth images for indoor semantic segmentation according to the present invention (named RDFNet) is shown in FIG. 3.

The main issue in semantic segmentation of color-depth images is to extract depth features efficiently alongside color features and to exploit those features for the target task of semantic segmentation. Whereas RefineNet presented a means of combining features of different levels for color images only, RDFNet according to the present invention introduces a simple structure for multi-modal CNN feature fusion for color-depth images that retains the benefits of skip connections.

RDFNet handles multi-modal CNN feature fusion and includes RefineNet blocks to refine the fused features. Unlike existing networks that use depth information, RDFNet has an additional deep depth-feature path based on ResNet and is designed to fully exploit multi-level depth feature information through the MMF blocks.

<Step 1 (ResNet)>

In FIG. 3, the network modules labeled CNN (Convolutional Neural Network) and connected stage by stage are the network modules that extract stage-wise feature information from the color image and the depth image. The present invention uses the structure of Kaiming He's Residual Network (ResNet) module, which recently showed the best performance in image classification, and trains it so that parameters pre-trained on an image classification dataset are adapted to the purpose of semantic image segmentation. Accordingly, the feature-extracting network module in the present invention consists of a first residual network module that extracts stage-wise feature information from the color image and a second residual network module that extracts stage-wise feature information from the depth image.

Conventional CNN structures are trained using backpropagation, and the deeper the CNN, the more the gradient at the next stage can shrink exponentially depending on the gradient at the preceding stage. As a result, gradients near the output stage have meaningful values, but near the input stage they approach zero, so training no longer makes progress (the vanishing gradient problem).
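The exponential shrinkage described above can be reproduced with a toy calculation. The sketch below assumes a chain of scalar layers of the form h -> tanh(weight * h); this layer form, the weight value, and the function name are illustrative assumptions and are not taken from the patent.

```python
import math


def chain_gradient(x, weight, depth):
    """Return d(output)/d(input) for `depth` stacked scalar layers h -> tanh(weight * h)."""
    grad = 1.0
    h = x
    for _ in range(depth):
        h = math.tanh(weight * h)
        # Local derivative of tanh(weight * h_prev) with respect to h_prev,
        # written in terms of the layer output h.
        grad *= weight * (1.0 - h * h)
    return grad


shallow = chain_gradient(0.5, 0.5, 2)
deep = chain_gradient(0.5, 0.5, 20)
print(shallow, deep)
```

Each local derivative here is at most 0.5, so the gradient through 20 layers is bounded by 0.5 to the power 20, which is why the layers nearest the input receive almost no training signal.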

The residual network module is a structure that solves this vanishing gradient problem and thereby enables effective training of deep network structures. Using a structure called a skip connection, this network module computes its result as the sum of the output obtained by passing the input through several layers (F(x) in FIG. 4) and the input itself (identity in FIG. 4).

FIG. 4(a) is the conventional general network structure; with this structure, the mapping obtained after passing through two layers can be written as H(x). FIG. 4(b) is the residual network module structure, which is characterized by a skip connection that skips over two layers. Here, H(x) can be expressed as H(x) = F(x) + x. Because this configuration keeps the gradient from becoming too small when it is computed, it can solve the vanishing gradient problem that occurred in conventional CNNs. Consequently, models with more layers can be trained well, and high performance can be obtained even from models with fewer layers.
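As a rough illustration of why the skip connection helps, the following sketch compares the gradient of a two-layer mapping F(x) with that of the residual form F(x) + x. The scalar layers, the weight values, and the function names are hypothetical; a real residual block operates on feature maps with convolutions.

```python
def two_layer_f(x, w1, w2):
    """F(x): two weighted layers with a ReLU in between (scalar toy version)."""
    return w2 * max(0.0, w1 * x)


def residual_block(x, w1, w2):
    """F(x) + x: the skip connection adds the input back to F(x)."""
    return two_layer_f(x, w1, w2) + x


def numeric_grad(fn, x, eps=1e-6):
    """Central-difference estimate of d fn / d x."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


w1, w2 = 0.01, 0.01  # deliberately small weights
g_plain = numeric_grad(lambda v: two_layer_f(v, w1, w2), 1.0)
g_res = numeric_grad(lambda v: residual_block(v, w1, w2), 1.0)
print(g_plain, g_res)
```

With small weights the gradient of F(x) alone is about w1 * w2, while the residual form adds 1 from the identity path, so the signal reaching earlier layers does not collapse.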

<Step 2 (MMFNet)>

In FIG. 3, the part labeled as the Multi-Modal Feature Fusion Network (MMFNet) module fuses the feature information extracted at each stage of the first residual network module and the second residual network module. Simply concatenating or adding the features extracted from the color image and the depth image does not exploit the two sets of features effectively enough. This network module therefore takes a structure that has proven effective for combining multi-stage features in color images and adapts it to the fusion of features extracted from different images. In addition, whereas existing techniques combine the features extracted from each image at one specific stage of the network, the present invention designs the network module so that most of the intermediate-stage features obtainable in Step 1 can be used. With this design, the features important for semantic segmentation are extracted from each image and fused automatically in the course of training.

The detailed configuration of MMFNet is shown in FIG. 5.

Given the feature information from the first residual network module and the second residual network module for color (RGB) and depth, MMFNet first reduces the dimension of each feature through one convolution block, which mitigates the rapid increase in parameters and enables efficient training.
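The effect of the dimension-reducing convolution on the parameter count can be checked with simple arithmetic. The channel counts (2048 and 256) and kernel sizes (1x1 for the reduction, 3x3 for the following convolution) in this sketch are assumptions made for illustration; the text above does not specify them.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of a kernel x kernel convolution layer."""
    return in_ch * out_ch * kernel * kernel + out_ch


# Hypothetical sizes: a 2048-channel backbone feature reduced to 256 channels.
backbone_ch, reduced_ch = 2048, 256

# A 3x3 convolution applied directly at full channel width.
direct = conv_params(backbone_ch, backbone_ch, 3)

# A 1x1 reduction followed by a 3x3 convolution at the reduced width.
reduced = conv_params(backbone_ch, reduced_ch, 1) + conv_params(reduced_ch, reduced_ch, 3)

print(direct, reduced, direct / reduced)
```

Under these assumed sizes the reduced path needs roughly 1.1 million parameters instead of about 37.8 million, and every later block in the fusion path inherits the smaller channel width.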

Next, each feature passes through two Residual Convolution Units (RCUs) and one convolution block, as in RefineNet. The purpose of the RCUs differs between MMFNet and RefineNet. The RCUs in MMFNet are intended to perform nonlinear transformations specifically for modality fusion. Through the operation of the RCUs, the two features from different modalities are combined complementarily so that they improve each other. In MMFNet, as in RefineNet, coarse high-level features are refined by bringing in the high resolution of lower-level features.

The additional convolution block that follows in MMFNet is important for adaptively combining features of different modalities and for appropriately adjusting the scale of the feature values for summation.

Since color features generally have better discriminative power than depth features in semantic segmentation, the summation fusion in MMFNet works to learn supplementary, or residual, depth features. The depth features here refine the RGB features so that confusing patterns can be distinguished. The importance of each modality's features can be controlled by the learnable parameters in the convolution block after the RCUs.
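The fusion path described so far (dimension reduction, two RCUs, a scaling convolution, then summation) can be sketched at a single pixel. This is a simplified illustration, not the patented implementation: every convolution is modeled as a 1x1 convolution (a matrix-vector product over channels), the RCU layout (ReLU, convolution, ReLU, convolution, plus skip) is assumed from the RefineNet design, and all weights and names are hypothetical. The residual pooling step is left out.

```python
def linear(vec, weights):
    """A 1x1 convolution at one pixel: a matrix-vector product over channels."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def relu(vec):
    return [max(0.0, v) for v in vec]


def rcu(vec, w_a, w_b):
    """Residual convolution unit: ReLU, conv, ReLU, conv, plus the input (skip connection)."""
    out = linear(relu(vec), w_a)
    out = linear(relu(out), w_b)
    return [o + v for o, v in zip(out, vec)]


def mmf_branch(vec, w_reduce, rcu_weights, w_scale):
    """One modality's path: dimension reduction, two RCUs, then a scaling convolution."""
    out = linear(vec, w_reduce)
    for w_a, w_b in rcu_weights:
        out = rcu(out, w_a, w_b)
    return linear(out, w_scale)


def mmf_fuse(rgb_vec, depth_vec, rgb_params, depth_params):
    """Sum the scaled color and depth features channel by channel."""
    rgb = mmf_branch(rgb_vec, *rgb_params)
    depth = mmf_branch(depth_vec, *depth_params)
    return [r + d for r, d in zip(rgb, depth)]


# Toy weights: reduce 4 channels to 2, RCUs with zero weights (identity), depth scaled by 0.5.
reduce_w = [[1, 0, 0, 0], [0, 1, 0, 0]]
zero = [[0, 0], [0, 0]]
rgb_params = (reduce_w, [(zero, zero), (zero, zero)], [[1, 0], [0, 1]])
depth_params = (reduce_w, [(zero, zero), (zero, zero)], [[0.5, 0], [0, 0.5]])

fused = mmf_fuse([1.0, 2.0, 3.0, 4.0], [4.0, 2.0, 0.0, 0.0], rgb_params, depth_params)
print(fused)
```

In this toy setting the scaling convolution of the depth branch plays the role of the learnable importance weight: a smaller scale lets the depth features act as a correction added to the color features.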

Finally, MMFNet performs an additional residual pooling operation so that contextual information is incorporated into the fused feature information. One residual pooling per stage of MMFNet is sufficient. Stronger contextual information can be further incorporated in the subsequent multi-level fusion through the RefineNet blocks.

By introducing residual learning with skip connections through all stages, MMFNet enables efficient multi-level color (RGB) and depth feature extraction and efficient end-to-end training.

<Step 3 (RefineNet)>

In FIG. 3, the part labeled RefineNet takes as input the refined feature information of the lower stage and the feature information fused at the corresponding stage of MMFNet, and progressively refines it to obtain a high-resolution semantic segmentation result. Because the features obtained at each stage carry information of different resolutions and dimensions, it is important to combine them well step by step.

The RefineNet in FIG. 3 can use the structure of the prior-art RefineNet as is. The difference is that, instead of features drawn from a color image alone, the features effectively extracted and fused from the color image and the depth image are fed into the network module as input.

The RefineNet in FIG. 3 takes as input the multi-level feature information extracted and fused from the color and depth images by MMFNet, together with the previously refined feature information. These features are refined and fused by the series of sub-components shown in FIG. 1(b): the residual convolution unit, the multi-resolution fusion block, and chained residual pooling.

The Residual Convolution Unit (RCU) shown in FIG. 1(c) is a set of adaptive convolutions that fine-tunes the pre-trained ResNet weights for semantic segmentation.

The Multi-resolution Fusion Block shown in FIG. 1(d) fuses the multi-path inputs into a high-resolution feature map. One convolution in the block serves input adaptation: it matches the number of feature channels and rescales the feature values appropriately for summation.
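A minimal one-dimensional sketch of multi-resolution fusion follows. It assumes nearest-neighbour upsampling and replaces the input-adaptation convolution with a single scalar scale per input; both are simplifications chosen for illustration, and the function names are hypothetical.

```python
def upsample_nearest(row, factor):
    """Repeat each value `factor` times (nearest-neighbour upsampling in 1-D)."""
    return [v for v in row for _ in range(factor)]


def fuse_multires(high_res, low_res, scale_high=1.0, scale_low=1.0):
    """Rescale both inputs, bring the low-resolution map up to the high resolution, then sum."""
    factor = len(high_res) // len(low_res)
    if factor * len(low_res) != len(high_res):
        raise ValueError("high-res length must be a multiple of low-res length")
    low_up = upsample_nearest([scale_low * v for v in low_res], factor)
    return [scale_high * h + l for h, l in zip(high_res, low_up)]


fused_map = fuse_multires([1.0, 2.0, 3.0, 4.0], [10.0, 20.0])
print(fused_map)
```

The output keeps the resolution of the finer input, which is how the coarse but semantically strong features are carried up to full resolution.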

The purpose of Chained Residual Pooling (CRP), shown in FIG. 1(e), is to encode contextual information from a large region. The block consists of a chain of multiple pooling blocks, each composed of one max-pooling layer and one convolution layer. The pooling operation has the effect of spreading large activation values, accessed from nearby locations, as contextual features. The additional convolution layer learns the importance of the pooled features, which are combined with the original features through a residual connection.
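The chaining and the residual summation can be sketched in one dimension. In this illustration the pooling window size, the use of a single scalar weight in place of each convolution layer, and the function names are assumptions; only the overall pattern (pool, weight, feed the next block, add to the original features) follows the description above.

```python
def max_pool_same(row, window=3):
    """Stride-1 max pooling that keeps the length (the window is clipped at the borders)."""
    half = window // 2
    return [max(row[max(0, i - half): i + half + 1]) for i in range(len(row))]


def chained_residual_pooling(row, conv_weights, window=3):
    """Chain of pooling blocks; each block feeds the next and is added to the running sum."""
    result = list(row)       # residual path starts from the original features
    current = list(row)
    for w in conv_weights:   # one scalar weight stands in for each block's convolution
        current = [w * v for v in max_pool_same(current, window)]
        result = [r + c for r, c in zip(result, current)]
    return result


out = chained_residual_pooling([0.0, 0.0, 1.0, 0.0, 0.0], conv_weights=[0.5, 0.5])
print(out)
```

A single strong activation in the middle spreads one position further with every block in the chain, so later blocks see an effectively larger region without using a large pooling window.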

At the end of RefineNet there is an additional RCU that applies nonlinear operations to the fused feature map.

In the present invention, the network module is trained with the back-propagation algorithm using stochastic gradient descent.
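The training rule can be illustrated on a toy problem. The sketch below fits a line y = w * x + b with per-sample stochastic gradient descent, computing the gradients by the chain rule as back-propagation does; the model, the data, the learning rate, and the function name are all illustrative assumptions and have nothing to do with the actual network or datasets.

```python
import random


def sgd_fit(samples, lr=0.1, epochs=200, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)              # 'stochastic': visit the samples in random order
        for x, y in data:
            err = (w * x + b) - y      # forward pass
            grad_w = 2.0 * err * x     # backward pass (chain rule)
            grad_b = 2.0 * err
            w -= lr * grad_w           # gradient descent update
            b -= lr * grad_b
    return w, b


w_fit, b_fit = sgd_fit([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
print(w_fit, b_fit)
```

The three samples lie exactly on y = 2x + 1, so the fitted parameters approach w = 2 and b = 1; in the real network the same update is applied to all convolution weights at once.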

<Evaluation Results>

In the present invention, the network module was trained on two indoor image datasets, NYUDv2 and SUN-RGBD, and the resulting semantic segmentation results were evaluated with three metrics.

The metrics used were pixel accuracy, which measures per-pixel classification accuracy; mean accuracy, the average of the per-class accuracies; and mean IoU, the average over classes of the overlap between the prediction and the ground-truth label.
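The three metrics can be computed from a confusion matrix, as in the following sketch. The label lists are made-up toy data, the function names are hypothetical, and skipping classes that never occur is one common convention assumed here rather than something stated in the text.

```python
def confusion_matrix(truth, pred, num_classes):
    """counts[t][p] = number of pixels with ground-truth class t predicted as class p."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        counts[t][p] += 1
    return counts


def segmentation_metrics(truth, pred, num_classes):
    counts = confusion_matrix(truth, pred, num_classes)
    total = sum(sum(row) for row in counts)
    correct = sum(counts[c][c] for c in range(num_classes))
    per_class_acc, per_class_iou = [], []
    for c in range(num_classes):
        gt = sum(counts[c])                                        # pixels labelled c
        predicted = sum(counts[t][c] for t in range(num_classes))  # pixels predicted as c
        union = gt + predicted - counts[c][c]
        if gt > 0:       # skip classes absent from the ground truth
            per_class_acc.append(counts[c][c] / gt)
        if union > 0:    # skip classes absent from both truth and prediction
            per_class_iou.append(counts[c][c] / union)
    return {
        "pixel_accuracy": correct / total,
        "mean_accuracy": sum(per_class_acc) / len(per_class_acc),
        "mean_iou": sum(per_class_iou) / len(per_class_iou),
    }


truth = [0, 0, 0, 1, 1, 2]   # flattened ground-truth labels (toy data)
pred = [0, 0, 1, 1, 1, 0]    # flattened predicted labels (toy data)
metrics = segmentation_metrics(truth, pred, 3)
print(metrics)
```

Pixel accuracy is dominated by frequent classes, whereas mean accuracy and mean IoU weight every class equally, and mean IoU additionally penalizes false positives.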

These figures are compared with the results of other major research groups in Tables 1 and 2.

Table 1 shows the results on the NYUDv2 dataset. As Table 1 shows, the present invention (RDF-152 (ours)) clearly outperforms previous work that uses both color and depth images (data: RGB-D). It also performs much better than RefineNet [26], which achieved good performance using color images only.

<Comparison of semantic segmentation results on the SUN-RGBD dataset>

Table 2 compares the results of the present invention with those of other research groups on the SUN-RGBD dataset. As in Table 1, the present invention clearly outperforms the previous work that uses color and depth images (FuseNet), and it achieves the best performance, exceeding RefineNet, which uses color images only.

The documents cited in Tables 1 and 2 are as follows.

[Document 11] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In Proc. ECCV, pages 345-360. Springer, 2014.

[Document 7] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proc. ICCV, pages 2650-2658, 2015.

[Document 30] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. CVPR, pages 3431-3440, 2015.

[Document 39] J. Wang, Z. Wang, D. Tao, S. See, and G. Wang. Learning common and specific features for RGB-D semantic segmentation with deconvolutional networks. In Proc. ECCV, pages 664-679. Springer, 2016.

[Document 27] G. Lin, C. Shen, A. v. d. Hengel, and I. Reid. Exploring context with deep structured models for semantic segmentation. arXiv preprint arXiv:1603.03183, 2016.

[Document 26] G. Lin, A. Milan, C. Shen, and I. Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, July 2017.

[Document 32] X. Ren, L. Bo, and D. Fox. RGB-(D) scene labeling: Features and algorithms. In Proc. CVPR, pages 2759-2766. IEEE, 2012.

[Document 19] A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015.

[Document 25] Z. Li, Y. Gan, X. Liang, Y. Yu, H. Cheng, and L. Lin. LSTM-CF: Unifying context modeling and fusion with LSTMs for RGB-D scene labeling. In Proc. ECCV, pages 541-557. Springer, 2016.

[Document 13] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers. FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In Proc. ACCV, volume 2, 2016.

[Document 28] G. Lin, C. Shen, A. van den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In Proc. CVPR, pages 3194-3203, 2016.

FIG. 6 shows qualitative results of semantic segmentation.

For each image, from left to right, the figure shows the input image, the ground-truth labels, the result of RefineNet using only the color image, and the result of the present invention. Different colors represent different object classes. The results confirm that the present invention yields much more accurate semantic segmentation than using the color image alone.

Meanwhile, the above-described embodiments of the present invention can be recorded on a medium used in a general-purpose computer, including a personal computer. The medium includes recording media such as magnetic recording media (e.g., ROM, floppy disk, hard disk), optical reading media (e.g., CD-ROM, DVD), and electrical recording media (e.g., flash memory, memory stick).

The present invention has been described above with reference to its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention can be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the appended claims rather than by the foregoing description, and all differences within the scope of equivalents thereof should be construed as being included in the present invention.

Claims

delete

A network apparatus for performing deep-learning-based pixel-wise semantic segmentation using a single indoor color image (RGB) and its depth image (Depth), the apparatus comprising:
a first residual network module, which includes a plurality of steps connected from an upper step to a lower step, each step being formed of a convolutional neural network (CNN), and which extracts stepwise feature information from the color image;
a second residual network module, which includes a plurality of steps connected from an upper step to a lower step, each step being formed of a CNN, and which extracts stepwise feature information from the depth image; and
a multimodal feature fusion network (MMFNet) module that fuses the feature information extracted at each step of the first and second residual network modules,
wherein the multimodal feature fusion network module comprises, for each step, applied to the color feature information and the depth feature information extracted at the corresponding steps of the first and second residual network modules and connected in sequence:
a convolution block that reduces the dimension of the feature information to mitigate a rapid increase in parameters;
two residual convolution units that perform a nonlinear transformation for shape combination; and
a convolution block that adaptively combines feature information of different types and adjusts the scale of the feature values for addition,
and wherein the scaled color feature information and the scaled depth feature information are combined by addition.

The apparatus of claim 3,
further comprising a residual pooling block that incorporates contextual information into the feature information in which the color feature information and the depth feature information are combined.

The apparatus of claim 3,
further comprising a refinement module that, for each step, takes as input the refined feature information of the lower step and the feature information combined at the corresponding step of the multimodal feature fusion network module, and refines it.

delete

A method of performing deep-learning-based pixel-wise semantic segmentation using a single indoor color image (RGB) and its depth image, the method comprising:
(a) extracting stepwise feature information by inputting the color image to a first residual network module, which includes a plurality of steps connected from an upper step to a lower step, each step being formed of a convolutional neural network (CNN);
(b) extracting stepwise feature information by inputting the depth image to a second residual network module, which includes a plurality of steps connected from an upper step to a lower step, each step being formed of a CNN; and
(c) fusing the feature information extracted at each step of the first residual network module and the second residual network module,
wherein step (c) comprises,
for each step, for the color feature information and the depth feature information extracted at the corresponding steps of the first and second residual network modules:
reducing the dimension of the feature information to mitigate a rapid increase in parameters;
performing a nonlinear transformation for shape combination; and
adaptively combining feature information of different types and adjusting the scale of the feature values for addition,
and wherein the scaled color feature information and the scaled depth feature information are combined by addition.

The method of claim 8, wherein step (c)
further comprises incorporating contextual information into the feature information in which the color feature information and the depth feature information are combined.

The method of claim 8, further comprising:
(d) for each step, performing refinement by inputting the refined feature information of the lower step and the feature information combined at the corresponding step of the multimodal feature fusion network module.

A computer-readable recording medium having recorded thereon a program for causing a computer to execute the method of semantic segmentation of indoor color-depth images using a multi-layer residual feature fusion network apparatus according to any one of claims 8 to 10.

A computer program stored on a medium for executing on a computer the method of semantic segmentation of indoor color-depth images using a multi-layer residual feature fusion network apparatus according to any one of claims 8 to 10.