KR20190119261A - Apparatus and method for segmenting of semantic image using fully convolutional neural network based on multi scale image and multi scale dilated convolution

Info

Publication number: KR20190119261A
Application number: KR1020180042468A
Authority: KR
Inventors: 이상웅; 보둑미
Original assignee: 가천대학교 산학협력단
Priority date: 2018-04-12
Filing date: 2018-04-12
Publication date: 2019-10-22
Also published as: KR102109372B1

Abstract

The present invention relates to an apparatus and method for semantic image segmentation using a fully convolutional neural network based on multi-scale images and multi-scale dilated convolution. A fully convolutional neural network that combines a cascade architecture of dilated convolutions with multi-scale image input makes it possible to segment meaningful parts from a given image quickly and accurately.

Description

APPARATUS AND METHOD FOR SEGMENTING OF SEMANTIC IMAGE USING FULLY CONVOLUTIONAL NEURAL NETWORK BASED ON MULTI SCALE IMAGE AND MULTI SCALE DILATED CONVOLUTION

The present invention relates to an apparatus and method for semantic image segmentation using a fully convolutional neural network based on multi-scale images and multi-scale dilated convolution. More particularly, it relates to an apparatus and method that segment meaningful parts from a given image quickly and accurately by means of a fully convolutional neural network that takes multi-scale images as input and incorporates a cascade architecture of dilated convolutions.

With the recent rapid progress of image processing technology, semantic image segmentation (semantic segmentation) technology has been advancing rapidly.

Semantic image segmentation classifies the meaningful parts of an image. It is applied in a variety of fields, such as autonomous vehicles and medicine, and has made notable contributions to scene understanding and object recognition.

Before deep learning algorithms were developed, most semantic image segmentation techniques relied heavily on conditional random fields. The conditional random field method is used to label each image pixel with one of several predefined object classes.

As a result, the conditional random field method makes it possible to recognize and segment multiple objects simultaneously. Most techniques that use conditional random fields address the semantic image segmentation problem by maximizing label agreement between neighboring pixels and by developing contextual information models for classifying the various object classes.

In general, a typical conditional random field model is computed from the unary potential of each pixel and the pairwise potentials between each pixel and its neighboring pixels.
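For reference, the unary-plus-pairwise structure described above is commonly written as an energy over the pixel labeling. The notation below is a general textbook form given as an illustration, not notation taken from this specification: x_i is the label of pixel i, \psi_u is the unary potential, \psi_p is the pairwise potential, and \mathcal{N}(i) is the assumed set of neighbors of pixel i.

```latex
E(\mathbf{x}) = \sum_{i} \psi_u(x_i) + \sum_{i} \sum_{j \in \mathcal{N}(i)} \psi_p(x_i, x_j)
```

Under this reading, labeling an image amounts to finding the assignment \mathbf{x} with the lowest energy.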

Moreover, the development of the deep convolutional neural network, a deep learning algorithm, made it possible to generate hierarchies of meaningful features automatically, bringing breakthroughs in object detection and image recognition.

In particular, because the fully convolutional neural network, one type of deep convolutional neural network, offers computational efficiency for dense prediction, much effort and research is under way to use it to improve the performance of semantic image segmentation.

Accordingly, many conventional methods for semantic image segmentation have recently been developed on the basis of fully convolutional neural networks.

Conventional methods for semantic image segmentation also include features selected from the pooling layers and sub-sampling layers of the convolutional neural network, and these features can be expanded to cover a receptive field in the original image.

However, conventional semantic image segmentation methods based on convolutional neural networks reduce the resolution of the original image and lose detail features and local features of the image, so the accuracy of the semantic image segmentation performed on the original image is very low.

Accordingly, the present invention proposes a fully convolutional neural network that has a cascade architecture of dilated convolutions and a deep network architecture with multi-scale image input. By overcoming the drawback of conventional semantic image segmentation methods, namely the reduced resolution of the original image, it extracts features from multi-scale images efficiently while suppressing growth in the number of network training parameters. The invention thus aims to provide an apparatus and method for semantic image segmentation, using a fully convolutional neural network based on multi-scale images and multi-scale dilated convolution, that perform semantic image segmentation quickly and accurately.

Next, the prior art in the technical field of the present invention is described briefly, followed by the technical matters that distinguish the present invention from that prior art.

First, the non-patent document on learning a deconvolution network for semantic segmentation (see arXiv:1505.04366) uses up-convolutional layers to overcome the problems of conventional semantic segmentation methods. It recovers information lost through down-sampling in the pooling layers of the convolutional network, so that semantic segmentation can be performed comparatively well.

However, this prior art recovers only part of the lost information rather than all of it, so the accuracy of semantic image segmentation is markedly reduced.

In contrast, the present invention combines a dilated convolution cascade architecture with a deep network architecture for multi-scale image input. In the course of learning multi-scale images for semantic image segmentation, it efficiently recovers the reduced resolution of the original image and the features of the original image that would otherwise be lost, so that semantic image segmentation can be performed accurately. The prior art neither describes nor suggests these technical features of the present invention.

In addition, the non-patent document on semantic image segmentation with deep convolutional networks and fully connected conditional random fields (see ICLR) omits the down-sampling operations in some layers of the convolutional network in order to prevent the resolution of the original image from being reduced, and performs semantic image segmentation by extracting denser feature maps.

However, this prior art has the problem that the training process for semantic image segmentation requires a very high computational cost because of the enormous number of input parameters.

In contrast, the present invention uses the dilated convolution cascade architecture to extract features from multi-scale images efficiently without increasing the number of training parameters, so that semantic image segmentation of the original image can be performed accurately and quickly. The prior art neither presents nor hints at any configuration corresponding to these technical features of the present invention.

The present invention was devised to solve the problems described above. Its object is to provide an apparatus and method for semantic image segmentation, using a fully convolutional neural network based on multi-scale images and multi-scale dilated convolution, in which multi-scale images are learned by a fully convolutional neural network composed of at least one sub-network that takes the multi-scale images as input, so that semantic image segmentation for classifying the meaningful parts of an original image can be performed quickly and accurately.

Another object of the present invention is to provide such an apparatus and method in which dilated convolution is applied to the fully convolutional neural network to prevent an increase in training parameters and to effectively recover the spatial resolution that is reduced while learning the multi-scale images, thereby markedly improving the accuracy of semantic image segmentation.

Yet another object of the present invention is to provide such an apparatus and method in which a fully connected conditional random field model is applied to the fully convolutional neural network so that object identification and pixel-level object localization in the multi-scale images can be carried out effectively, thereby markedly improving the performance of semantic image segmentation.

A semantic image segmentation apparatus according to an embodiment of the present invention includes a preprocessing unit that preprocesses training data to generate multi-scale images of the original images of the training data, and a learning model generation unit that learns the generated multi-scale images to generate a learning model for semantic image segmentation. The learning is performed by a fully convolutional neural network that includes a plurality of sub-networks, each composed of a convolutional neural network taking a multi-scale image as input, and a multi-scale dilated convolution cascade architecture.

The semantic image segmentation apparatus also restores, through the multi-scale dilated convolution cascade architecture, the resolution of the feature map output from each sub-network to generate high-resolution final feature maps, and rescales the generated final feature maps to the same resolution.

The multi-scale dilated convolution cascade architecture includes a plurality of dilated convolution layers that are connected in cascade and perform dilated convolution. The dilated convolution layers have different rates, and each layer has a larger rate than the preceding dilated convolution layer.
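The effect of a cascade with increasing rates can be illustrated with a short calculation. The sketch below is a minimal illustration, assuming stride-1 layers with a square kernel; the kernel size of 3 and the rates 1, 2, 4 are hypothetical values chosen for the example, not values stated in this passage. It shows that the receptive field grows with the rates while the number of weights per filter stays fixed.

```python
def cascade_receptive_field(kernel_size, rates):
    """Receptive field (pixels per axis) of stride-1 dilated convolutions applied in sequence."""
    # The passage requires every layer to use a larger rate than the one before it.
    if any(later <= earlier for earlier, later in zip(rates, rates[1:])):
        raise ValueError("each layer must use a larger rate than the previous one")
    field = 1
    for rate in rates:
        # A dilated kernel spans (kernel_size - 1) * rate + 1 pixels per axis.
        field += (kernel_size - 1) * rate
    return field


def weights_per_filter(kernel_size):
    """Weights in one single-channel 2-D filter; the dilation rate does not change this."""
    return kernel_size * kernel_size


hypothetical_rates = [1, 2, 4]  # assumed for illustration only
print(cascade_receptive_field(3, hypothetical_rates))  # 15
print(cascade_receptive_field(3, [1]))                 # 3
print(weights_per_filter(3))                           # 9
```

With these assumed values, three cascaded 3x3 layers cover a 15x15 window, whereas each filter still holds only nine weights per input channel.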

The fully convolutional neural network further includes a maxout layer that integrates all the feature maps output from the multi-scale dilated convolution cascade architecture to extract object boundaries, and a fully connected conditional random field model for recovering the extracted object boundaries.
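The integration step can be pictured with a small example. The sketch below assumes the conventional meaning of maxout, an element-wise maximum across feature maps that already share one resolution; this passage does not spell out the operation, so that reading is an assumption, and the function name `maxout` and the sample values are illustrative.

```python
def maxout(feature_maps):
    """Element-wise maximum over a list of equally sized 2-D feature maps."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    for fmap in feature_maps:
        if len(fmap) != height or any(len(row) != width for row in fmap):
            raise ValueError("all feature maps must share one resolution")
    # For every position keep the strongest response among the input maps.
    return [
        [max(fmap[y][x] for fmap in feature_maps) for x in range(width)]
        for y in range(height)
    ]


# Two hypothetical 2x2 feature maps, e.g. from two different input scales.
map_a = [[1, 5], [3, 0]]
map_b = [[2, 4], [1, 7]]
print(maxout([map_a, map_b]))  # [[2, 5], [3, 7]]
```

Because the maximum is taken per position, the maps must first be brought to the same resolution, which matches the rescaling of the final feature maps described earlier.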

The semantic image segmentation apparatus further includes a semantic image segmentation unit that, when a specific image is input for semantic image segmentation, controls the preprocessing unit to generate multi-scale images of the specific image and applies the generated multi-scale images to the learning model for semantic image segmentation, thereby performing semantic image segmentation on the specific image.

In addition, a semantic image segmentation method according to an embodiment of the present invention includes a preprocessing step of preprocessing training data to generate multi-scale images of the original images of the training data, and a learning model generation step of learning the generated multi-scale images to generate a learning model for semantic image segmentation. The learning is performed by a fully convolutional neural network that includes a plurality of sub-networks, each composed of a convolutional neural network taking a multi-scale image as input, and a multi-scale dilated convolution cascade architecture.

The semantic image segmentation method also restores, through the multi-scale dilated convolution cascade architecture, the resolution of the feature map output from each sub-network to generate high-resolution final feature maps, and rescales the generated final feature maps to the same resolution.

The semantic image segmentation method further includes a semantic image segmentation step of, when a specific image is input for semantic image segmentation, controlling the preprocessing unit to generate multi-scale images of the specific image and applying the generated multi-scale images to the learning model for semantic image segmentation, thereby performing semantic image segmentation on the specific image.

As described above, according to the apparatus and method for semantic image segmentation using a fully convolutional neural network based on multi-scale images and multi-scale dilated convolution of the present invention, a deep convolutional neural network that combines the multi-scale dilated convolution cascade architecture with a deep network architecture for multi-scale image input efficiently extracts multi-scale features from multi-scale images, recovers the reduced resolution of the multi-scale images, and at the same time suppresses growth in the number of training parameters, so that semantic image segmentation can be performed quickly and accurately.

Furthermore, by applying a fully connected conditional random field model to the deep convolutional neural network, the present invention effectively removes isolated false positives and improves segmentation predictions along the boundaries of objects contained in the multi-scale images, thereby markedly improving the accuracy of semantic image segmentation.

FIG. 1 is a diagram for explaining a semantic image segmentation method using a deep convolutional neural network according to the prior art.
FIG. 2 is a conceptual diagram schematically illustrating an apparatus and method for semantic image segmentation using a fully convolutional neural network based on multi-scale images and multi-scale dilated convolution according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining feature extraction using a multi-scale dilated convolution cascade with different rates according to an embodiment of the present invention.
FIG. 4 is a diagram showing a deep convolutional neural network with multi-scale dilated convolution according to an embodiment of the present invention.
FIG. 5 is a diagram showing a fully convolutional neural network according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining the maxout layer included in the fully convolutional neural network according to an embodiment of the present invention.
FIG. 7 is a block diagram showing the configuration of a semantic image segmentation apparatus according to an embodiment of the present invention.
FIG. 8A is a diagram showing the structure of a ResNet-101-based fully convolutional neural network according to an embodiment of the present invention.
FIG. 8B is a diagram showing the structure of a VGG-16-based fully convolutional neural network according to an embodiment of the present invention.
FIG. 9 is a diagram comparing, on the PASCAL VOC 2012 dataset, the accuracy of semantic image segmentation by the fully convolutional neural network according to an embodiment of the present invention with the accuracy of semantic image segmentation by other methods.
FIG. 10 is a diagram comparing the processing speeds of the VGG-16-based and ResNet-101-based fully convolutional neural networks according to an embodiment of the present invention.
FIG. 11 is a diagram comparing the performance of the ResNet-101-based and VGG-16-based fully convolutional neural networks with the fully connected conditional random field model applied, according to an embodiment of the present invention.
FIG. 12 is a diagram showing an example of semantic segmentation of images that emphasize activities involving person-object and person-person interactions according to an embodiment of the present invention.
FIG. 13 is a diagram showing an example of semantic image segmentation of animals according to an embodiment of the present invention.
FIG. 14 is a diagram showing an example of semantic image segmentation of objects according to an embodiment of the present invention.
FIG. 15A is a diagram comparing the performance of the fully convolutional neural network with that of other methods on a skin lesion dataset for melanoma detection according to an embodiment of the present invention.
FIG. 15B is a diagram showing semantic image segmentation results on the skin lesion dataset for melanoma detection obtained with the fully convolutional neural network according to an embodiment of the present invention.
FIG. 16 is a flowchart showing a procedure for performing semantic image segmentation according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the apparatus and method for semantic image segmentation using a fully convolutional neural network based on multi-scale images and multi-scale dilated convolution of the present invention are described in detail with reference to the accompanying drawings. Like reference numerals in the drawings denote like elements. Specific structural and functional descriptions of the embodiments are given only for the purpose of describing embodiments according to the present invention. Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and should not be interpreted in an idealized or excessively formal sense unless expressly so defined in this specification.

FIG. 1 is a diagram for explaining a semantic image segmentation method using a deep convolutional neural network according to the prior art.

As shown in FIG. 1, the prior-art semantic image segmentation method uses a deep convolutional neural network that consists of a plurality of convolution layers and max-pooling layers that perform sub-sampling, and it generates a learning model for semantic image segmentation by learning input images.

The prior-art deep convolutional neural network also applies max-pooling layers and stride operators repeatedly to down-sample the input image and reduce its dimensions, which decreases the number of training parameters and lowers the computational cost.
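To make the resulting loss of resolution concrete, the sketch below applies the standard output-size formula for pooling without padding. The 224-pixel input and the five 2x2, stride-2 pooling stages are illustrative assumptions, not values taken from this specification.

```python
def pooled_size(size, kernel=2, stride=2):
    """Output length along one axis after pooling without padding."""
    return (size - kernel) // stride + 1


def resolution_after_pooling(size, num_pool_layers, kernel=2, stride=2):
    """Apply the same pooling stage repeatedly and return the final length."""
    for _ in range(num_pool_layers):
        size = pooled_size(size, kernel, stride)
    return size


# Hypothetical example: a 224-pixel side passed through five pooling stages.
print(resolution_after_pooling(224, 5))  # 7
```

In this assumed setting the feature map ends up 32 times smaller per axis than the input, which is the kind of spatial detail loss the following paragraph refers to.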

However, the prior art that uses a deep convolutional neural network greatly reduces the spatial resolution of the original image, so the accuracy of semantic image segmentation is markedly reduced.

Accordingly, the present invention proposes a fully convolutional neural network that includes dilated convolution layers, so that the resolution of the input matrix fed to each convolution layer is not reduced and the number of training parameters is not increased. The invention thereby aims to provide an apparatus and method for semantic image segmentation, using a fully convolutional neural network based on multi-scale images and multi-scale dilated convolution, that perform semantic image segmentation accurately and quickly.
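A one-dimensional toy example can show why dilation keeps both the resolution and the weight count unchanged. The sketch below is a simplified illustration under stated assumptions (stride 1, zero padding, an odd-length kernel, and made-up input values); it is not the network described in this specification.

```python
def dilated_conv1d(signal, weights, rate):
    """Stride-1 dilated convolution (cross-correlation form) with zero padding.

    The output has the same length as the input, and the number of weights
    does not depend on the dilation rate.
    """
    if len(weights) % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel length")
    half = len(weights) // 2
    output = []
    for center in range(len(signal)):
        total = 0
        for tap, weight in enumerate(weights):
            # Taps are spaced `rate` samples apart around the center position.
            index = center + (tap - half) * rate
            if 0 <= index < len(signal):  # positions outside the signal count as zero
                total += weight * signal[index]
        output.append(total)
    return output


signal = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical input row
weights = [1, 1, 1]                  # the same three weights for every rate
print(dilated_conv1d(signal, weights, rate=1))  # [3, 6, 9, 12, 15, 18, 21, 15]
print(dilated_conv1d(signal, weights, rate=2))  # [4, 6, 9, 12, 15, 18, 12, 14]
```

Both calls return eight values from three weights; only the spacing between the sampled inputs changes with the rate.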

FIG. 2 is a conceptual diagram schematically illustrating an apparatus and method for semantic image segmentation using a fully convolutional neural network based on multi-scale images and multi-scale dilated convolution according to an embodiment of the present invention.

As shown in FIG. 2, the apparatus for semantic image segmentation using a fully convolutional neural network based on multi-scale images and dilated convolution according to an embodiment of the present invention (hereinafter, the semantic image segmentation apparatus) 100 learns the training data stored in the training data database 310 of the database 300 to generate a learning model for semantic image segmentation.

The training data includes not only images containing various objects such as people, bottles, cars, and trains, but also medical images used in the medical field, such as skin images containing skin lesions.

That is, the semantic image segmentation apparatus 100 learns the plurality of images that make up the training data so that meaningful parts, such as a person, a bottle, a car, a train, or a diseased spot, can be segmented from a specific image.

The learning is performed on the basis of a fully convolutional neural network. The fully convolutional neural network is composed of at least one sub-network, each of which receives one of the multi-scale images of an original image of the training data, and a dilated convolution cascade architecture is attached to the end of each sub-network.

The multi-scale images are a plurality of images obtained by rescaling the original image to different scales in a preprocessing step, and the fully convolutional neural network learns the training data by receiving each rescaled image through the corresponding sub-network.
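The preprocessing step can be sketched as follows. This is a minimal illustration, assuming nearest-neighbor resampling and the scale factors 0.5, 0.75, and 1.0; neither the interpolation method nor the scale values are stated in this passage, and the function names are hypothetical.

```python
def resize_nearest(image, scale):
    """Rescale a 2-D image (list of rows) by `scale` using nearest-neighbor sampling."""
    src_h, src_w = len(image), len(image[0])
    dst_h = max(1, round(src_h * scale))
    dst_w = max(1, round(src_w * scale))
    return [
        [
            image[min(src_h - 1, int(y / scale))][min(src_w - 1, int(x / scale))]
            for x in range(dst_w)
        ]
        for y in range(dst_h)
    ]


def build_multi_scale(image, scales=(0.5, 0.75, 1.0)):
    """Return one rescaled copy of `image` per scale factor (scales are assumed values)."""
    return [resize_nearest(image, scale) for scale in scales]


# Hypothetical 4x4 single-channel image whose pixel value is row * 4 + column.
original = [[row * 4 + col for col in range(4)] for row in range(4)]
pyramid = build_multi_scale(original)
print([(len(img), len(img[0])) for img in pyramid])  # [(2, 2), (3, 3), (4, 4)]
```

Each element of `pyramid` would then be fed to its own sub-network, in line with the one-image-per-sub-network arrangement described above.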

The fully convolutional neural network is also combined with a fully connected conditional random field model, which effectively removes isolated false positives and thereby improves the accuracy of semantic segmentation.

The structure of the fully convolutional neural network according to an embodiment of the present invention described above is explained in detail with reference to FIG. 5.

The semantic image segmentation apparatus 100 also stores the learning model generated by learning the training data in the learning model database 320.

When a specific image is input for semantic image segmentation, the semantic image segmentation apparatus 100 performs the preprocessing step to generate multi-scale images of the specific image.

The semantic image segmentation apparatus 100 then loads the stored learning model from the learning model database 320 and applies the multi-scale images of the specific image to the loaded learning model, thereby performing semantic image segmentation on that image.

또한 상기 특정 이미지는 사용자 단말(200)이나, 의료 시스템, 자율주행차량 등과 같은 특정 시스템과 연계된 카메라(미도시)로부터 입력될 수 있다.In addition, the specific image may be input from a user terminal 200 or a camera (not shown) associated with a specific system such as a medical system or an autonomous vehicle.

상기 사용자 단말(200)은 스마트 폰, PDA, 노트북 PC 등과 같이 사용자가 구비한 무선통신단말을 의미한다. The user terminal 200 refers to a wireless communication terminal provided by the user, such as a smart phone, PDA, notebook PC, and the like.

한편 상기 사용자 단말(200)은 상기에서 설명한 것과 같이 네트워크를 통해 시맨틱 이미지 세그먼테이션 장치(100)로 특정 이미지를 전송하여 상기 시맨틱 이미지 세그먼테이션 장치(100)로부터 시맨틱 이미지 세그먼테이션 결과를 수신할 수 있으며, 또한 상기 시맨틱 이미지 세그먼테이션 장치(100)로부터 상기 학습모델을 다운로드하여 사용자 단말(200) 자체에서 시맨틱 이미지 세그먼테이션을 수행할 수 도 있다. 이때, 상기 사용자 단말(200)은 시맨틱 이미지 세그먼테이션용 디바이스가 된다.Meanwhile, as described above, the user terminal 200 may receive a semantic image segmentation result from the semantic image segmentation device 100 by transmitting a specific image to the semantic image segmentation device 100 through a network. The learning model may be downloaded from the semantic image segmentation apparatus 100 to perform semantic image segmentation on the user terminal 200 itself. In this case, the user terminal 200 becomes a device for semantic image segmentation.

In addition, the semantic image segmentation apparatus 100 may interwork with a medical system (not shown): it receives an image of a patient's affected area from the medical system, segments the affected area from that image, and provides the result to medical personnel such as a doctor, so that an accurate diagnosis of the patient can be made on that basis. In this case, the semantic image segmentation apparatus 100 may be linked with the medical system through a network, or may be integrated with the medical system so as to perform semantic image segmentation on the image of the affected area locally.

In addition, the semantic image segmentation apparatus 100 may interwork with an autonomous vehicle: it receives driving-related images from a camera mounted on the vehicle and segments the plurality of objects contained in each image, so that the objects located around the autonomous vehicle can be recognized and the vehicle can drive stably.

That is, the semantic image segmentation apparatus 100 according to an embodiment of the present invention may be applied to various fields of image recognition, segmenting meaningful parts from a specific image so that object recognition and scene recognition based on that image can be performed quickly and effectively.

In addition, the database 300 includes a learning data database 310 for storing learning data and a learning model database 320 for storing the learning model generated by the semantic image segmentation apparatus 100.

이하에서는 도 3 및 도 4를 참조하여, 본 발명의 일 실시예에 따른 멀티 스케일 확장된 컨볼루션의 캐스케이드 구조를 개략적으로 설명하도록 한다.Hereinafter, a cascade structure of a multi-scale extended convolution according to an embodiment of the present invention will be described with reference to FIGS. 3 and 4.

FIG. 3 is a diagram for explaining dense feature extraction using a cascade of multi-scale extended convolutions having different rates according to an embodiment of the present invention, and FIG. 4 is a diagram illustrating a deep convolutional neural network having multi-scale extended convolutions according to an embodiment of the present invention.

As shown in FIG. 3, the cascade structure of the multi-scale extended convolution according to an embodiment of the present invention, unlike the max pooling layers and stride operators of the related art, is built from extended convolution layers (also called convolution layers with extended filters). An extended convolution layer expands its filter before computing the convolution and then applies the expanded filter to the layer's input matrix.

The size of the filter is expanded, and the empty positions are filled with zeros. As a result, the weights are matched with distant elements of the input matrix. Which distant elements are matched is determined by the rate r.

If the kernel center is aligned with an arbitrary position of the image, the kernel elements coincide with the input elements as shown in FIG. 3.

As a simple example of extended convolution, a one-dimensional extended convolution may be applied to a one-dimensional signal. The output y[i] of the extended convolution of an input x[i] with a filter w[k] is calculated by Equation 1 below.

[수학식 1][Equation 1]

여기서, m은 필터 w[k]의 길이(length)다. Where m is the length of filter w [k].
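
The image for Equation 1 is not reproduced in this text. A form consistent with the definitions above (input x[i], filter w[k] of length m, rate r), assuming the usual definition of a one-dimensional dilated convolution, would be:

```latex
y[i] = \sum_{k=1}^{m} x[i + r \cdot k] \, w[k]
```

With r = 1 this reduces to an ordinary convolution; a larger r samples the input at every r-th position.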

In the present invention, the main advantage of the extended convolution is that it enlarges the receptive field of the filter in a convolution layer without reducing the resolution of the input matrix. By applying an extended convolution with rate r, a filter with kernel size k × k can be expanded up to size k', where k' = k + (k - 1)(r - 1).
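
The zero-filling of the filter and the resulting size k' = k + (k - 1)(r - 1) can be illustrated with a minimal NumPy sketch. The function names `dilate_kernel` and `dilated_conv1d`, the zero padding, and the 0-based tap index are assumptions made for illustration; they are not taken from the patent.

```python
import numpy as np

def dilate_kernel(w, r):
    """Insert r - 1 zeros between consecutive taps of a 1-D filter."""
    w = np.asarray(w, dtype=float)
    k = len(w)
    out = np.zeros(k + (k - 1) * (r - 1), dtype=float)  # expanded size k'
    out[::r] = w                                        # original taps, r apart
    return out

def dilated_conv1d(x, w, r):
    """y[i] = sum_j x[i + r*j] * w[j] (0-based j), zero padded so len(y) == len(x)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    k = len(w)
    span = (k - 1) * r                    # distance between first and last tap
    left = span // 2
    xp = np.pad(x, (left, span - left))   # zero padding keeps the resolution
    return np.array([sum(xp[i + r * j] * w[j] for j in range(k))
                     for i in range(len(x))])

x = np.arange(10.0)
w = [1.0, 2.0, 3.0]
print(dilate_kernel(w, 2))            # a 3-tap filter expanded to 5 taps
print(dilated_conv1d(x, w, 2).shape)  # same length as the input
```

Sliding the zero-filled filter over the padded signal gives the same result as sampling the input every r positions, which is why the output keeps the input resolution while the filter covers a wider span.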

상기 확장은 몇 가지 장점을 제공하지만, 또한 몇 가지 단점을 발생시킨다. 한편, 본 발명에서 완전 컨볼루션 뉴럴 네트워크의 계산 비용을 증가시키지 않으면서 멀티 스케일 이미지에 대한 보다 큰 컨텍스트를 캡쳐하기 위해 큰 레이트 r로 확장된 컨볼루션을 적용한다.The extension offers several advantages, but also introduces some disadvantages. On the other hand, the present invention applies extended convolution at a large rate r to capture a larger context for multi-scale images without increasing the computational cost of a full convolutional neural network.

However, an extended convolution with a large rate r introduces more zeros between the filter values and can therefore lose more local context information in smaller regions.

An extended convolution with a small rate r, in turn, can be used to improve localization performance. Nevertheless, each such extended convolution layer produces a feature map with a narrow receptive field. Therefore, a combination of extended convolutions with different rates r is required to extract denser feature maps.

따라서, 본 발명은 멀티 스케일 이미지에서 컨텍스트 정보를 추출하기 위한 멀티 스케일 확장된 컨볼루션 레이어의 캐스케이드 아키텍처를 제안한다.Accordingly, the present invention proposes a cascade architecture of a multiscale extended convolution layer for extracting context information from a multiscale image.

The cascade architecture consists of consecutive convolution layers, and each layer uses only extended convolution kernels with the same rate r, so that it produces a dense feature map with a uniform receptive field.

또한 이전 컨볼루션 레이어는 현재의 컨볼루션 레이어보다 작은 레이트를 가지는 컨볼루션을 사용하므로, 로컬 특징을 추출하고 로컬화 정확도를 향상시킬 수 있다.In addition, the previous convolutional layer uses a convolution with a lower rate than the current convolutional layer, so that local features can be extracted and localization accuracy is improved.

In contrast, the current convolution layer uses an extended convolution with a larger rate to increase context assimilation.

따라서 현재의 확장된 컨볼루션 레이어의 특징맵은 모이며, 이전의 특징맵보다 더 조밀해 질 수 있다.Therefore, the feature maps of the current extended convolution layer are aggregated and can be more dense than the previous feature maps.
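
How a cascade with increasing rates widens the context seen by later layers can be sketched numerically. The rate schedule (1, 2, 4, 8) and the kernel size 3 below are hypothetical values chosen for illustration; the text does not state the rates actually used. Stride 1 is assumed for every layer.

```python
def effective_kernel_size(k, r):
    """Size of a k-tap filter after dilation with rate r: k' = k + (k - 1)(r - 1)."""
    return k + (k - 1) * (r - 1)

def cascade_receptive_fields(k, rates):
    """Receptive field (in input samples) after each stride-1 dilated layer."""
    rf = 1
    fields = []
    for r in rates:
        rf += effective_kernel_size(k, r) - 1  # each layer adds k' - 1 samples
        fields.append(rf)
    return fields

print(cascade_receptive_fields(3, [1, 2, 4, 8]))  # [3, 7, 15, 31]
```

The early, small-rate layers see only a few neighbouring samples (local detail), while the later, large-rate layers aggregate a much wider context on top of them.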

한편 상기 멀티 스케일 확장된 컨볼루션 캐스케이드 아키텍처는 도 4에 나타낸 것과 같이, 고해상도로 최종 특징맵을 계산하기 위한 딥 컨볼루션 뉴럴 네트워크에 추가된다.Meanwhile, the multi-scale extended convolution cascade architecture is added to the deep convolutional neural network for calculating the final feature map at high resolution, as shown in FIG. 4.

도 5는 본 발명의 일 실시예에 따른 완전 컨볼루션 뉴럴 네트워크를 나타낸 도면이며, 도 6은 본 발명의 일 실시예에 따른 완전 컨볼루션 뉴럴 네트워크에 포함되는 맥스아웃 레이어를 나타낸 도면이다.FIG. 5 illustrates a full convolutional neural network according to an embodiment of the present invention, and FIG. 6 illustrates a max out layer included in the full convolutional neural network according to an embodiment of the present invention.

도 5에 도시한 바와 같이, 본 발명의 일 실시예에 따른 완전 컨볼루션 뉴럴 네트워크는, 원본 이미지에 대한 멀티 스케일 이미지에 대한 특징맵을 학습하도록 구성된다.As shown in FIG. 5, a full convolutional neural network according to an embodiment of the present invention is configured to learn a feature map for a multi-scale image with respect to an original image.

한편 멀티 스케일 이미지는 학습데이터의 원본 이미지로부터 복수의 상이한 스케일로 조정된 것을 의미함은 상술한 바와 같다.Meanwhile, as described above, the multi-scale image means that the image is adjusted to a plurality of different scales from the original image of the training data.

In addition, the fully convolutional neural network includes a plurality of sub-networks, each of which takes one of the differently scaled images as its input. Each sub-network is a deep convolutional neural network.

Meanwhile, context information is captured effectively by combining features from extended convolutions with different rates r. However, enlarging the receptive field with a large rate r increases the likelihood of losing context information at the positions where the extended convolution introduces zeros between consecutive filter values. The larger the rate r, the more zeros are introduced.

또한 동일한 객체가 서로 다른 이미지에서 서로 다른 크기를 가질 수 있기 때문에, 멀티 스케일 이미지 입력에서 멀티 스케일 특징을 서칭(searching)하는 것이 합리적이다.It is also reasonable to search for multi-scale features in a multi-scale image input because the same object may have different sizes in different images.

따라서 상기 서브 네트워크에 멀티 스케일 이미지를 적용할 수 있으며, 상기 각 조정된 스케일은 상기 완전 컨볼루션 뉴럴 네트워크의 한 브랜치(branch)를 통과하고, 상기 특징들은 모든 스케일로부터 융합(fuse)된다. Thus, a multi-scale image can be applied to the sub-network, with each adjusted scale passing through a branch of the full convolutional neural network, and the features are fused from all scales.

일반적으로 하나의 딥 컨볼루션 뉴럴 네트워크에 멀티 스케일 이미지를 입력하는 것은 많은 수의 학습 파라미터가 필요하기 때문에 학습 과정에서의 비용이 매우 고가이다.In general, inputting a multi-scale image into a deep convolutional neural network requires a large number of learning parameters, which is very expensive in the learning process.

Therefore, the fully convolutional neural network proposed by the present invention uses multi-scale images as input matrices and integrates the extended convolution cascade architecture at the end of each sub-network, which prevents an increase in the number of learning parameters and allows training for semantic segmentation to be performed quickly and accurately.

In addition, the original image is rescaled to several scales by bilinear interpolation, and each rescaled image passes through its own branch of the fully convolutional neural network (that is, the extended convolution cascade architecture that performs the multi-scale extended convolution of each sub-network).

상기 각 브랜치의 출력은 다른 특징맵과 동일한 해상도로 재조정된 특징맵이다.The output of each branch is a feature map readjusted to the same resolution as the other feature maps.

Finally, all the feature maps are fused into a shared feature map. Because each object in the image stands out in a different feature map, the fully convolutional neural network adopts a maxout layer to obtain the competitive, dominant features from all feature maps, and fuses all the feature maps into the shared feature map through that maxout layer.

또한 상기 맥스아웃 레이어는 도 6에 도시한 것과 같이, 최대 특징 맵(maximal feature map)으로 간주된다.Also, the max out layer is regarded as a maximum feature map, as shown in FIG. 6.

특히, 확장된 컨볼루션 레이어는 특징맵의 그룹이 포함되며, 이들 그룹의 동일한 좌표에 있는 특징값을 비교하여 최대값을 선택한 다음 맥스아웃 레이어의 동일한 좌표의 특징값을 상기 선택한 최대값으로 할당한다.In particular, the extended convolution layer includes groups of feature maps, compares feature values at the same coordinates of these groups, selects a maximum value, and then assigns the feature value of the same coordinates of the maxout layer to the selected maximum value. .
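
The coordinate-wise maximum described above can be written in a few lines of NumPy. This is a minimal sketch, assuming the feature maps have already been brought to the same resolution; the name `maxout_fuse` is hypothetical.

```python
import numpy as np

def maxout_fuse(feature_maps):
    """Fuse same-shaped feature maps by taking the maximum at each coordinate."""
    stacked = np.stack(feature_maps, axis=0)  # (num_maps, H, W) or (num_maps, H, W, C)
    return stacked.max(axis=0)

a = np.array([[0.1, 0.9], [0.4, 0.2]])
b = np.array([[0.5, 0.3], [0.4, 0.8]])
print(maxout_fuse([a, b]))  # [[0.5 0.9] [0.4 0.8]]
```

Each output value comes from whichever branch responded most strongly at that coordinate, so a feature that is prominent at only one scale still survives the fusion.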

In addition, the fully convolutional neural network includes a fully connected conditional random field, as shown in FIG. 5.

The fully convolutional neural network can identify objects and provide a reliable final score map of their approximate locations, but the boundaries may not be extracted accurately and sharply. This is because the network has to complete two difficult tasks whose accuracies must be balanced against each other: object identification and pixel-level object localization.

따라서 본 발명은 완전 컨볼루션 뉴럴 네트워크의 끝단에 완전 연결된 조건부 랜덤 필드 모델을 적용하여 픽셀 레벨에서 객체의 위치를 파악하기 위한 객체 로컬화 문제를 해결하고, 객체의 경계를 정확하게 복구할 수 있도록 함으로써, 시맨틱 이미지 세그먼테이션의 정확성을 현저하게 향상시킬 수 있도록 한다.Accordingly, the present invention solves an object localization problem for locating an object at the pixel level by applying a conditional random field model that is fully connected to the end of a full convolutional neural network, and accurately recovers the boundary of the object. It can significantly improve the accuracy of semantic image segmentation.

To locate objects and recover their boundaries, the conditional random field model minimizes an energy function that, according to Equation 2 below, uses two sets of potential functions: unary potentials and pairwise potentials.

[수학식 2][Equation 2]

Here, α_i(x_i) is the unary potential, which measures the inverse likelihood of label x_i at pixel i, and α_ij(x_i, x_j) is the pairwise potential, which estimates the cost of assigning labels x_i and x_j to pixels i and j.
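
The image for Equation 2 is not reproduced in this text. A form consistent with the description (an energy made up of unary and pairwise potentials), assuming the usual fully connected conditional random field formulation in which the pairwise sum runs over all pixel pairs, would be:

```latex
E(x) = \sum_{i} \alpha_i(x_i) + \sum_{i<j} \alpha_{ij}(x_i, x_j)
```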

In addition, the unary potential is the output of a pixel-wise classifier and is calculated by Equation 3 below.

[수학식 3][Equation 3]

Here, P(x_i) is the probability of assigning label x_i to pixel i. P(x_i) is calculated by the fully convolutional neural network of the present invention, and the pairwise potential is calculated based on the image gradient between a pixel and its neighboring pixels.
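
The image for Equation 3 is not reproduced in this text. Since the unary potential is described as an inverse likelihood derived from the label probability P(x_i), a consistent form, assuming the negative log-likelihood commonly used for this purpose, would be:

```latex
\alpha_i(x_i) = -\log P(x_i)
```

Under this assumption a confident prediction (P(x_i) close to 1) contributes almost nothing to the energy, while an unlikely label is penalized heavily.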

특히, 픽셀과 해당 픽셀의 이웃 픽셀은 계산된 그래디언트가 작을 경우 동일한 레이블로 분류된다. 따라서 페어와이즈 포텐셜은 세그먼테이션된 객체의 모양에 일관성을 부여하고 객체에 대한 묘사를 향상시킨다.In particular, pixels and neighboring pixels of the pixel are classified under the same label when the calculated gradient is small. Thus, Pairwise Potential brings consistency to the shape of the segmented object and improves the description of the object.

또한 페어와이즈 포텐셜은 다음의 [수학식 4]에 의해 계산된다.In addition, the pairwise potential is calculated by Equation 4 below.

[수학식 4][Equation 4]

Here, f^m is a Gaussian kernel weighted by the parameter ω_m. The Gaussian kernel f^m is calculated based on the features τ_i and τ_j collected for pixels i and j, respectively.

The Gaussian kernel f^m is calculated by Equation 5 below.
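
The image for Equation 4 is not reproduced in this text. A form consistent with the description (Gaussian kernels f^m over the features τ_i and τ_j, each weighted by ω_m) would be the following, where the label compatibility function μ(x_i, x_j) is an assumption borrowed from the standard fully connected conditional random field formulation (for example, μ = 1 when x_i ≠ x_j and 0 otherwise):

```latex
\alpha_{ij}(x_i, x_j) = \mu(x_i, x_j) \sum_{m} \omega_m \, f^{m}(\tau_i, \tau_j)
```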

[수학식 5][Equation 5]

Here, the features τ_i and τ_j are expressed by the pixel color intensity, denoted I, and the pixel position, denoted p.
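
The image for Equation 5 is not reproduced in this text. One form consistent with the description (kernels built from the color intensity I and the position p) is the pair of kernels commonly used with fully connected conditional random fields, shown here as an assumption; the bandwidths σ_α, σ_β and σ_γ are hypothetical symbols that do not appear in the surviving text:

```latex
f^{1}(\tau_i, \tau_j) = \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2} - \frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2} \right),
\qquad
f^{2}(\tau_i, \tau_j) = \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2} \right)
```

In this form the first kernel encourages nearby pixels with similar color to share a label, and the second depends on position only and suppresses small isolated regions.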

마지막으로, 에너지 함수 E(x)는 입력 이미지에 대한 최상의 레이블 할당을 찾기 위해 최소화된다. 그러나 이 최소화 문제는 원래 다루기 힘든 문제이다. 따라서 본 발명은 확률적 추론을 효율적으로 근사하기 위해, 조건부 랜덤 필드 모델의 분포에 평균-필드(mean-field) 근사를 적용한다.Finally, the energy function E (x) is minimized to find the best label assignment for the input image. However, this minimization problem is inherently difficult to deal with. Therefore, the present invention applies mean-field approximation to the distribution of the conditional random field model in order to approximate the stochastic inference efficiently.

That is, the distribution P(x) of the conditional random field model is approximated by a distribution Q(x), which can be expressed as the product ∏Q_i(x_i) over the pixels i.
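
The mean-field idea can be illustrated with a naive sketch that builds the full pixel-to-pixel kernel matrix, so it is only usable on tiny inputs. Everything below is an assumption made for illustration: the function name, the weights and bandwidths, the Potts-style label compatibility, and the two Gaussian kernels (position plus color, and position only). It is not the patent's implementation, and practical systems use efficient filtering instead of a dense matrix.

```python
import numpy as np

def mean_field_inference(prob, positions, colors, w1=1.0, w2=1.0,
                         s_alpha=3.0, s_beta=10.0, s_gamma=1.0, iters=5):
    """Approximate the CRF posterior by Q(x) = prod_i Q_i(x_i).

    prob:      (N, L) label probabilities from the network
    positions: (N, D) pixel coordinates p
    colors:    (N, C) pixel color intensities I
    """
    unary = -np.log(np.clip(prob, 1e-8, 1.0))             # unary potential
    dp = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    dc = ((colors[:, None, :] - colors[None, :, :]) ** 2).sum(-1)
    kernel = (w1 * np.exp(-dp / (2 * s_alpha ** 2) - dc / (2 * s_beta ** 2))
              + w2 * np.exp(-dp / (2 * s_gamma ** 2)))    # weighted Gaussian kernels
    np.fill_diagonal(kernel, 0.0)                         # no message to itself
    q = prob.copy()
    for _ in range(iters):
        msg = kernel @ q                                  # (N, L) support per label
        # Potts compatibility: label l is penalized by the mass on all other labels
        pairwise = msg.sum(axis=1, keepdims=True) - msg
        logits = -unary - pairwise
        logits -= logits.max(axis=1, keepdims=True)       # numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)
    return q

# Five pixels in a row with identical color; the middle one is an isolated outlier.
prob = np.array([[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1], [0.9, 0.1]])
pos = np.arange(5.0).reshape(-1, 1)
col = np.zeros((5, 1))
print(mean_field_inference(prob, pos, col).argmax(axis=1))  # [0 0 0 0 0]
```

In this toy input the middle pixel initially prefers label 1, but its neighbours, which have the same color and strongly prefer label 0, outvote it, which mirrors the removal of isolated false positives described earlier.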

FIG. 7 is a block diagram illustrating the configuration of a semantic image segmentation apparatus according to an embodiment of the present invention.

As shown in FIG. 7, the semantic image segmentation apparatus 100 according to an embodiment of the present invention includes: an image collector 110 that collects at least one image on which semantic image segmentation is to be performed; a preprocessor 120 that generates multi-scale images for the plurality of images constituting the learning data; a learning model generation unit 130 for semantic image segmentation that learns the generated multi-scale images to generate a learning model for semantic image segmentation; a semantic image segmentation unit 140 that uses the learning model to perform semantic image segmentation on a specific image collected through the image collector 110; a semantic image segmentation result providing unit 150 that provides the result of the performed semantic image segmentation; a database interface unit 160; and a controller 170 for overall control of the semantic image segmentation apparatus 100.

또한 이미지 수집부(110)는 사용자의 요청에 의해 시맨틱 이미지 세그먼테이션을 수행하기 위한 특정 이미지를 수집하는 기능을 수행한다.In addition, the image collector 110 performs a function of collecting a specific image for performing semantic image segmentation at the user's request.

상기 특정 이미지는 네트워크를 통해 사용자 단말(200)로부터 수집되거나, 특정 시스템(예: 의료 시스템)으로부터 수집될 수 있다.The specific image may be collected from the user terminal 200 through a network or from a specific system (eg, a medical system).

즉, 상기 특정 이미지는 네트워크를 통해서 수집되거나, 로컬에서 직접적으로 입력될 수 있다. 다만, 상기 이미지 수집부(110)는 시맨틱 이미지 세그먼테이션을 위해 적어도 하나 이상의 이미지를 수집하는 것으로, 상기 이미지를 수집하는 방법에 대해서는 그 제한을 두지 않는다.That is, the specific image may be collected through a network or directly input locally. However, the image collector 110 collects at least one image for semantic image segmentation, and the method of collecting the image is not limited thereto.

In addition, the image collector 110 may periodically collect images that form the basis of the learning data, so that the learning model for semantic image segmentation generated by the learning model generation unit 130 can be updated. That is, the image collector 110 periodically collects images from institutions that provide learning data for semantic segmentation and reflects the collected images in the learning data database 310, so that the learning model for semantic image segmentation can be kept up to date.

또한 전처리부(120)는 상기 시맨틱 이미지 세그먼테이션용 학습모델 생성부(130)를 통해 학습데이터를 학습할 수 있도록, 상기 학습데이터의 원본 이미지를 전처리하여 멀티 스케일 이미지를 생성한다.In addition, the preprocessor 120 generates a multi-scale image by preprocessing the original image of the training data so as to learn the training data through the learning model generator 130 for semantic image segmentation.

The preprocessor 120 generates a plurality of scaled images by rescaling the original image at preset scale ratios (e.g., 1.0, 0.75, and 0.5).

즉, 상기 전처리부(120)는 상기 원본 이미지에 대한 스케일을 미리 설정된 비율로 조정하여, 상기 원본 이미지에 대한 멀티 스케일 이미지를 생성하는 것이다.That is, the preprocessor 120 generates a multi-scale image of the original image by adjusting the scale of the original image at a preset ratio.
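
The preprocessing step can be sketched as follows, assuming bilinear interpolation (the method named elsewhere in this description) with an align-corners sampling grid. The function names `bilinear_resize` and `make_multiscale` and the rounding of the target size are assumptions made for illustration.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resize an (H, W) or (H, W, C) array with bilinear interpolation."""
    img = np.asarray(img, dtype=float)
    in_h, in_w = img.shape[:2]
    ys = np.linspace(0, in_h - 1, out_h)   # sample rows (align-corners grid)
    xs = np.linspace(0, in_w - 1, out_w)   # sample columns
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = ys - y0                           # vertical interpolation weights
    wx = xs - x0                           # horizontal interpolation weights
    if img.ndim == 3:                      # broadcast weights over channels
        wy, wx = wy[:, None, None], wx[None, :, None]
    else:
        wy, wx = wy[:, None], wx[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def make_multiscale(img, scales=(1.0, 0.75, 0.5)):
    """Return one rescaled copy of img per scale ratio."""
    h, w = img.shape[:2]
    return [bilinear_resize(img, max(1, round(h * s)), max(1, round(w * s)))
            for s in scales]

image = np.arange(64, dtype=float).reshape(8, 8)
print([m.shape for m in make_multiscale(image)])  # [(8, 8), (6, 6), (4, 4)]
```

Each returned array would then be fed to its own sub-network.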

또한 상기 전처리부(120)는 사용자로부터 시맨틱 이미지 세그먼테이션을 위한 특정 이미지가 입력되는 경우에도, 상기와 같은 과정을 통해 해당 이미지에 대한 멀티 스케일 이미지를 생성한다. In addition, even when a specific image for semantic image segmentation is input from the user, the preprocessor 120 generates a multi-scale image of the corresponding image through the above process.

또한 시맨틱 이미지 세그먼테이션용 학습모델 생성부(130)는 상기 전처리부(120)를 통해 상기 생성한 학습데이터에 대한 복수의 스케일된 이미지를 학습하여 시맨틱 이미지 세그먼테이션용 학습모델을 생성한다.In addition, the learning model generation unit 130 for semantic image segmentation generates a learning model for semantic image segmentation by learning a plurality of scaled images of the generated learning data through the preprocessor 120.

상기 시맨틱 이미지 세그먼테이션용 학습모델 생성부(130)는 본 발명에서 제안하는 완전 컨볼루션 뉴럴 네트워크를 토대로 상기 복수의 스케일된 이미지를 학습한다.The learning model generation unit 130 for semantic image segmentation learns the plurality of scaled images based on the complete convolutional neural network proposed by the present invention.

상기 완전 컨볼루션 뉴럴 네트워크는 복수의 서브 네트워크를 포함하며, 상기 서브 네트워크에 상기 복수의 스케일된 이미지가 각각 입력된다.The full convolutional neural network includes a plurality of sub-networks, and the plurality of scaled images are respectively input to the sub-networks.

또한 상기 각 서브 네트워크는 복수의 컨볼루션 레이어와 완전 연결된 레이어(fully connected layer) 및 멀티 스케일 확장된 컨볼루션 아키텍처를 포함한다. Each sub-network also includes a plurality of convolution layers, a fully connected layer and a multi-scale extended convolution architecture.

상기 멀티 스케일 확장된 컨볼루션 아키텍처는 복수의 멀티 스케일 확장된 컨볼루션 레이어가 캐스케이드 형태로 구성되며, 상기 각 서브 네트워크의 완전 연결된 레이어를 통해 출력되는 특징맵에 대한 해상도를 복구하여, 고해상도의 최종 특징맵을 생성한다.In the multi-scale extended convolution architecture, a plurality of multi-scale extended convolution layers are formed in a cascade form, and the resolution of the feature map output through the fully connected layer of each sub-network is restored, resulting in a high resolution final feature. Create a map.

That is, the multi-scale extended convolution architecture restores the resolution of the feature map of each sub-network, thereby recovering the resolution that was reduced within the sub-network.

또한 완전 컨볼루션 뉴럴 네트워크는 상기 각 멀티 스케일 확장된 컨볼루션 아키텍처를 통해 출력되는 각각의 최종 특징맵이 모두 동일한 해상도를 가지도록 재조정하여, 상기 재조정한 최종 특징맵을 출력한다.In addition, the full convolutional neural network readjusts each final feature map output through each of the multiscale extended convolution architectures to have the same resolution, and outputs the adjusted final feature map.

또한 상기 완전 컨볼루션 뉴럴 네트워크는 상기 재조정된 특징맵을 통합하여 객체에 대한 특징값을 추출하기 위한 맥스아웃 레이어 및 상기 맥스아웃 레이어와 연결되는 완전 연결된 조건부 랜덤 필드 모델을 포함하여 구성된다.The fully convolutional neural network also includes a max out layer for extracting feature values for an object by incorporating the readjusted feature map and a fully connected conditional random field model connected with the max out layer.

즉, 맥스아웃 레이어는 상기 재조정된 최종 특징맵을 통합하여 객체에 대한 경계를 추출하며, 상기 완전 연결된 조건부 랜덤 필드 모델은 상기 추출한 객체의 대한 경계를 복구하는 기능을 수행한다.That is, the max out layer integrates the readjusted final feature map to extract a boundary of an object, and the fully connected conditional random field model performs a function of restoring the boundary of the extracted object.

That is, as described with reference to FIG. 5, the fully convolutional neural network learns the training data quickly and accurately based on multi-scale image input, extended convolution, and the fully connected conditional random field model, and thereby generates a learning model for semantic image segmentation.
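
The overall data flow (rescale, run one branch per scale, bring the score maps back to a common resolution, fuse with an element-wise maximum, pick a label per pixel) can be sketched with stand-in components. Everything here is hypothetical: `toy_branch` replaces a trained sub-network, nearest-neighbour resizing replaces bilinear interpolation for brevity, and the conditional random field refinement is omitted.

```python
import numpy as np

def resize_nearest(a, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) array (stand-in for bilinear)."""
    ys = np.arange(out_h) * a.shape[0] // out_h
    xs = np.arange(out_w) * a.shape[1] // out_w
    return a[ys][:, xs]

def toy_branch(img):
    """Stand-in for a trained sub-network: brightness is the score of class 1."""
    bright = img.mean(axis=-1)
    return np.stack([1.0 - bright, bright], axis=-1)  # (H, W, 2) score map

def segment(image, branches, scales=(1.0, 0.75, 0.5)):
    h, w = image.shape[:2]
    score_maps = []
    for scale, branch in zip(scales, branches):
        sh, sw = max(1, round(h * scale)), max(1, round(w * scale))
        scores = branch(resize_nearest(image, sh, sw))    # one branch per scale
        score_maps.append(resize_nearest(scores, h, w))   # back to a common size
    fused = np.max(np.stack(score_maps), axis=0)          # maxout-style fusion
    return fused.argmax(axis=-1)                          # label per pixel

image = np.zeros((8, 8, 3))
image[:, 4:] = 1.0                       # right half is bright
labels = segment(image, [toy_branch] * 3)
print(labels[0])                         # [0 0 0 0 1 1 1 1]
```

In a real system each branch would be a trained sub-network with its extended convolution cascade, and the fused score map would be passed to the conditional random field before the final labels are taken.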

또한 시맨틱 이미지 세그먼테이션부(140)는 사용자로부터 시맨틱 이미지 세그먼테이션을 위한 특정 이미지를 입력받은 경우, 상기 전처리부(120)를 제어하여 상기 특정 이미지로부터 멀티 스케일 이미지를 생성한다.In addition, when the semantic image segmentation unit 140 receives a specific image for semantic image segmentation from the user, the semantic image segmentation unit 140 controls the preprocessor 120 to generate a multi-scale image from the specific image.

In addition, the semantic image segmentation unit 140 loads the learning model for semantic image segmentation from the learning model database 320 and applies the generated multi-scale image to the loaded learning model, thereby performing semantic image segmentation on the specific image.

또한 시맨틱 이미지 세그먼테이션 결과 제공부(150)는 상기 시맨틱 이미지 세그먼테이션부(140)를 통해 수행한 시맨틱 이미지 세그먼테이션에 대한 결과를 상기 사용자에게 제공할 수 있도록 한다.In addition, the semantic image segmentation result providing unit 150 may provide the user with a result of the semantic image segmentation performed through the semantic image segmentation unit 140.

또한 데이터베이스 인터페이스부(160)는, 데이터베이스(300)와 연결되어, 상기 시맨틱 이미지 세그먼테이션 장치(100)에 의해 생성되거나 필요한 데이터를 상기 데이터베이스(300)로부터 저장하거나 로딩하는 기능을 수행한다.In addition, the database interface unit 160 is connected to the database 300 to perform a function of storing or loading data generated or needed by the semantic image segmentation apparatus 100 from the database 300.

또한 제어부(170)는 시맨틱 이미지 세그먼테이션 장치(100)의 각 구성부분에 대한 데이터 이동 및 각 구성부분의 구성을 전반적으로 제어하는 기능을 수행한다.In addition, the controller 170 performs a function of controlling data movement and overall configuration of each component of the semantic image segmentation apparatus 100.

Hereinafter, the semantic image segmentation performance of the fully convolutional neural network of the present invention is compared with that of other methods.

For this comparison, the fully convolutional neural network of the present invention was implemented on the basis of ResNet-101 (see Deep residual learning for image recognition, arXiv:1512.03385) and VGG-16 (see Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556).

FIG. 8A is a diagram illustrating the structure of a ResNet-101 based fully convolutional neural network according to an embodiment of the present invention.

한편 네트워크의 깊이는 뉴럴 네트워크의 정확성을 향상시키는데 매우 중요하다. 그러나 깊은 네트워크를 학습시키는 것은 매우 어려운 과제이다.On the other hand, network depth is very important for improving the accuracy of neural networks. But learning deep networks is a very difficult task.

A deep residual network is a deep convolutional neural network that can be trained at a greater depth than conventional deep neural networks. This is because it adopts a residual learning framework that makes training easier, and the increased network depth yields better performance.

딥 잔차 네트워크가 이미지 분류에서 최첨단 성능에 도달하였기 때문에 다중 스케일 입력 이미지가 있는 ResNet 기반 모델을 이용하여 본 발명의 완전 컨볼루션 뉴럴 네트워크를 구현하였다.Since the deep residual network has reached the state-of-the-art performance in image classification, we implemented the fully convolutional neural network of the present invention using a ResNet-based model with multi-scale input images.

도 8a에 도시한 바와 같이, 본 발명의 일 실시예에 따른 ResNet-101 기반 완전 컨볼루션 뉴럴 네트워크의 구조는 스케일링된 각 이미지에 대해 5개의 블록과 100개 이상의 레이어를 사용하며, 이러한 네트워크는 많은 수의 파라미터로 구성되어 학습 과정에 있어서 GPU 장치의 최대 RAM 용량에 쉽게 도달한다. 따라서 정확도를 증가시키기 위한 계산 비용이 매우 높다.As shown in FIG. 8A, the structure of a ResNet-101 based fully convolutional neural network according to an embodiment of the present invention uses 5 blocks and 100 or more layers for each scaled image, which is a large number of networks. It consists of a number of parameters that easily reach the maximum RAM capacity of the GPU device in the learning process. Therefore, the computational cost to increase the accuracy is very high.

이러한 이유로 본 발명은 스케일링된 이미지 입력을 두개만 사용하였으며, 상기 각각의 스케일은 1.0과 0.75로 설정되어 상기 각 스케일링된 이미지는 ResNet-101의 각 잔차 네트워크에 입력된다.For this reason, the present invention uses only two scaled image inputs, and each scale is set to 1.0 and 0.75 so that each scaled image is input to each residual network of ResNet-101.

Each residual network summarizes the features of its scaled image input in the feature map taken from the output of the last block. In addition, to capture multi-scale context information from that feature map, the ResNet-101 based fully convolutional neural network passes it through the cascade architecture of extended convolution layers and generates a score map for it.

또한 상기 캐스케이드 아키텍처는 상기 스코어 맵의 해상도를 높이기 위해 바이리니어 인터폴레이션 방법을 적용하여 상기 특징맵들이 동일한 해상도를 가지도록 한다.In addition, the cascade architecture applies a bilinear interpolation method to increase the resolution of the score map so that the feature maps have the same resolution.
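As a rough illustration of how bilinear interpolation brings maps of different sizes to a common resolution, the following sketch resizes a single-channel map stored as a list of rows. The "align corners" convention used here is an assumption; the description does not state which convention is applied.

```python
def bilinear_resize(grid, out_h, out_w):
    """Resize a 2-D list-of-lists of floats to (out_h, out_w) by bilinear interpolation.

    Assumes the "align corners" convention: the corner samples of the input
    and output grids coincide.
    """
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output row index back to a fractional input row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out
```

Score maps produced from the 0.75 (or 0.5) scale input could be resized this way to the size of the map from the 1.0 scale input before they are merged.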

Next, the ResNet-101 based fully convolutional neural network uses a max-out layer, which plays an important role in merging the score maps into a final score map that retains the competitive features.
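One common reading of a max-out merge is an element-wise maximum over score maps that already share the same shape. The sketch below assumes that reading and a `[class][row][column]` layout; both are illustrative assumptions rather than details taken from this description.

```python
def maxout_merge(score_maps):
    """Element-wise maximum over score maps of identical shape [class][row][col]."""
    first = score_maps[0]
    return [
        [
            [max(m[c][y][x] for m in score_maps) for x in range(len(first[c][y]))]
            for y in range(len(first[c]))
        ]
        for c in range(len(first))
    ]
```

For every class and pixel, the strongest response across the scales is kept, which is how the merged map can retain the competing features of each input scale.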

Finally, a fully connected conditional random field model integrated at the output of the ResNet-101 based fully convolutional neural network encodes pixel-level pairwise similarities and sharpens object boundaries.

FIG. 8B is a diagram showing the structure of a VGG-16 based fully convolutional neural network according to an embodiment of the present invention.

As shown in FIG. 8B, the VGG-16 based fully convolutional neural network according to an embodiment of the present invention takes as input three images scaled by ratios of 1.0, 0.75, and 0.5, and includes 16 weight layers for each scaled image in order to learn from it.

The VGG-16 based fully convolutional neural network also includes the multi-scale dilated convolution architecture proposed in the present invention and the fully connected conditional random field model.

FIG. 9 is a diagram comparing, on the PASCAL VOC 2012 dataset, the accuracy of semantic segmentation by the fully convolutional neural network according to an embodiment of the present invention with the accuracy of semantic segmentation by other methods.

As shown in FIG. 9, the accuracy of semantic segmentation by the fully convolutional neural network according to an embodiment of the present invention was compared with that of other methods.

The PASCAL VOC 2012 dataset was used for this comparison. The fully convolutional neural network of the present invention was evaluated against recently developed state-of-the-art algorithms: the FCN-8s model and the DeepLab, BoxSup, Zoom-out, CRF-RNN, DPN, and DeconvNet networks.

The fully convolutional neural network of the present invention was pre-trained on the MS-COCO dataset and then trained on the PASCAL VOC 2012 dataset, which covers 20 object classes such as person, bottle, car, and train, plus one background class.

Because the original training data consists of only 1,464 images, which is not enough to train the fully convolutional neural network of the present invention, the additional training data provided by Hariharan et al. ("Semantic contours from inverse detectors," in International Conference on Computer Vision (ICCV)) was used to augment it. In total, 10,582 training images were used.

To further evaluate the accuracy of the semantic image segmentation, the Skin Lesion Analysis Toward Melanoma Detection dataset, the largest collection of skin lesion images, was also used. Recognizing melanoma in dermoscopic images is a very difficult problem because of the low contrast of skin lesions.

The skin lesion analysis dataset consists of 900 training images and 350 test images.

Because all training and test images are high-resolution 1024 x 768 images, patches of equal size were randomly cropped from the training images to increase the number of training samples.
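Random cropping of fixed-size patches can be sketched as follows. The crop size is not stated in this description, so `crop_h` and `crop_w` are left as parameters; the image is again assumed to be a list of rows.

```python
import random


def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w patch at a uniformly random position inside `image`."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]
```

In practice the same offsets would also be applied to the label mask so that each patch stays aligned with its annotation.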

During testing, the whole image is segmented by combining the prediction results of overlapping sub-images that together cover it.
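A possible way to tile an image into overlapping sub-images and recombine their predictions is sketched below. The tile size, the stride, and the use of averaging in the overlapping regions are all assumptions made for illustration; this description only states that overlapping predictions are combined.

```python
def tile_starts(length, tile, stride):
    """Start offsets of tiles of size `tile` covering [0, length) with step `stride`."""
    if tile >= length:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)  # shift the last tile back so it ends at the border
    return starts


def stitch(height, width, tiles):
    """Average per-pixel scores from (top, left, patch) tiles into one full-size map."""
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top, left, patch in tiles:
        for dy, row in enumerate(patch):
            for dx, value in enumerate(row):
                total[top + dy][left + dx] += value
                count[top + dy][left + dx] += 1
    return [
        [total[y][x] / count[y][x] if count[y][x] else 0.0 for x in range(width)]
        for y in range(height)
    ]
```

For a 1024 x 768 image and hypothetical 512-pixel tiles with a stride of 384, `tile_starts` gives the column offsets 0, 384, 512 and the row offsets 0, 256.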

The evaluation is performed by applying the pixel intersection over union (IoU) score to the set of ground-truth bounding boxes and predicted bounding boxes.

IoU is a measure of the accuracy of an object segmentation algorithm. Whenever the algorithm provides a predicted bounding box, the IoU score can be computed to evaluate the accuracy of the segmentation.

As shown in [Equation 6] below, the IoU score is defined as the size of the intersection of the two bounding boxes divided by the size of their union.

[Equation 6]

$$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$$

Here, P∩G is the region where the predicted bounding box P and the ground-truth bounding box G overlap, and P∪G is the region covered by P and G together.
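The IoU definition can be written out directly. The sketch below gives one version for axis-aligned boxes and one for binary pixel masks; the box format `(x_min, y_min, x_max, y_max)` and the list-of-rows mask layout are assumptions chosen for illustration.

```python
def box_iou(p, g):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(p[2], g[2]) - max(p[0], g[0]))
    inter_h = max(0.0, min(p[3], g[3]) - max(p[1], g[1]))
    inter = inter_w * inter_h
    union = (p[2] - p[0]) * (p[3] - p[1]) + (g[2] - g[0]) * (g[3] - g[1]) - inter
    return inter / union if union > 0 else 0.0


def mask_iou(pred, truth):
    """Pixel IoU of two binary masks given as equally sized 2-D lists of 0/1."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    inter = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    return inter / union if union else 0.0
```

Two 2 x 2 boxes offset by one unit in each direction overlap in a unit square, so their IoU is 1 / (4 + 4 - 1) = 1/7.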

As mentioned above, the ResNet-101 based network was made fully convolutional by replacing the last fully connected layer of the original ResNet-101 network with the dilated convolution cascade of the present invention.

Each cascade consists of dilated convolution layers with rates r of 6, 12, 18, and 25. The fully connected conditional random field layer is decoupled from the ResNet-101 based fully convolutional neural network during the training stage.
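To show what the rate r changes, the following sketch computes the span covered by a dilated filter and applies a one-dimensional dilated convolution. The 3-tap kernel is an assumption (the kernel size is not given here), and the one-dimensional form is used only to keep the example short.

```python
def effective_kernel_size(kernel, rate):
    """Span, in input samples, covered by a `kernel`-tap filter dilated by `rate`."""
    return kernel + (kernel - 1) * (rate - 1)


def dilated_conv1d(signal, weights, rate):
    """'Valid' 1-D dilated convolution (cross-correlation form, as in deep learning)."""
    span = effective_kernel_size(len(weights), rate)
    return [
        sum(w * signal[start + k * rate] for k, w in enumerate(weights))
        for start in range(len(signal) - span + 1)
    ]
```

With an assumed 3-tap kernel, the rates 6, 12, 18, and 25 cover spans of 13, 25, 37, and 51 samples respectively, while the number of weights per filter stays at three.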

The ResNet-101 based fully convolutional neural network was fine-tuned by changing the number of object classes in the last layer and applying a loss function that is the sum of the cross-entropy terms over every spatial position in the final dense feature map.
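The loss described above, a sum of cross-entropy terms over all spatial positions, can be sketched as follows. The softmax normalization and the data layout (a list of per-class scores at each pixel, an integer class index as the label) are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_cross_entropy(logit_map, label_map):
    """Sum of cross-entropy terms over every spatial position.

    logit_map[y][x] is a list of per-class scores; label_map[y][x] is the
    index of the true class at that pixel.
    """
    loss = 0.0
    for logit_row, label_row in zip(logit_map, label_map):
        for logits, label in zip(logit_row, label_row):
            loss -= math.log(softmax(logits)[label])
    return loss
```

With two classes and equal scores at two pixels, each pixel contributes ln 2, so the total loss is 2 ln 2.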

The initial learning rate of the ResNet-101 based fully convolutional neural network was set to 0.001, the momentum to 0.9, and the weight decay to 0.005; the learning rate is multiplied by 0.1 after 20,000 iterations.

The VGG-16 based fully convolutional neural network is similar to the ResNet-101 based fully convolutional neural network except for the number of multi-scale inputs. Instead of the two inputs of the ResNet-101 based network, it takes as input images scaled by three ratios: 1, 0.75, and 0.5.

To train the VGG-16 based fully convolutional neural network, stochastic gradient descent (SGD) was used to minimize the loss function, which is the sum of the cross-entropy terms over every spatial position.

The initial learning rate of the VGG-16 based fully convolutional neural network was set to 0.001, and the mini-batch size was 20 images. The learning rate is multiplied by 0.1 after 2,000 iterations, the momentum was set to 0.9, and the weight decay to 0.0005.
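The training settings above can be illustrated with a small sketch of a step learning-rate schedule and a single SGD update with momentum and weight decay. Whether the learning rate drops once or at every multiple of the step size is not stated, so the repeated drop below is an assumption, as is the particular update convention.

```python
def step_learning_rate(base_lr, iteration, step_size, gamma=0.1):
    """Learning rate when `base_lr` is multiplied by `gamma` every `step_size` iterations."""
    return base_lr * gamma ** (iteration // step_size)


def sgd_momentum_step(weight, grad, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One SGD update for a scalar weight with momentum and L2 weight decay.

    Returns the updated (weight, velocity) pair.
    """
    velocity = momentum * velocity - lr * (grad + weight_decay * weight)
    return weight + velocity, velocity
```

For the VGG-16 based settings, `step_learning_rate(0.001, it, 2000)` stays at 0.001 for the first 2,000 iterations and then falls to 0.0001; the ResNet-101 based settings would use a step size of 20,000 and a weight decay of 0.005.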

As FIG. 9 shows, the ResNet-101 based fully convolutional neural network according to an embodiment of the present invention achieves an IoU score of 78.5, the best result among all methods.

The VGG-16 based fully convolutional neural network according to an embodiment of the present invention achieved an IoU score of 74.8, which is still more accurate than the other methods except the DPN and BoxSup networks.

This result supports the view that multi-scale feature extraction plays an important role in recognizing objects in different contexts and at different scales.

Semantic image segmentation using the multi-scale image input and multi-scale dilated convolution architecture proposed in the present invention is therefore markedly superior to other methods that use only a single-scale image input.

It can also be seen that semantic image segmentation with the ResNet-101 based fully convolutional neural network achieves better performance than with the VGG-16 based fully convolutional neural network.

FIG. 10 is a diagram comparing the processing speeds of the VGG-16 based fully convolutional neural network and the ResNet-101 based fully convolutional neural network according to an embodiment of the present invention.

As shown in FIG. 10, although semantic image segmentation with the ResNet-101 based fully convolutional neural network achieves better accuracy than with the VGG-16 based fully convolutional neural network, the VGG-16 based network is faster.

This is because the VGG-16 based fully convolutional neural network has far fewer trainable parameters than the ResNet-101 based fully convolutional neural network.

This shows that the VGG-16 based fully convolutional neural network can be used for real-time applications of semantic image segmentation.

FIG. 11 is a diagram comparing the performance of the ResNet-101 based and VGG-16 based fully convolutional neural networks when the fully connected conditional random field model according to an embodiment of the present invention is applied.

As shown in FIG. 11, the fully connected conditional random field model according to an embodiment of the present invention effectively improves the performance of the fully convolutional neural network.

Specifically, the fully connected conditional random field model improved the performance of the ResNet-101 based fully convolutional neural network by 0.7% and that of the VGG-16 based fully convolutional neural network by 0.8%.

This result clearly demonstrates the benefit of the fully connected conditional random field model for pixel-level object localization and for recovering object boundaries.

By being combined with the fully connected conditional random field model, the fully convolutional neural network of the present invention can therefore perform semantic image segmentation more accurately and quickly.

The results of semantic image segmentation performed by the fully convolutional neural network of the present invention are described below with reference to FIGS. 12 to 14.

FIG. 12 shows an example of semantic image segmentation that highlights activities involving person-object and person-person interaction, FIG. 13 shows an example of semantic image segmentation of animals, and FIG. 14 shows an example of semantic image segmentation of objects, each according to an embodiment of the present invention.

The images used for semantic image segmentation were collected from the PASCAL VOC 2012 dataset.

As shown in FIGS. 12 to 14, the results of semantic image segmentation by the fully convolutional neural network of the present invention, which is based on multi-scale image input and multi-scale dilated convolution, show that the meaningful parts (that is, the objects) of each image are predicted accurately, and that semantic image segmentation can therefore be performed accurately.

FIG. 15A is a diagram comparing the performance of the fully convolutional neural network according to an embodiment of the present invention with that of other methods on the skin lesion dataset for melanoma detection.

Using the skin lesion analysis dataset for melanoma detection, the accuracy of semantic segmentation by the fully convolutional neural network proposed in the present invention was evaluated against that of other methods.

As shown in FIG. 15A, the ResNet-101 based fully convolutional neural network is considerably more accurate than the VGG-16 based fully convolutional neural network.

Specifically, the ResNet-101 based fully convolutional neural network obtained an IoU score of 83.5, the best result among the compared methods.

This also shows that the fully convolutional neural network of the present invention is effective at detecting melanomas that vary in size and shape.

This is because the fully convolutional neural network can efficiently exploit multi-scale features from the multi-scale image input and the multi-scale dilated convolution.

In addition, the fully connected conditional random field model provides a further 0.4% performance improvement for the ResNet-101 based fully convolutional neural network.

FIG. 15B is a diagram showing the results of semantic image segmentation of the skin lesion dataset for melanoma detection by the fully convolutional neural network according to an embodiment of the present invention.

As shown in FIG. 15B, the fully convolutional neural network according to an embodiment of the present invention delivers high semantic image segmentation performance on skin lesion images, even in the presence of complex variations in melanoma and of artifacts.

FIG. 16 is a flowchart showing the procedure for performing semantic image segmentation according to an embodiment of the present invention.

As shown in FIG. 16, in the procedure for performing semantic image segmentation according to an embodiment of the present invention, the semantic image segmentation apparatus (100) first loads training data from the training data database (310) and then preprocesses the training data (S110).

That is, through preprocessing, the semantic image segmentation apparatus (100) generates multi-scale images of the original images that make up the training data, according to preset scaling ratios.

Next, the semantic image segmentation apparatus (100) learns the generated multi-scale images and generates a learning model for semantic image segmentation (S120).

As described above, the learning is performed through a fully convolutional neural network based on multi-scale image input and dilated convolution. The fully convolutional neural network is a core technical feature of the present invention; because its structure has been described with reference to FIG. 5, a detailed description is omitted here.

Next, when a specific image for semantic image segmentation is input by a user (S130), the semantic image segmentation apparatus (100) performs the same preprocessing as in step S110 to generate multi-scale images of the input image (S140).

Next, the semantic image segmentation apparatus (100) loads the learning model for semantic image segmentation from the learning model database (320), applies the multi-scale images of the specific image to the model to perform semantic image segmentation on that image (S150), and provides the result to the user (S160).
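The flow of steps S110 to S160 can be summarized in a short sketch. Every component here (`resize`, `train`, `predict`) is a hypothetical callable supplied by the caller; the sketch only shows the order of the steps and that the same preprocessing is used for training and for inference.

```python
def preprocess(image, scales, resize):
    """S110 / S140: one resized copy of the image per preset scaling ratio."""
    return [resize(image, s) for s in scales]


def run_pipeline(training_images, query_image, scales, resize, train, predict):
    """Walk through steps S110 to S160 with caller-supplied (hypothetical) components."""
    # S110: load the training data and build its multi-scale images.
    training_pyramids = [preprocess(img, scales, resize) for img in training_images]
    # S120: learn a segmentation model from the multi-scale images.
    model = train(training_pyramids)
    # S130 and S140: receive the query image and preprocess it the same way.
    query_pyramid = preprocess(query_image, scales, resize)
    # S150 and S160: apply the model and hand the result back to the caller.
    return predict(model, query_pyramid)
```

In the apparatus described here the trained model would be stored in, and later loaded from, the learning model database (320) rather than passed directly between the steps.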

As described above, the present invention relates to a semantic image segmentation apparatus and method using a fully convolutional neural network based on multi-scale images and multi-scale dilated convolution. Through a fully convolutional neural network based on multi-scale image input and a cascade architecture of dilated convolutions, semantic segmentation of a specific image can be performed accurately and quickly.

In addition, by combining the fully convolutional neural network with a conditional random field model, the present invention can significantly improve the accuracy of semantic image segmentation.

Although the above description has focused on preferred embodiments of the present invention, the technical idea of the present invention is not limited to them, and each component of the present invention may be changed or modified within the scope of the present invention to achieve the same purpose and effect.

Furthermore, while preferred embodiments of the present invention have been shown and described above, the present invention is not limited to these specific embodiments. Various modifications may of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or prospect of the present invention.

100: semantic image segmentation apparatus 110: image collection unit
120: preprocessing unit 130: learning model generation unit for semantic image segmentation
140: semantic image segmentation unit
150: semantic image segmentation result providing unit
160: database interface unit 170: control unit
200: user terminal 300: database
310: training data database 320: learning model database

Claims

1. A semantic image segmentation apparatus comprising:
a preprocessing unit that preprocesses training data to generate multi-scale images of an original image of the training data; and
a learning model generation unit for semantic image segmentation that generates a learning model for semantic image segmentation by learning the generated multi-scale images,
wherein the learning is performed through a fully convolutional neural network comprising a plurality of sub-networks, each consisting of a convolutional neural network that takes a multi-scale image as input and a multi-scale dilated convolution cascade architecture.

2. The apparatus according to claim 1,
wherein the semantic image segmentation apparatus
recovers the resolution of the feature map output from each of the sub-networks through the multi-scale dilated convolution cascade architecture to generate high-resolution final feature maps,
and readjusts the generated final feature maps to the same resolution.

3. The apparatus according to claim 2,
wherein the multi-scale dilated convolution cascade architecture
includes a plurality of dilated convolution layers connected in cascade to perform dilated convolution,
each dilated convolution layer having a different rate that is higher than the rate of the preceding dilated convolution layer.

4. The apparatus according to claim 1,
wherein the fully convolutional neural network further comprises:
a max-out layer that integrates all feature maps output from the multi-scale dilated convolution cascade architecture to extract the boundary of an object; and
a fully connected conditional random field model that restores the boundary of the extracted object.

5. The apparatus according to claim 1,
wherein the semantic image segmentation apparatus
further comprises a semantic image segmentation unit that, when a specific image for semantic image segmentation is input, controls the preprocessing unit to generate multi-scale images of the specific image, and applies the generated multi-scale images to the learning model for semantic image segmentation to perform semantic image segmentation on the specific image.

6. A semantic image segmentation method comprising:
a preprocessing step of preprocessing training data to generate multi-scale images of an original image of the training data; and
a learning model generation step of generating a learning model for semantic image segmentation by learning the generated multi-scale images,
wherein the learning is performed through a fully convolutional neural network comprising a plurality of sub-networks, each consisting of a convolutional neural network that takes a multi-scale image as input and a multi-scale dilated convolution cascade architecture.

7. The method according to claim 6,
wherein the semantic image segmentation method
recovers the resolution of the feature map output from each of the sub-networks through the multi-scale dilated convolution cascade architecture to generate high-resolution final feature maps,
and readjusts the generated final feature maps to the same resolution.

8. The method according to claim 7,
wherein the multi-scale dilated convolution cascade architecture
includes a plurality of dilated convolution layers connected in cascade to perform dilated convolution,
each dilated convolution layer having a different rate that is higher than the rate of the preceding dilated convolution layer.

9. The method according to claim 6,
wherein the fully convolutional neural network further comprises:
a max-out layer that integrates all feature maps output from the multi-scale dilated convolution cascade architecture to extract the boundary of an object; and
a fully connected conditional random field model that restores the boundary of the extracted object.

10. The method according to claim 6,
wherein the semantic image segmentation method
further comprises a semantic image segmentation step of, when a specific image for semantic image segmentation is input, controlling the preprocessing step to generate multi-scale images of the specific image, and applying the generated multi-scale images to the learning model for semantic image segmentation to perform semantic image segmentation on the specific image.