KR20180130869A

KR20180130869A - CNN For Recognizing Hand Gesture, and Device control system by hand Gesture

Info

Publication number: KR20180130869A
Application number: KR1020170067019A
Authority: KR
Inventors: 전은솜; 문일현; 권재철
Original assignee: 주식회사 케이티
Priority date: 2017-05-30
Filing date: 2017-05-30
Publication date: 2018-12-10
Also published as: KR102343963B1

Abstract

The present invention relates to a convolutional neural network for detecting hand gestures and to a device control system using hand gestures. According to the present invention, a gesture classifier that learns the parameters of a convolutional neural network for detecting hand gestures comprises: a convolutional neural network including a plurality of convolution layers that compute feature maps by performing convolution operations, and a fully connected layer that classifies a detected image by analyzing the feature maps computed by the convolution layers; and a learning engine that trains the convolutional neural network to compute parameters optimized for detecting hand gestures. The convolution layers include: a first convolution layer including a subsampling layer that reduces the size of the feature map computed by a convolution operation on the detected image; a second convolution layer, implemented as a non-subsampling layer, that repeats the convolution operation on the output of the first convolution layer; and a third convolution layer including a subsampling layer that reduces the size of the feature map computed by a convolution operation on the output of the second convolution layer. The type, number, and size of the kernel filters that perform the convolution operations are configured independently for each convolution layer and are computed individually through training by the learning engine.

Description

Technical Field

The present invention relates to a hand gesture detecting device, a convolutional neural network for detecting hand gestures, and a device control system using hand gestures.

The present invention relates to a system for controlling devices using hand gestures. More specifically, it relates to a hand-gesture device control system that designs a convolutional neural network (hereinafter "CNN") structure optimized for extracting the features of hand gestures, classifies hand gestures using a classifier having that CNN structure, and controls peripheral devices.

Recently, there has been active research on natural user interfaces (NUI), which move beyond input devices such as the mouse and keyboard, recognize gestures (natural human movements), and use the recognition results to enable communication between a user and a computing device.

Gesture recognition techniques fall broadly into two categories: rule-based recognition and learning-based recognition. Rule-based recognition sets a fixed threshold measured from the center of the palm and recognizes the hand shape from the number of fingertips that exceed the threshold. Learning-based recognition acquires a database (DB) of the hand shapes to be recognized and recognizes hand shapes through a model generated by training on that database.
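As a rough illustration of the rule-based approach described above, the following Python sketch counts the fingertip candidates that lie farther than a threshold r from the palm center. The function name, the use of Euclidean distance, and the assumption that fingertip candidates arrive as a list of (x, y) points are all hypothetical choices made for illustration; the text does not specify them.

```python
import math


def count_fingertips(palm_center, fingertip_candidates, r):
    """Count candidate points whose distance from the palm center exceeds r.

    palm_center and each candidate are (x, y) tuples (an assumed representation).
    """
    cx, cy = palm_center
    return sum(
        1
        for (x, y) in fingertip_candidates
        if math.hypot(x - cx, y - cy) > r
    )


# Hypothetical example: three candidates, threshold r = 6.
extended = count_fingertips((0, 0), [(0, 10), (3, 4), (8, 6)], 6)
```

Because the count depends directly on r, a threshold tuned for one hand size can miscount for another, which is the weakness of rule-based recognition.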

Because hand size differs from person to person, rule-based recognition has difficulty determining the optimal threshold (r). When the environment changes, the threshold may have to be reset to obtain the optimal value, and if the chosen threshold (r) is not optimal, the recognition rate drops and performance degrades. In addition, rule-based recognition is more limited than learning-based recognition in the variety of hand shapes it can recognize.

Learning-based recognition is based on deep learning, which clusters or classifies a plurality of data items using a learning structure designed so that gestures can be classified accurately. In the field of object recognition in particular, the convolutional neural network (hereinafter "CNN"), a type of deep learning, has attracted attention. The CNN is a model that mimics human brain function, built on the assumption that when a person recognizes an object, the basic features of the object are extracted and the object is then recognized on the basis of the results of complex computations in the brain. A CNN basically uses various filters that extract image features through convolution operations, together with pooling or non-linear activation functions that add nonlinear characteristics.

However, even with such neural network techniques, performance varies sharply depending on the types of functions applied and how the structure of the computation is designed. Designing a CNN appropriately for its purpose is therefore a critical issue that bears directly on performance.

Korean Patent Laid-Open Publication No. 10-2010-0129629, "Method of controlling electronic device operation by motion detection and apparatus employing the same"

The present invention has been devised to solve the problems of the prior art described above.

An object of the present invention is to provide a hand-gesture device control system that designs a CNN structure optimized for classifying various hand shapes and hand gestures, and that trains the designed CNN to extract its parameters automatically.

Another object of the present invention is to provide a hand-gesture device control system that improves device control performance by accurately classifying even remote, contactless hand gestures, using a classifier composed of a CNN optimized for classifying hand gestures and the parameters extracted by training.

Yet another object of the present invention is to provide a hand-gesture device control system in which, rather than relying on a fixed position, a predefined control area, or a preset device to be controlled, users can themselves set the desired position and control area, as well as the peripheral device to be controlled and the control signals.

A gesture classifier according to one aspect is a gesture classifier that learns the parameters of a hand gesture detection convolutional neural network, and comprises: a convolutional neural network composed of a plurality of convolution layers that compute feature maps by performing convolution operations and a fully connected layer that classifies a detected image by analyzing the feature maps computed by the plurality of convolution layers; and a learning engine that trains the convolutional neural network to compute parameters optimized for hand gesture detection. The plurality of convolution layers include: a first convolution layer including a subsampling layer that reduces the size of the feature map computed by a convolution operation on the detected image; a second convolution layer, implemented as a non-subsampling layer, that repeats the convolution operation on the output of the first convolution layer; and a third convolution layer including a subsampling layer that reduces the size of the feature map computed by a convolution operation on the output of the second convolution layer. The type, number, and size of the kernel filters that perform the convolution operation are configured independently for each of the plurality of convolution layers, and the type, number, and size of the kernel filters included in the plurality of convolution layers are computed individually through training by the learning engine.

The first convolution layer and the third convolution layer also perform padding during the convolution operation so that the size of the output remains equal to the size of the input. The padding parameters are configured independently for each of the plurality of convolution layers, and the size of the padding parameter is computed individually for each of the plurality of convolution layers through training by the learning engine.

The first convolution layer and the third convolution layer pass to the subsampling layer an output obtained by applying normalization and then a nonlinear function, in that order, to the feature map computed by the convolution operation.

The second convolution layer includes a plurality of convolution layers, each of which performs padding during the convolution operation so that the size of the output remains equal to the size of the input, and computes its output by applying normalization and then a nonlinear function, in that order, to the feature map computed by the convolution operation. The plurality of convolution layers constituting the second convolution layer are connected in series.

The number of convolution layers constituting the second convolution layer is computed through training by the learning engine.

The second convolution layer includes: a first parallel layer that computes a first feature map by normalizing the feature map computed by a convolution operation on the output of the first convolution layer; a second parallel layer including a first layer, which applies normalization and then a nonlinear function to the feature map computed by a convolution operation on the output of the first convolution layer, and a second layer, which computes a second feature map by normalizing the feature map computed by a convolution operation on the output of the first layer, the first layer and the second layer being connected in series; a fusion layer that performs a sum operation on the first feature map and the second feature map; and a noise reduction layer that applies a nonlinear function to the output of the fusion layer.
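The data flow of this parallel structure can be sketched in Python as follows. This is only an illustration under stated assumptions: feature maps are plain nested lists, and the convolutions, the normalization, and the nonlinear function are passed in as callables with hypothetical parameter names, since the text does not fix their concrete form at this point. ReLU is used as the default nonlinear function purely as an example.

```python
def relu(fmap):
    """Element-wise ReLU on a 2-D feature map stored as nested lists."""
    return [[max(0.0, v) for v in row] for row in fmap]


def add_maps(a, b):
    """Element-wise sum of two feature maps of equal size (the fusion step)."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def parallel_block(x, conv_p1, conv_l1, conv_l2, norm, act=relu):
    # First parallel layer: convolution, then normalization only.
    first_map = norm(conv_p1(x))
    # Second parallel layer, first layer: convolution, normalization, nonlinearity.
    hidden = act(norm(conv_l1(x)))
    # Second parallel layer, second layer: convolution, then normalization only.
    second_map = norm(conv_l2(hidden))
    # Fusion layer: sum of the two feature maps.
    fused = add_maps(first_map, second_map)
    # Noise reduction layer: nonlinear function applied to the fused map.
    return act(fused)


# Stand-in operations (hypothetical): scaling instead of a real convolution,
# and an identity function instead of a real normalization.
def scale(k):
    return lambda fmap: [[k * v for v in row] for row in fmap]


def identity(fmap):
    return fmap


out = parallel_block([[1.0, -2.0]], scale(1.0), scale(2.0), scale(3.0), identity)
```

The stand-ins only make the ordering of the steps visible; a real implementation would substitute actual padded convolutions and a normalization layer.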

The second layer also performs padding during the convolution operation so that the size of the output remains equal to the size of the input, and the size of the padding parameter is computed individually for each of the plurality of convolution layers through training by the learning engine.

A gesture recognition apparatus according to another aspect is a gesture recognition apparatus that recognizes hand gestures to control peripheral devices, and comprises: a gesture usage verification unit that verifies a gesture display area at a user-specified remote position and computes the correspondence between the gesture display area and a monitor so that the coordinates of a gesture movement in the gesture display area are displayed at the corresponding coordinates on the monitor; a gesture registration unit that registers, as specified by the user, the device to be controlled and the gestures used as control signals; and a device control unit that detects and analyzes a gesture in a gesture image captured in the gesture display area and controls the device according to the control command corresponding to the gesture registered by the gesture registration unit. The gesture display area is a user-specified area in which the user conveys hand gesture information, and is created by user definition at a position separated from the monitor by a certain distance.

The correspondence between the gesture display area and the monitor is computed from monitor coordinates displayed on the monitor and from reference coordinates extracted by analyzing images of the hand-area coordinates that the user marks in the gesture display area by following the monitor coordinates.
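One simple way to picture such a correspondence is a per-axis linear mapping fitted from two pairs of matching points. The Python sketch below assumes exactly that: two reference coordinates in the gesture display area and the two monitor coordinates they correspond to, with no rotation or perspective distortion. The function name and this particular model are illustrative assumptions, not the method stated in the text.

```python
def fit_axis_mapping(hand_pts, monitor_pts):
    """Fit x' = ax * x + bx and y' = ay * y + by from two point pairs.

    hand_pts: two (x, y) reference coordinates in the gesture display area.
    monitor_pts: the two (x, y) monitor coordinates they correspond to.
    Returns a function that maps a hand coordinate to a monitor coordinate.
    """
    (hx0, hy0), (hx1, hy1) = hand_pts
    (mx0, my0), (mx1, my1) = monitor_pts
    ax = (mx1 - mx0) / (hx1 - hx0)
    ay = (my1 - my0) / (hy1 - hy0)
    bx = mx0 - ax * hx0
    by = my0 - ay * hy0

    def to_monitor(point):
        x, y = point
        return (ax * x + bx, ay * y + by)

    return to_monitor


# Hypothetical calibration: a 256 x 256 hand region mapped to a 1024 x 768 monitor.
to_monitor = fit_axis_mapping(((100, 50), (356, 306)), ((0, 0), (1024, 768)))
center = to_monitor((228, 178))
```

With more than two reference points, a least-squares fit or a homography would be a natural extension, but that goes beyond what the text describes.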

The device control unit includes a gesture classifier that detects a gesture in the gesture image and analyzes the gesture type, and the gesture classifier is implemented using a gesture detection convolutional neural network that includes trained parameters.

The gesture classifier comprises: a convolutional neural network composed of a plurality of convolution layers that compute feature maps by performing convolution operations and a fully connected layer that classifies a detected image by analyzing the feature maps computed by the plurality of convolution layers; and a learning engine that trains the convolutional neural network to compute parameters optimized for hand gesture detection. The plurality of convolution layers include: a first convolution layer including a subsampling layer that reduces the size of the feature map computed from the detected image; a second convolution layer that includes a non-subsampling layer and repeats the convolution operation on the output of the first convolution layer; and a third convolution layer including a subsampling layer that reduces the size of the feature map computed from the output of the second convolution layer. The type, number, and size of the kernel filters that perform the convolution operation are configured independently for each of the plurality of convolution layers, and the type, number, and size of the kernel filters included in the plurality of convolution layers are computed individually through training by the learning engine.

With the configuration described above, the present invention has the following effects.

The present invention provides a design structure for a customized CNN optimized for classifying hand shapes and hand gestures.

By training the hand-gesture-customized CNN, the present invention can automatically extract the various parameters optimized for hand gesture classification, so that control signals can be generated from the hand shapes and gestures of a wide variety of people.

By providing a classifier composed of a CNN optimized for classifying hand gestures, the present invention can be expected to improve hand-gesture device control performance by accurately classifying even remote, contactless hand gestures.

The present invention allows users to freely set the desired position and control area, as well as the peripheral device to be controlled and the control signals.

FIG. 1 is a block diagram showing the configuration of a gesture classifier according to an embodiment.
FIG. 2 is a diagram showing the structure of a convolutional neural network (CNN) according to an embodiment.
FIG. 3 is a conceptual diagram illustrating the convolution operation performed by the convolution layer of FIG. 2.
FIG. 4 is a conceptual diagram illustrating the pooling performed by the subsampling layer of FIG. 2.
FIG. 5 is an example in which the parameters and the number of convolution layers have been computed by training the CNN of FIG. 2.
FIG. 6 is a diagram showing the structure of a CNN according to another embodiment.
FIG. 7 is an example in which parameters have been computed by training the CNN of FIG. 6.
FIG. 8 is an example in which a plurality of second convolution layers are connected in series in the CNN structure of FIG. 6.
FIG. 9 is an example of training images used for training the CNN according to an embodiment.
FIG. 10 is an example of the final classification results produced by a hand gesture classifier using the CNN structure according to an embodiment.
FIG. 11 is an overall conceptual diagram showing a device control system using gestures according to yet another embodiment.
FIG. 12 is a block diagram illustrating the gesture recognition apparatus of FIG. 11.
FIG. 13 is a diagram showing an example of gesture detection according to an embodiment.

Hereinafter, embodiments of the present invention are described in more detail with reference to the accompanying drawings. The embodiments may be modified in various forms, and the scope of the present invention should not be construed as limited to the embodiments below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art. Although specific terms are used in the drawings and the specification, they are used only to describe the present invention and not to limit its meaning or the scope of the invention set forth in the claims. Those of ordinary skill in the art will therefore understand that various modifications and other equivalent embodiments are possible. Accordingly, the true technical scope of protection of the present invention should be determined by the technical idea of the appended claims.

The convolutional neural network for detecting hand gestures and the hand-gesture device control system of the present invention are now described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a gesture classifier according to an embodiment.

Referring to FIG. 1, the gesture classifier (1) includes a convolutional neural network (11) and a learning engine (13). These components may be implemented in hardware, in software, or through a combination of hardware and software. The gesture classifier (1) may also include a memory and one or more processors, and the functions of the convolutional neural network (11) and the learning engine (13) may be implemented in the gesture classifier (1) as programs stored in the memory and executed by the one or more processors.

The convolutional neural network (11) is trained in depth by the learning engine (13) and, according to an embodiment, can recognize hand gesture images with high precision. The convolutional neural network (11) according to an embodiment is a type of deep learning used in the field of object recognition, and in particular may be designed as a CNN (Convolutional Neural Network) structure optimized for recognizing hand gestures or hand shapes.

The learning engine (13) can train the convolutional neural network (11) to compute parameters. A hand can open and close, move quickly, rotate, and take on many different finger configurations; its shape can change quickly and substantially, and several hand gestures may be used at the same time. Accordingly, a structure for the convolutional neural network (11) according to the embodiments is presented, and the learning engine (13) trains the convolutional neural network (11) to compute parameters optimized for hand gesture detection, thereby completing a convolutional neural network (11) structure that accurately classifies various hand shapes or hand gestures. Here, the parameters include not only the type, number, and size of the filters (e.g., the kernel filters that perform convolution operations) but also the number of layers described below.

Hereinafter, with reference to FIGS. 2 to 8, structures of the convolutional neural network (11) according to various embodiments are described, along with examples of parameters optimized for those structures.

FIG. 2 is a diagram showing the structure of a convolutional neural network (CNN) according to an embodiment, FIG. 3 is a conceptual diagram illustrating the convolution operation performed by the convolution layer of FIG. 2, FIG. 4 is a conceptual diagram illustrating the pooling performed by the subsampling layer of FIG. 2, and FIG. 5 is an example in which the parameters and the number of convolution layers have been computed by training the CNN of FIG. 2.

Referring to FIG. 2, the convolutional neural network includes a convolution layer (21) and a fully connected layer (23).

The convolution layer (21) performs a convolution operation on the input image using a convolution filter (also called a kernel or mask) and generates a feature map. Here, the convolution operation means extracting every possible n×n partial region (or receptive field) from the entire input image, multiplying each value of the n×n partial region by the corresponding element of a convolution filter composed of n×n parameters matching the size of the partial region, and summing the products (that is, taking the inner product of the filter and the partial region). A feature map is image data in which various features of the input image are expressed. The number of computed feature maps does not necessarily correspond to the number of convolution filters and, depending on the method of the convolution operation, may differ from it.
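The operation described above can be sketched in plain Python as follows. This is a minimal illustration that assumes a square, single-channel image and a single square filter stored as nested lists, with a stride of 1 and no padding; the function name is hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide an n x n kernel over an m x m image and sum element-wise products.

    Every possible n x n partial region is visited once, so the output
    has (m - (n - 1)) rows and columns.
    """
    m = len(image)
    n = len(kernel)
    out_size = m - (n - 1)
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Inner product of the kernel and the partial region at (i, j).
            total = 0
            for a in range(n):
                for b in range(n):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
feature_map = conv2d_valid(image, kernel)
```

Each entry of the resulting feature map is the response of the filter at one position of the input.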

The convolution layer (21) includes a plurality of convolution layers (L₁, L₂, L₃, …, L_N), which can be divided by function into a first convolution layer (211: L₁), a second convolution layer (213: L₂, L₃, …, L_N-1), and a third convolution layer (215: L_N).

Referring to FIG. 2, in the convolution layer (21) according to an embodiment, the first convolution layer (211) and the third convolution layer (215) perform a step (POOL) that reduces the size of the feature map by subsampling or pooling, whereas the second convolution layer (213) does not. Thus, after the convolution operation and pooling in the first convolution layer (211), the second convolution layer (213) repeats only the convolution operation several times without pooling, and is designed so that the number of output feature maps increases. This enables in-depth learning while preserving the individual features of the hand gesture images to be learned and classified.

The first convolution layer (211) includes a convolution layer (L1a), which receives the image to be classified (hereinafter, the detected image) as its input image and generates a feature map by performing a convolution operation (CONV), and a subsampling layer (L1b), which reduces the size of the feature map through sampling or pooling. According to an embodiment, the convolution layer (L1a) also performs padding during the convolution operation so that the image size is the same before and after the convolution operation. The convolution layer (L1a) then applies normalization (NORM) and a nonlinear function (RELU or PRELU), in that order, to the feature map computed by the convolution operation and passes the result to the subsampling layer (L1b). In other words, the convolution layer (L1a) passes a feature map to which normalization and the nonlinear function (RELU or PRELU) have been applied to the subsampling layer (L1b).
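The post-convolution steps of this layer (NORM, then RELU or PRELU, then POOL) can be sketched in Python as follows. The text does not specify the kind of normalization or pooling, so this sketch assumes zero-mean, unit-variance normalization over a single feature map and non-overlapping 2 × 2 max pooling; those choices, the PReLU slope value, and all function names are illustrative assumptions.

```python
def normalize(fmap, eps=1e-5):
    """Assumed normalization: zero mean and unit variance over one feature map."""
    values = [v for row in fmap for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = (var + eps) ** 0.5
    return [[(v - mean) / std for v in row] for row in fmap]


def relu(fmap):
    """ReLU: negative values become zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def prelu(fmap, alpha):
    """PReLU: negative values are scaled by a slope alpha (learned in practice)."""
    return [[v if v > 0 else alpha * v for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Assumed subsampling: non-overlapping size x size max pooling."""
    rows = len(fmap) // size
    cols = len(fmap[0]) // size
    return [
        [
            max(fmap[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def after_convolution(fmap):
    """Order used by L1a and L1b: NORM, nonlinear function, then POOL."""
    return max_pool(relu(normalize(fmap)))


conv_output = [[1.0, 2.0, 3.0, 4.0],
               [5.0, 6.0, 7.0, 8.0],
               [9.0, 10.0, 11.0, 12.0],
               [13.0, 14.0, 15.0, 16.0]]
pooled = after_convolution(conv_output)
```

With 2 × 2 pooling, a 4 × 4 feature map becomes 2 × 2, which is the size reduction attributed to the subsampling layer.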

When the input image has size m × m and every n × n partial region (or receptive field) is extracted and convolved (CONV), each output image has size (m - (n - 1)) × (m - (n - 1)). Compared with the input image, the output of the convolution operation is therefore smaller by n - 1 in both width and height. For example, extracting every 3 × 3 partial region from a 6 × 6 input and applying the convolution operation gives an output of size (6 - (3 - 1)) × (6 - (3 - 1)) = 4 × 4. Accordingly, in an embodiment, the first convolution layer (211) applies padding to prevent the output from shrinking and to keep the output the same size as the input. Padding means using an odd n and adding a blank border of thickness [n / 2] to the top, bottom, left, and right of the input image, where the square brackets denote the Gauss symbol (the floor function).
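The size arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and filling the blank border with zeros is an assumption (the text only says that a blank border is added).

```python
def conv_output_size(m, n, pad=0):
    """Output width/height for an m x m input, n x n filter, stride 1."""
    return m + 2 * pad - (n - 1)


def same_padding(n):
    """Border thickness floor(n / 2) that keeps the output size equal to the input."""
    if n % 2 == 0:
        raise ValueError("n must be odd for symmetric padding")
    return n // 2


def zero_pad(image, pad):
    """Add a border of zeros of the given thickness on all four sides."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


unpadded = conv_output_size(6, 3)                   # 6 - (3 - 1)
padded = conv_output_size(6, 3, same_padding(3))    # 6 + 2 * 1 - (3 - 1)
```

Adding a border of thickness floor(n / 2) on each side enlarges the input by n - 1 in each dimension, which exactly cancels the shrinkage caused by the convolution.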

The spacing between adjacent sub-regions (receptive fields) is called the stride. When the stride is greater than 1, the width and height of the output are smaller than those of the input. For example, with a stride of 2, the width and height of the output are each half those of the input.
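As a hedged illustration of the stride rule, the sketch below uses the output-size formula commonly used for strided convolutions, floor((m + 2p - n) / s) + 1. The formula, the rounding direction, and the name `strided_output_size` are assumptions for illustration; the specification itself only states the halving behaviour.

```python
def strided_output_size(m: int, n: int, stride: int = 1, pad: int = 0) -> int:
    """Output side length for an m x m input and an n x n kernel.

    Assumes the window is placed every `stride` pixels and that partial
    windows at the border are dropped (floor rounding).
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return (m + 2 * pad - n) // stride + 1


for stride in (1, 2):
    # 8 x 8 input, 3 x 3 kernel, border padding of 1
    print(stride, strided_output_size(8, 3, stride=stride, pad=1))
# prints "1 8" and then "2 4": stride 2 halves the side length
```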

The second convolution layer (213) contains no subsampling layer for reducing the feature-map size through sampling or pooling. It consists of a plurality of convolution layers (L₂, L₃, …, L_N-1), each of which performs a convolution operation (CONV) to generate feature maps. These layers are connected in series so that the output of each layer becomes the input of the next.

Each convolution layer (L₂, L₃, …, L_N-1) of the second convolution layer (213) applies padding during the convolution operation (CONV) so that the output keeps the same size as the input, then applies normalization (NORM) and a nonlinear function (RELU or PRELU), in that order, to the resulting feature maps to produce its output feature maps.

The third convolution layer (215) receives the output of the second convolution layer (213) as its input. It includes a convolution layer (L_Na), which performs a convolution operation (CONV) to generate feature maps, and a subsampling layer (L_Nb), which reduces the size of the feature maps through sampling or pooling. According to one embodiment, the convolution layer (L_Na) applies padding during the convolution operation so that the output keeps the same size as the input, applies normalization (NORM) and then a nonlinear function (RELU or PRELU) to the resulting feature maps, and passes the output to the subsampling layer (L_Nb). The subsampling layer (L_Nb) then reduces the image size and passes the result to the fully connected layer (23).

The fully connected layer (23) analyzes the images received from the convolution layer (21) and makes the final decision on which category they belong to. The fully connected layer (23) may be implemented as global average pooling or as a fully-connected layer. For example, the fully connected layer (23) analyzes the output of the convolution layer (21), that is, the feature maps, and finally determines whether the hand gesture shown in the detection image is scissors, rock, paper, or some other hand shape.
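One possible reading of this classification stage is sketched below: global average pooling reduces each feature map to a single number, and a linear layer followed by softmax turns those numbers into class probabilities. Combining the two, the softmax step, and all weights and names (`global_average_pool`, `classify`) are assumptions made for illustration; the specification only names the two implementation options.

```python
import math


def global_average_pool(maps):
    """Reduce each 2-D feature map to its mean value."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0])) for fmap in maps]


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(maps, weights, biases):
    """Hypothetical head: pooled features -> linear scores -> probabilities."""
    features = global_average_pool(maps)
    scores = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


# Two tiny 2 x 2 feature maps and made-up weights for three classes.
maps = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
biases = [0.0, 0.0, 0.0]
probs = classify(maps, weights, biases)
print(probs.index(max(probs)))  # index of the most probable class
```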

Referring to FIG. 3, according to one embodiment, the first convolution layer (L₁) produces ℓ feature maps (W₁), the second convolution layer (L₂) produces m feature maps (W₂), the third convolution layer (L₃) produces n feature maps (W₃), and so on, until the N-th convolution layer (L_N) produces g feature maps (W_N). For a single input (100), the first convolution layer (L₁) generates ℓ feature maps by convolution, and the second convolution layer (L₂) takes those ℓ feature maps (W₁) as input and convolves them to generate m feature maps (W₂). Likewise, the third convolution layer (L₃) takes the m feature maps (W₂) output by the second convolution layer (L₂) and convolves them to generate n feature maps (W₃), and the convolution is repeated in the same way through each convolution layer (L). When the convolution operation of the final, N-th convolution layer (L_N) is complete, the final g feature maps (W_N) are obtained. The input (100), which is the input image, may have one or more channels: for example, 1 channel for an 8-bit image and 3 channels for a 32-bit image. Here, the ℓ feature maps (W₁) are image data in which various features of the input (100) are expressed.
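The chaining of feature-map counts (1 input channel to ℓ maps to m maps) can be sketched in plain Python. The function `conv_layer`, the all-ones kernels, and the choice of ℓ = 3 and m = 5 are hypothetical; zero padding and stride 1 are assumed, and normalization and the nonlinear function are omitted for brevity.

```python
def conv_layer(maps, kernels):
    """Convolve C_in input maps with C_out kernels, each of shape C_in x n x n.

    Returns C_out feature maps with the same height and width as the input
    (zero padding of thickness n // 2, stride 1).
    """
    h, w = len(maps[0]), len(maps[0][0])
    n = len(kernels[0][0])
    pad = n // 2
    out = []
    for kernel in kernels:
        fmap = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                total = 0.0
                for c, channel in enumerate(maps):
                    for di in range(n):
                        for dj in range(n):
                            y, x = i + di - pad, j + dj - pad
                            if 0 <= y < h and 0 <= x < w:  # outside pixels count as zero
                                total += kernel[c][di][dj] * channel[y][x]
                fmap[i][j] = total
        out.append(fmap)
    return out


image = [[1.0] * 4 for _ in range(4)]                  # one-channel 4 x 4 input
k1 = [[[[1.0] * 3 for _ in range(3)]] for _ in range(3)]  # 3 kernels, each 1 x 3 x 3
k2 = [[[[1.0]] for _ in range(3)] for _ in range(5)]      # 5 kernels, each 3 x 1 x 1
w1 = conv_layer([image], k1)   # 3 feature maps
w2 = conv_layer(w1, k2)        # 5 feature maps built from all 3 maps of w1
print(len(w1), len(w2))
```

Each kernel of a layer spans every feature map of the previous layer, so the number of output maps is set by the number of kernels, not by the number of input maps.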

The convolution filters (kernel filters) may differ in type, number, and size (n×n) from one convolution layer (L) to another, and the multiple convolution filters within a single convolution layer (L) may also be of different types. A convolution filter may be a color-related filter, such as red, green, or blue, or a filter with properties suited to finding various other features of a hand.

Referring to FIG. 4, according to one embodiment, the first convolution layer (211: L₁) includes a convolution layer (211a) and a subsampling layer (211b); the second through (N-1)-th convolution layers (213: L₂, L₃, …, L_N-1) consist of convolution layers only, with no subsampling layer; and the final, N-th convolution layer (215: L_N) includes a convolution layer (215a) and a subsampling layer (215b).

For example, in the first convolution layer (211: L₁), when the convolution layer (211a) uses kernel filters to produce three feature maps (101) from a single input (100) by convolution (CONV), the subsampling layer (211b) pools or samples the three feature maps (101) to produce reduced-size feature maps (102). The convolution operation is then repeated on the feature maps (102) in the second through (N-1)-th convolution layers (213: L₂, L₃, …, L_N-1), which increases the number of feature maps. In the final convolution layer (215: L_N), once the convolution layer (215a) completes its convolution and the final g feature maps (100’) are produced, the subsampling layer (215b) pools or samples the feature maps (100’) to produce reduced-size feature maps (100”).

FIG. 5 shows the learning data that the learning engine (13) obtained by training the convolutional neural network designed with the structure shown in FIG. 2. The learning data includes not only the parameters optimized for hand gesture detection but also the number of layers (N) of the serially connected convolution layer (21), that is, the number of layers in the second convolution layer (213: L₂, L₃, …, L_N-1).

Referring to FIG. 5, the convolution layer (21) consists of 10 layers in total (N = 10), of which the second convolution layer (213) is made up of eight layers connected in series.

According to the embodiment, in the first convolution layer (211: L₁), the convolution layer (L_1a) performs a convolution operation (CONV) with kernel filters of size 3×3 (ker 3), then applies normalization (NORM) and a nonlinear function (PRELU), in that order, to the resulting maps to produce 16 final feature maps (out 16). During the convolution operation, the convolution layer (L_1a) applies padding (e.g., zero padding, n = 1) that adds a blank border of thickness [n/2 = 1/2] to the top, bottom, left, and right of the feature map resulting from the convolution. The subsampling layer (L_1b) then reduces the size of the feature maps by pooling with a 3×3 window (ker 3), a spacing of 2 between adjacent receptive fields (stride 2), and selection of the maximum value (max). Thus, for a single input (detection image), the first convolution layer (211) outputs 16 feature maps.

The second convolution layer (213: L₂, L₃, …, L₉) performs the convolution operation repeatedly. That is, the second through ninth convolution layers (213: L₂, L₃, …, L₉) are connected in series so that the output of each layer is fed to the next layer as input, and each applies convolution (CONV), normalization (NORM), and a nonlinear function (PRELU), in that order, to produce its output feature maps.

According to one embodiment, the second layer (L₂) performs a convolution operation (CONV) with 1×1 (ker 1) kernel filters and padding (n = 0) to generate 16 feature maps (out 16). The third layer (L₃) performs convolution (CONV) with 1×1 (ker 1) kernel filters and padding (n = 0) to generate 32 feature maps (out 32). The fourth layer (L₄) performs convolution (CONV) with 1×1 (ker 1) kernel filters and padding (n = 0) to generate 64 feature maps (out 64). The fifth layer (L₅) performs convolution (CONV) with 3×3 (ker 3) kernel filters and padding (n = 1) to generate 64 feature maps (out 64). The sixth layer (L₆) performs convolution (CONV) with 1×1 (ker 1) kernel filters and padding (n = 0) to generate 64 feature maps (out 64). The seventh layer (L₇) performs convolution (CONV) with 3×3 (ker 3) kernel filters and padding (n = 1) to generate 128 feature maps (out 128). The eighth layer (L₈) performs convolution (CONV) with 1×1 (ker 1) kernel filters and padding (n = 0) to generate 128 feature maps (out 128). The ninth layer (L₉) performs convolution (CONV) with 3×3 (ker 3) kernel filters and padding (n = 1) to generate 256 feature maps (out 256). These parameters are learning data that the learning engine (13) obtained by training the convolutional neural network designed with the structure shown in FIG. 2, and they represent an embodiment optimized for hand gesture detection.
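The per-layer parameters listed for L₂ through L₉ can be collected into a small table and checked programmatically. The sketch below confirms that each listed padding value preserves the feature-map size and tracks how the number of feature maps grows. The weight count is a derived illustration that assumes every kernel spans all input maps and ignores biases, normalization, and PRELU parameters; the names `SECOND_STAGE` and `summarize` are hypothetical.

```python
# (layer, kernel size "ker", padding "n", feature maps "out") as listed for L2..L9
SECOND_STAGE = [
    ("L2", 1, 0, 16), ("L3", 1, 0, 32), ("L4", 1, 0, 64), ("L5", 3, 1, 64),
    ("L6", 1, 0, 64), ("L7", 3, 1, 128), ("L8", 1, 0, 128), ("L9", 3, 1, 256),
]
FIRST_STAGE_OUT = 16  # L1 hands 16 feature maps to L2


def summarize(layers, in_maps):
    """Return (name, maps in, maps out, size preserved?, kernel weights) per layer."""
    rows = []
    for name, ker, pad, out in layers:
        keeps_size = 2 * pad == ker - 1        # output side = m + 2 * pad - (ker - 1)
        weights = in_maps * out * ker * ker    # assumption: kernel weights only, no bias
        rows.append((name, in_maps, out, keeps_size, weights))
        in_maps = out                          # output maps feed the next layer
    return rows


rows = summarize(SECOND_STAGE, FIRST_STAGE_OUT)
for row in rows:
    print(row)
```

Under these assumptions the 3×3 layers dominate the weight count, which is consistent with interleaving them with cheaper 1×1 layers.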

Thus the convolution operation is repeated in the second convolution layer (213: L₂, L₃, …, L₉). When each of the convolution layers (213: L₂, L₃, …, L₉) performs its convolution, padding (e.g., zero padding) that adds a blank border of thickness [n/2] to the top, bottom, left, and right of the resulting feature map is applied as well. According to the embodiment, the second convolution layer (213) repeats only the convolution operation, without pooling or sampling.

In the final, tenth convolution layer (L₁₀), that is, the third convolution layer (215), the convolution layer (L_10a) takes the output (feature maps) of the second convolution layer (213) as input and performs a convolution operation with 256 kernel filters (W₁₀), then applies normalization (NORM) and a nonlinear function (PRELU), in that order, to the output of the convolution (CONV) to produce 256 final output feature maps. During the convolution operation, the convolution layer (L_10a) applies padding (e.g., zero padding, n = 1) that adds a blank border of thickness [n/2 = 1/2] to the top, bottom, left, and right of the feature map resulting from the convolution. The subsampling layer (L_10b) then reduces the size of the feature maps by pooling (POOL) with a 5×5 window (ker 5), a spacing of 2 between adjacent receptive fields (stride 2), and selection of the average value (average).
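The two pooling configurations (3×3 max with stride 2 in L_1b, 5×5 average with stride 2 in L_10b) can be sketched with one generic routine. The function `pool2d` and the 9×9 test map are hypothetical, and the sketch assumes that only windows lying fully inside the map are used; frameworks differ in how they treat partial windows at the border.

```python
def pool2d(fmap, k, stride, reduce):
    """Slide a k x k window over a 2-D map and reduce each window to one value."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(0, h - k + 1, stride):
        row = []
        for j in range(0, w - k + 1, stride):
            window = [fmap[i + di][j + dj] for di in range(k) for dj in range(k)]
            row.append(reduce(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


# A 9 x 9 map whose entries simply count upward row by row.
fmap = [[float(r * 9 + c) for c in range(9)] for r in range(9)]
pooled_max = pool2d(fmap, k=3, stride=2, reduce=max)   # settings listed for L_1b
pooled_avg = pool2d(fmap, k=5, stride=2, reduce=mean)  # settings listed for L_10b
print(len(pooled_max), len(pooled_avg))  # side lengths after pooling
```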

For a single detection image, the final output of the convolution layer (21) is 256 feature maps, which are passed to the fully connected layer (23: FC) and serve as the basis for determining which category the detection image belongs to.

By training the convolutional neural network with the design structure shown in FIG. 2, the learning engine (13) can determine the number of layers of the convolution layer (21) best suited to hand gesture classification (N = 10), as shown in FIG. 5. The type, number (out), and size (ker) of the kernel filters (W) that perform the convolution operation need not be the same across the convolution layers (L₁, L₂, L₃, …, L₉, L₁₀) and can be configured independently for each layer; as reviewed above, these parameters can be obtained individually for each layer through training by the learning engine (13). The padding parameter (n) used during the convolution operation can likewise be obtained individually for each convolution layer (L₁, L₂, L₃, …, L₉, L₁₀) through training by the learning engine (13).

FIG. 6 shows the structure of a convolutional neural network (CNN) according to another embodiment, FIG. 7 shows an example of the parameters obtained by training the convolutional neural network (CNN) of FIG. 6, and FIG. 8 shows an example in which a plurality of second convolution layers are connected in series within the convolutional neural network (CNN) structure of FIG. 6.

Referring to FIG. 6, the convolutional neural network according to another embodiment includes a convolution layer (21: 211, 213, 215) and a fully connected layer (23).

The convolution layer (21) includes five convolution layers (L₁, L₂, L₃, L₄, L₅), which can be divided by function into a first convolution layer (211: L₁), a second convolution layer (213: L₂, L₃, L₄), and a third convolution layer (215: L₅).

The convolutional neural network shown in FIG. 6 may be designed so that only the second convolution layer (213) differs from the network shown in FIG. 2, with the rest of the configuration the same. That is, the first convolution layer (211: L₁) and the third convolution layer (215: L₅) of FIG. 6 correspond to the first convolution layer (211: L₁) and the third convolution layer (215: L_N) of FIG. 2, respectively, although the parameters of those layers may come out differently through training by the learning engine (13). Accordingly, referring to FIG. 6, the first convolution layer (211: L₁) and the third convolution layer (215: L₅) perform subsampling or pooling to reduce the size of the feature maps, whereas the second convolution layer (213: L₂, L₃, L₄) does not reduce the size of the feature maps.

Compared with FIG. 2, the convolutional neural network designed as in FIG. 6 raises processing speed by reducing the number of convolution operations. Rather than simply reducing that number, however, it merges a feature map that has undergone one convolution operation with a feature map that has undergone two convolution operations into a single map and uses the merged feature map in the next stage, so that the accuracy of hand gesture recognition is maintained as well. Because the first convolution layer (211: L₁) and the third convolution layer (215: L₅) of FIG. 6 correspond to the first convolution layer (211: L₁) and the third convolution layer (215: L_N) of FIG. 2, respectively, only the second convolution layer (213: L₂, L₃, L₄) is described in detail below.

The second convolution layer (213: L₂, L₃, L₄) includes a first parallel layer (2131), a second parallel layer (2133: 2133a, 2133b), a fusion layer (2135), and a noise reduction layer (2137). The first parallel layer (2131) and the second parallel layer (2133: 2133a, 2133b) are connected in parallel.

The first parallel layer (2131) takes the output of the first convolution layer (211: L₁) as input, performs a convolution operation (CONV), and applies normalization (NORM) to the resulting feature maps to produce a first feature map.

The second parallel layer (2133: 2133a, 2133b) includes a first layer (2133a) and a second layer (2133b) connected in series. The first layer (2133a) takes the output of the first convolution layer (211: L₁) as input, performs a convolution operation (CONV), and applies normalization (NORM) and then a nonlinear function (e.g., PRELU or RELU) to the resulting feature maps. The second layer (2133b) takes the output of the first layer (2133a) as input, performs a convolution operation (CONV), and applies normalization (NORM) to the resulting feature maps to produce a second feature map.

The fusion layer (2135) performs a fusion operation (fusion, concatenation) on the first feature map and the second feature map to generate a single map. Depending on the embodiment, the first and second feature maps have the same width and height, but when the number of convolution operations and the output size used to extract each of them differ, their three-dimensional sizes may differ. Fusing or concatenating the first and second feature maps, which are obtained through structures with different numbers of convolution operations, makes it possible to extract features of a single image from several aspects. That is, because features are extracted using different convolution operations (different numbers of operations and different weights), multi-aspect features can be extracted from a single input, which can improve hand gesture classification performance. In addition, because the first and second feature maps are each normalized (NORM) before being combined, the characteristics of each are preserved.

The noise reduction layer (2137) applies a nonlinear function (e.g., PRELU) to the feature map produced by fusing the first and second feature maps. It passes its output to the third convolution layer (215: L₅), which performs the convolution operation and pooling.
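The data flow through the second convolution layer of FIG. 6 can be sketched as follows. Several simplifications are assumptions made only for illustration: every convolution is a 1×1 convolution, NORM is replaced by per-map standardization, PRELU uses a fixed slope instead of a learned one, and fusion is taken to be concatenation along the feature-map axis. All function names and weights are hypothetical.

```python
def pointwise_conv(maps, weights):
    """1x1 convolution: every output map is a weighted sum of the input maps."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[[sum(wt * m[i][j] for wt, m in zip(row, maps)) for j in range(w)]
             for i in range(h)] for row in weights]


def normalize(maps, eps=1e-5):
    """Stand-in for NORM: scale each map to zero mean and unit variance."""
    out = []
    for m in maps:
        vals = [v for row in m for v in row]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        out.append([[(v - mean) / (var + eps) ** 0.5 for v in row] for row in m])
    return out


def prelu(maps, slope=0.25):
    """PRELU with a fixed negative slope (the real slope would be learned)."""
    return [[[v if v > 0 else slope * v for v in row] for row in m] for m in maps]


def parallel_block(x, w_a, w_b1, w_b2):
    first = normalize(pointwise_conv(x, w_a))            # first parallel layer 2131
    hidden = prelu(normalize(pointwise_conv(x, w_b1)))   # first layer 2133a
    second = normalize(pointwise_conv(hidden, w_b2))     # second layer 2133b
    fused = first + second                                # fusion layer 2135 (concatenation)
    return prelu(fused)                                   # noise reduction layer 2137


x = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, -1.0], [2.0, 0.0]]]       # two 2 x 2 input maps
w_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                      # 2 maps -> 3 maps
w_b1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0], [0.0, 2.0]]       # 2 maps -> 4 maps
w_b2 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)] + [[1.0] * 4]
out = parallel_block(x, w_a, w_b1, w_b2)
print(len(out))  # 3 maps from the first branch plus 5 from the second
```

The two branches keep the same width and height but may contribute different numbers of maps, which matches the statement that their three-dimensional sizes may differ.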

FIG. 7 shows the learning data that the learning engine (13) obtained by training the convolutional neural network designed with the structure shown in FIG. 6; the learning data consists of parameters optimized for hand gesture detection.

As shown in FIGS. 2 and 5, repeating the convolution operation by connecting a series of convolution layers one after another enables deep learning that raises accuracy, but the larger amount of computation can slow processing. If, however, the number of convolution operations (the learning depth), that is, the number of kernel filters and feature maps, is reduced somewhat in a learning structure such as that of FIGS. 2 and 5, processing becomes faster, but it may become impossible to learn to extract and classify feature values accurately, and the accuracy of hand gesture classification may then fall. Accordingly, FIG. 6 shows, as another embodiment, a design structure that improves processing speed without greatly lowering classification accuracy, and FIG. 7 shows parameters with which hand gestures can be classified accurately when training with this structure. FIG. 8 shows, as yet another embodiment, a convolutional neural network designed by connecting the second convolution layer (213) of the embodiment of FIG. 6 in series multiple times.

Referring to FIG. 8, the convolutional neural network according to yet another embodiment connects the second convolution layer (213: L₂, L₃, L₄) structure of the embodiment of FIG. 6 in series multiple times, while the first convolution layer (211: L₁) and the third convolution layer (215: L₅) are configured in the same way as in FIG. 6.

Because the second convolution layer (213: L₂, L₃, L₄) is repeated multiple times, the convolutional neural network of FIG. 8 is effective for classifying many objects (e.g., hand gestures) that have similar shapes or features. That is, a structure in which the second convolution layer (213: L₂, L₃, L₄) is inserted several times may take longer to train and classify than one in which it is inserted once, but it allows deeper learning and is therefore effective for classifying a larger variety of objects.

FIG. 9 shows examples of training images used to train the convolutional neural network (CNN) according to the embodiment, and FIG. 10 shows an example of the final classification result produced by a hand gesture classifier that uses the convolutional neural network (CNN) structure according to the embodiment.

The images in FIG. 9 are examples of inputs used when the learning engine (13) trains the convolutional neural network designed according to the embodiment. The learning engine (13) can train on a variety of images derived from a single image by applying various kinds of processing, such as resizing, rotation, flipping, histogram equalization, blurring, gamma transformation, brightness change, and perspective distortion.
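A few of the listed transformations can be sketched on an 8-bit grayscale image stored as nested lists. The function names and parameter values are hypothetical and are not taken from the specification, which does not state how each transformation is computed.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def change_brightness(img, delta):
    """Add a constant to every pixel, clipped to the 8-bit range 0..255."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]


def gamma_transform(img, gamma):
    """Apply out = 255 * (in / 255) ** gamma to every pixel."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in img]


def augment(img):
    """Produce several training variants from one source image."""
    return [
        img,
        flip_horizontal(img),
        change_brightness(img, 10),
        gamma_transform(img, 2.0),
    ]


image = [[0, 128, 255]]
variants = augment(image)
print(len(variants))  # number of images obtained from the single source image
```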

FIG. 10 shows the result of a hand gesture classifier, which uses the convolutional neural network (CNN) structure according to the embodiment, analyzing a hand image. In FIG. 10(a), the gesture candidate region is detected and marked with a solid-line box, and the analysis table of FIG. 10(b) shows the final result of the classifier's analysis of the N feature maps for the candidate region. From the analysis table of FIG. 10(b), the gesture whose probability is at least 0.5 and is the highest in the classifier's final classification result can be taken as the final result. For this candidate region, gesture 1 ranks first with a probability of 99.89% or higher, so the final result is gesture 1. For example, gesture 1 may correspond to 'paper' among the various hand gestures (e.g., scissors, rock, paper).
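The decision rule described above (highest probability, accepted only if it is at least 0.5) can be written as a short sketch. The function name `decide` and the probabilities for gestures 2 and 3 are hypothetical; only the 99.89% figure for gesture 1 comes from the text.

```python
def decide(probabilities, threshold=0.5):
    """Return the most probable gesture, or None if it falls below the threshold."""
    label = max(probabilities, key=probabilities.get)
    return label if probabilities[label] >= threshold else None


# Gesture 1 at 99.89% as in the text; the remaining values are made up.
table = {"gesture 1": 0.9989, "gesture 2": 0.0007, "gesture 3": 0.0004}
print(decide(table))                                  # gesture 1
print(decide({"gesture 1": 0.4, "gesture 2": 0.35}))  # None: no gesture reaches 0.5
```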

FIG. 11 is an overall conceptual diagram of a device control system using gestures according to yet another embodiment.

Referring to FIG. 11, the device control system using gestures may include an image input device (1), a monitor (2), a gesture display area (3), and a gesture recognition device (4).

The image input device (1) acquires a detection image in order to recognize the user's hand shape or hand gesture. For example, the image input device (1) may be implemented as a depth camera, a stereo camera, or a color camera (e.g., a Kinect camera), and may acquire video and still images as detection images. When the detection image is a video, it may consist of a plurality of consecutive frames. The detection image may also include a color image, a depth image, and a color-depth (RGB-C) image.

The monitor (2) displays an image corresponding to the shape, motion, and position of the hand that the user moves in the gesture display area (3). The user can therefore check through the monitor (2) whether the intended hand shape and hand motion are being conveyed to the gesture recognition device (4) within the range of the gesture display area (3), without leaving that area. If a hand shape or hand motion different from what the user intended is conveyed to the gesture recognition device (4) and displayed on the monitor (2), the user can correct the hand shape or hand motion and present it again in the gesture display area (3) so that it is conveyed anew.

The gesture display area (3) is the area in which information about the user's hand shape or hand motion is conveyed, and it is created remotely by user definition. Depending on the embodiment, the user may create the gesture display area (3) by defining, from a self-chosen position (T) rather than a fixed position, an arbitrary area in which control signals are presented remotely.

Referring to FIG. 11, when four monitor coordinates (M1, M2, M3, M4) are displayed on the monitor (2), the user, standing at the self-chosen remote position (T), marks hand-region coordinates (G) at four points in the air by following the four monitor coordinates (M1, M2, M3, M4). The image input device (1) acquires an image in which the hand-region coordinates (G) are marked, and the gesture recognition device (4) analyzes the acquired image to derive the reference coordinates (B1, B2, B3, B4). According to one embodiment, the gesture display area (3) may correspond to the region formed by connecting the reference coordinates (B1, B2, B3, B4). Because the reference coordinates (B1, B2, B3, B4) and the monitor coordinates (M1, M2, M3, M4) are mirrored left-to-right with respect to each other, the hand-region coordinates (G) of the four points obtained from the image are flipped horizontally to generate the reference coordinates (B1, B2, B3, B4). The remote position (T) chosen by the user is defined as a point a certain distance (D) away from the image input device (1) and the monitor (2) that stays within the threshold distance and, at the same time, within the field of view in which the image input device (1) can acquire images.
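
The horizontal flip that turns the hand-region coordinates (G) into reference coordinates can be sketched as follows. This is a minimal illustration assuming pixel coordinates with x running from 0 to `image_width - 1`; the function name `mirror_points`, the image width, and the sample points are hypothetical and do not come from the disclosure.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def mirror_points(hand_points: List[Point], image_width: int) -> List[Point]:
    """Flip x so that left/right in the camera image matches the monitor."""
    return [(image_width - 1 - x, y) for x, y in hand_points]


# Hypothetical hand-region coordinates (G) detected in a 640-pixel-wide image.
g_points = [(500, 100), (140, 100), (140, 380), (500, 380)]
b_points = mirror_points(g_points, image_width=640)
```

Only the x component changes; the y component is kept, since the text describes a left-right inversion only.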

The gesture recognition device (4) analyzes the user's hand shapes, hand motions, and various combinations of these in the images acquired by the image input device (1), and controls various devices according to the analysis result. Depending on the embodiment, the gesture recognition device (4) can accurately recognize hand shapes, hand motions, and their various combinations using a pre-trained hand shape and hand motion classifier, and the classifier may be implemented using the convolutional neural network (CNN) according to the various embodiments described above. The classifier can thus be realized as a hand gesture classifier with high accuracy and fast analysis speed.

In the following description, hand motion (gesture) of the user is used as an example of a body part, but this does not exclude the shape or motion of other body parts such as the face or arms. In addition, the hand gesture described below does not refer to hand motion alone; it is defined to include hand shape as well.

FIG. 12 is a block diagram illustrating the gesture recognition device of FIG. 11, and FIG. 13 is a diagram showing an example of gesture detection according to an embodiment.

Referring to FIG. 12, the gesture recognition device (4) may include a gesture use verification unit (41), a gesture registration unit (43), and a device control unit (45); these components may be implemented in hardware, in software, or in a combination of hardware and software. The gesture recognition device (4) may also include a memory and one or more processors, and the functions of the gesture use verification unit (41), the gesture registration unit (43), and the device control unit (45) may be implemented in the gesture recognition device (4) as a program stored in the memory and executed by the one or more processors.

Referring to FIGS. 11 and 12, the gesture use verification unit (41) verifies whether the user can control peripheral devices with gestures from the remote position (T) that the user has arbitrarily chosen. The gesture use verification unit (41) also computes a transformation matrix representing the correspondence between the monitor (2) and the gesture display area (3).

The remote position (T) may be defined as a position within a certain area in front of the monitor (2) that does not exceed a threshold distance. Accordingly, when the user's current position does not satisfy the defined remote position (T), for example when the current position is beyond the threshold distance or outside the angle at which the image input device (1) can acquire images, the gesture use verification unit (41) may determine that control is not possible from the current position. The description below assumes that the user's current position corresponds to the defined remote position and that controllability has been verified.
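
The position check described above can be sketched as a simple geometric test. The disclosure does not state how distance and angle are measured, so the coordinate convention (lateral offset `x` and forward distance `z` relative to the camera axis), the function name `can_control`, and the numeric limits are all assumptions for illustration.

```python
import math


def can_control(x: float, z: float, max_distance: float, half_fov_deg: float) -> bool:
    """Check whether a user position is usable for gesture control.

    x: lateral offset from the camera axis, z: distance in front of the camera
    (same length unit as max_distance). half_fov_deg is half the horizontal
    field of view of the image input device, in degrees.
    """
    if z <= 0:
        return False  # behind or on the camera plane
    distance = math.hypot(x, z)
    angle = math.degrees(math.atan2(abs(x), z))
    # Both conditions must hold: within the threshold distance and within view.
    return distance <= max_distance and angle <= half_fov_deg


# Hypothetical limits: 3 m threshold distance, 30 degrees half field of view.
inside = can_control(0.0, 2.0, max_distance=3.0, half_fov_deg=30.0)
too_far = can_control(0.0, 4.0, max_distance=3.0, half_fov_deg=30.0)
off_axis = can_control(2.0, 2.0, max_distance=3.0, half_fov_deg=30.0)
```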

Referring to FIGS. 11 and 12, the gesture use verification unit (41) displays four monitor coordinates (M1, M2, M3, M4) on the monitor (2) to prompt the user's gesture. When the user, at the self-chosen remote position (T), marks coordinates (G) at four points in the air by following the four monitor coordinates (M1, M2, M3, M4), the gesture use verification unit (41) analyzes the image of the hand-region coordinates (G) acquired by the image input device (1) and derives the correspondence between the monitor (2) and the gesture display area (3). This correspondence may be realized as, but is not limited to, a transformation matrix (T) between the monitor coordinates (M1, M2, M3, M4) and the reference coordinates (B1, B2, B3, B4) of the gesture display area (3). The transformation matrix (T) allows the coordinates of a hand gesture movement at the remote position (T) to be rendered as coordinates on the monitor (2). For example, letting M be the matrix of the four monitor coordinates (M1, M2, M3, M4) and B the matrix of the detected reference coordinates (B1, B2, B3, B4), the relation M = T × B holds, from which the transformation matrix follows arithmetically as T = M × B⁻¹.
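
One way to evaluate the relation M = T × B numerically is sketched below. The text writes B⁻¹, but with four points in homogeneous 2D coordinates B is a 3×4 matrix and has no ordinary inverse, so this sketch assumes the right pseudo-inverse, T = M Bᵀ (B Bᵀ)⁻¹, which yields a least-squares affine fit. That choice, the helper names, and the sample coordinates are assumptions, not part of the disclosure.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Point = Tuple[float, float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def inverse3(m: Matrix) -> Matrix:
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("reference points are degenerate (e.g. collinear)")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]


def homogeneous(points: Sequence[Point]) -> Matrix:
    """Stack points as columns (x, y, 1) of a 3xN matrix."""
    return [[p[0] for p in points], [p[1] for p in points], [1.0 for _ in points]]


def solve_transform(monitor_pts: Sequence[Point], reference_pts: Sequence[Point]) -> Matrix:
    """Return T such that M is approximately T x B (assumed pseudo-inverse)."""
    m, b = homogeneous(monitor_pts), homogeneous(reference_pts)
    bt = transpose(b)
    return matmul(matmul(m, bt), inverse3(matmul(b, bt)))


def to_monitor(t: Matrix, point: Point) -> Point:
    col = matmul(t, [[point[0]], [point[1]], [1.0]])
    return (col[0][0], col[1][0])


# Hypothetical reference coordinates (B1..B4) and monitor coordinates (M1..M4).
reference = [(100, 100), (300, 100), (300, 250), (100, 250)]
monitor = [(0, 0), (1920, 0), (1920, 1080), (0, 1080)]
t = solve_transform(monitor, reference)
center = to_monitor(t, (200, 175))  # centre of the gesture display area
```

An affine fit cannot represent perspective distortion; if the gesture display area is viewed at a steep angle, a full homography would be a natural alternative, which the text's "not limited to" wording leaves open.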

The gesture registration unit (43) may register, based on the user's selection, the gestures to be used as control signals in the gesture display area (3) and the devices to be controlled by those gestures. The user can control a single device with various control signals by selecting different types of gestures for that device.

The device control unit (45) receives, through the image input device (1), the gesture image presented by the user in the region of the reference coordinates (B1, B2, B3, B4) set by the gesture use verification unit (41), that is, in the gesture display area (3), and detects and classifies the gesture. The device control unit (45) can then control a device according to the classified gesture. The user can trigger an event by making a gesture, can generate control signals such as on, off, sound playback, and volume adjustment through combinations of differently shaped gestures, and can thereby remotely control the target device.
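
The interplay between gesture registration and device control can be sketched as a lookup table. The class name `GestureRegistry`, its methods, and the gesture, device, and command strings are hypothetical; the disclosure describes the roles of the units (43) and (45) but not their data structures.

```python
from typing import Dict, Optional, Tuple


class GestureRegistry:
    """Maps a classified gesture label to a (device, command) pair."""

    def __init__(self) -> None:
        self._commands: Dict[str, Tuple[str, str]] = {}

    def register(self, gesture: str, device: str, command: str) -> None:
        # Role of the gesture registration unit: store the user's selection.
        self._commands[gesture] = (device, command)

    def dispatch(self, gesture: str) -> Optional[Tuple[str, str]]:
        # Role of the device control unit: look up the command for a gesture.
        # Unregistered gestures yield None and trigger no control signal.
        return self._commands.get(gesture)


registry = GestureRegistry()
registry.register("paper", "speaker", "on")
registry.register("rock", "speaker", "off")
registry.register("scissors", "speaker", "volume_up")
action = registry.dispatch("paper")
```

Registering several gestures for the same device, as in the example, corresponds to controlling one device with multiple control signals.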

Depending on the embodiment, the device control unit (45) may implement a detector using the convolutional neural network (CNN) according to the various embodiments described above. Using this detector (classifier), the device control unit (45) can detect and classify hand gestures with high precision in the images acquired by the image input device (1).

Referring to FIG. 12, the device control unit (45) searches for moving parts in the image within the gesture display area (3), that is, within the region of the reference coordinates (B1, B2, B3, B4), out of the entire image acquired by the image input device (1). The device control unit (45) may search for moving parts based on a dense optical flow field that uses the difference between the previous frame and the current frame, and may use an algorithm that extracts optical-flow-based motion vectors (Lucas-Kanade or Gunnar Farneback).

More specifically, the device control unit (45) extracts the magnitude and angle of the motion vectors within a block. For a given block, it checks the magnitudes of the motion vectors, and if the number of vectors whose magnitude exceeds a threshold is greater than a set value, it identifies the block as a moving block. When a block is determined to be one in which motion was in progress and then stopped, the device control unit (45) regards it as a region containing the object to be detected. Depending on the embodiment, when selecting the candidate region containing the detection target, the device control unit (45) may perform detection by treating the region around the topmost block, or the topmost motion-vector coordinates, among the blocks in which motion stopped as the region containing the detection target. This reflects the fact that, in a standing posture, the hand tends to lie in the topmost block region given how the arm and hand move. The device control unit (45) may recognize the gesture by extracting features from the block region judged to contain the detection target, or by detecting candidate regions within the block region using a sliding-window method, and then applying a classification method.
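
The block-level logic above (count strong motion vectors per block, then pick the topmost block whose motion has stopped) can be sketched as follows. The function names, the `(row, col)` block indexing with row 0 at the top of the image, and the sample thresholds are assumptions; the disclosure does not give the block size or the count threshold.

```python
from typing import Dict, List, Optional, Tuple

Block = Tuple[int, int]  # (row, col); row 0 is the top of the image


def is_moving_block(magnitudes: List[float], mag_threshold: float, min_count: int) -> bool:
    """A block is moving when more than min_count vectors exceed mag_threshold."""
    strong = sum(1 for m in magnitudes if m > mag_threshold)
    return strong > min_count


def topmost_stopped_block(prev_moving: Dict[Block, bool],
                          curr_moving: Dict[Block, bool]) -> Optional[Block]:
    """Return the highest block that was moving and has now stopped, if any."""
    stopped = [b for b, was_moving in prev_moving.items()
               if was_moving and not curr_moving.get(b, False)]
    # Tuples compare by row first, so min() picks the topmost block.
    return min(stopped) if stopped else None


# Hypothetical per-block states for two consecutive analysis steps.
prev_state = {(3, 4): True, (1, 5): True, (6, 2): False}
curr_state = {(3, 4): False, (1, 5): False, (6, 2): False}
candidate = topmost_stopped_block(prev_state, curr_state)
```

The returned block would then be passed to feature extraction or a sliding-window search followed by classification, as the paragraph describes.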

Referring to FIG. 13, table (451) shows an example of motion-block coordinates and optical-flow motion vector values. For example, one feature point may be extracted per 10×10 area, and a point may be judged to be moving if the magnitude of its motion vector is 4.5 or more. Screen (453) shows an example of the parts where motion changes, the detection target block, and the detected hand region.
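
The per-point test from FIG. 13 can be sketched by sampling one flow vector per 10×10 cell and keeping points whose magnitude is 4.5 or more. The input format (a dictionary from pixel coordinates to `(dx, dy)` displacement), the choice of the cell's top-left pixel as the sample, and the function name `moving_points` are assumptions; only the 10×10 spacing and the 4.5 threshold come from the text.

```python
import math
from typing import Dict, List, Tuple

Flow = Dict[Tuple[int, int], Tuple[float, float]]  # (x, y) -> (dx, dy)


def moving_points(flow: Flow, step: int = 10,
                  threshold: float = 4.5) -> List[Tuple[int, int, float, float]]:
    """Return (x, y, magnitude, angle_deg) for sampled points judged to be moving."""
    result = []
    for (x, y), (dx, dy) in sorted(flow.items()):
        if x % step or y % step:
            continue  # keep one sample per step x step cell
        magnitude = math.hypot(dx, dy)
        angle = math.degrees(math.atan2(dy, dx)) % 360.0
        if magnitude >= threshold:
            result.append((x, y, magnitude, angle))
    return result


# Hypothetical flow vectors at a few pixels.
sample_flow = {(0, 0): (3.0, 4.0), (10, 0): (1.0, 1.0),
               (5, 5): (9.0, 0.0), (10, 10): (0.0, -4.5)}
points = moving_points(sample_flow)
```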

While this specification contains many features, such features should not be construed as limiting the scope of the invention or of the claims. Features described in this specification in separate embodiments may be combined and implemented in a single embodiment. Conversely, various features described in this specification in a single embodiment may be implemented separately in multiple embodiments, or in any suitable combination.

Although operations are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, in order to achieve the desired results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments. The program components and systems described above may generally be integrated into a single software product or packaged into multiple software products.

The method of the present invention as described above may be implemented as a program and stored on a computer-readable recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, etc.). Since this process can be readily carried out by a person of ordinary skill in the art to which the invention pertains, it is not described in further detail.

The present invention described above may be variously substituted, modified, and changed by a person of ordinary skill in the art to which the invention pertains without departing from its technical idea, and is therefore not limited by the foregoing embodiments and the accompanying drawings.

2: convolutional neural network 21: convolution layer
211: first convolution layer 213: second convolution layer
215: third convolution layer 23: fully connected layer

Claims

1. A gesture classifier for learning parameters of a convolutional neural network for hand gesture detection, comprising:
a convolutional neural network composed of a plurality of convolution layers that perform convolution operations to compute feature maps, and a fully connected layer that classifies a detected image by analyzing the feature maps computed by the plurality of convolution layers; and
a learning engine that trains the convolutional neural network to compute parameters optimized for hand gesture detection,
wherein the plurality of convolution layers comprise: a first convolution layer including a subsampling layer that reduces the size of the feature map computed as a result of a convolution operation based on the detected image; a second convolution layer, implemented as a non-subsampling layer, that repeats the convolution operation based on the output of the first convolution layer; and a third convolution layer including a subsampling layer that reduces the size of the feature map computed as a result of a convolution operation based on the output of the second convolution layer, and
wherein the type, number, and size of the filters that perform the convolution operation are configured independently for each of the plurality of convolution layers, and the type, number, and size of the filters included in the plurality of convolution layers are individually computed through training by the learning engine.

2. The gesture classifier of claim 1,
wherein the first convolution layer and the third convolution layer perform padding so that, when the convolution operation is performed, the size of the output is kept equal to the size of the input, and
wherein the parameters of the padding are configured independently for each of the plurality of convolution layers, and the parameter size of the padding is computed individually for each of the plurality of convolution layers through training by the learning engine.

3. The gesture classifier of claim 2,
wherein the first convolution layer and the third convolution layer
perform normalization and apply a nonlinear function to the feature map computed by the convolution operation, and pass the resulting output to the subsampling layer.

4. The gesture classifier of claim 3,
wherein the second convolution layer
comprises a plurality of convolution layers that perform padding so that the size of the output is kept equal to the size of the input when the convolution operation is performed, perform the normalization on the feature map computed by the convolution operation, and apply the nonlinear function, and
wherein the plurality of convolution layers constituting the second convolution layer are connected in series.

5. The gesture classifier of claim 4,
wherein the number of convolution layers constituting the second convolution layer
is computed through training by the learning engine.

6. The gesture classifier of claim 3,
wherein the second convolution layer comprises:
a first parallel layer that computes a first feature map by applying normalization to the feature map computed by a convolution operation on the output of the first convolution layer;
a second parallel layer including a first layer that sequentially applies normalization and a nonlinear function to the feature map computed by a convolution operation on the output of the first convolution layer, and a second layer that computes a second feature map by applying normalization to the feature map computed by a convolution operation on the output of the first layer, the first layer and the second layer being connected in series;
a fusion layer that performs a sum operation on the first feature map and the second feature map; and
a noise reduction layer that applies a nonlinear function to the output of the fusion layer.

7. The gesture classifier of claim 6,
wherein the second layer performs padding so that the size of the output is kept equal to the size of the input when the convolution operation is performed, and
wherein the parameter size of the padding is computed individually for each of the plurality of convolution layers through training by the learning engine.

8. A gesture recognition device that recognizes a hand gesture and controls a peripheral device, comprising:
a gesture use verification unit that verifies a gesture display area at a remote position designated by a user, and computes a correspondence between the gesture display area and a monitor so that the coordinates of a gesture movement in the gesture display area are displayed at the corresponding coordinates on the monitor;
a gesture registration unit that registers a device to be controlled by the user and a gesture to be used as a control signal; and
a device control unit that detects and analyzes a gesture in a gesture image captured in the gesture display area, and controls the device according to the control command corresponding to the gesture registered by the gesture registration unit,
wherein the gesture display area is a user-defined area in which the user conveys hand gesture information, and is created by user definition at a position spaced a certain distance from the monitor.

9. The gesture recognition device of claim 8,
wherein the correspondence between the gesture display area and the monitor
is computed based on monitor coordinates displayed on the monitor and reference coordinates extracted by analyzing an image of the hand-region coordinates that the user marks in the gesture display area along the monitor coordinates.

10. The gesture recognition device of claim 9,
wherein the device control unit includes a gesture classifier that detects a gesture in the gesture image and analyzes the type of the gesture, and
wherein the gesture classifier is implemented using a gesture detection convolutional neural network that includes learned parameters.

11. The gesture recognition device of claim 10,
wherein the gesture classifier comprises:
a convolutional neural network composed of a plurality of convolution layers that perform convolution operations to compute feature maps, and a fully connected layer that classifies a detected image by analyzing the feature maps computed by the plurality of convolution layers; and
a learning engine that trains the convolutional neural network to compute parameters optimized for hand gesture detection,
wherein the plurality of convolution layers comprise: a first convolution layer including a subsampling layer that reduces the size of the feature map computed based on the detected image; a second convolution layer that includes a non-subsampling layer and repeats the convolution operation based on the output of the first convolution layer; and a third convolution layer including a subsampling layer that reduces the size of the feature map computed based on the output of the second convolution layer, and
wherein the type, number, and size of the filters that perform the convolution operation are configured independently for each of the plurality of convolution layers, and the type, number, and size of the filters included in the plurality of convolution layers are individually computed through training by the learning engine.

12. The gesture recognition device of claim 11,
wherein the first convolution layer and the third convolution layer perform padding so that, when the convolution operation is performed, the size of the output is kept equal to the size of the input, and
wherein the parameters of the padding are configured independently for each of the plurality of convolution layers, and the parameter size of the padding is computed individually for each of the plurality of convolution layers through training by the learning engine.

13. The gesture recognition device of claim 12,
wherein the first convolution layer and the third convolution layer
perform normalization and apply a nonlinear function to the feature map computed by the convolution operation, and pass the resulting output to the subsampling layer.

14. The gesture recognition device of claim 13,
wherein the second convolution layer
comprises a plurality of convolution layers that perform padding so that the size of the output is kept equal to the size of the input when the convolution operation is performed, perform the normalization on the feature map computed by the convolution operation, and apply the nonlinear function, and
wherein the plurality of convolution layers constituting the second convolution layer are connected in series.

15. The gesture recognition device of claim 14,
wherein the number of convolution layers constituting the second convolution layer
is computed through training by the learning engine.

16. The gesture recognition device of claim 13,
wherein the second convolution layer comprises:
a first parallel layer that computes a first feature map by applying normalization to the feature map computed by a convolution operation on the output of the first convolution layer;
a second parallel layer including a first layer that sequentially applies normalization and a nonlinear function to the feature map computed by a convolution operation on the output of the first convolution layer, and a second layer that computes a second feature map by applying normalization to the feature map computed by a convolution operation on the output of the first layer, the first layer and the second layer being connected in series;
a fusion layer that performs a sum operation on the first feature map and the second feature map; and
a noise reduction layer that applies a nonlinear function to the output of the fusion layer.

17. The gesture recognition device of claim 16,
wherein the second layer performs padding so that the size of the output is kept equal to the size of the input when the convolution operation is performed, and
wherein the parameter size of the padding is computed individually for each of the plurality of convolution layers through training by the learning engine.