KR102343963B1

KR102343963B1 - CNN For Recognizing Hand Gesture, and Device control system by hand Gesture

Info

Publication number: KR102343963B1
Application number: KR1020170067019A
Authority: KR
Inventors: 전은솜; 문일현; 권재철
Original assignee: 주식회사 케이티
Priority date: 2017-05-30
Filing date: 2017-05-30
Publication date: 2021-12-24
Also published as: KR20180130869A

Abstract

The present invention relates to a convolutional neural network for detecting hand gestures and to a device control system based on hand gestures. A gesture classifier according to the present invention learns the parameters of a hand gesture detection convolutional neural network and comprises: a convolutional neural network composed of a plurality of convolutional layers that perform convolution operations to produce feature maps, and a fully connected layer that classifies a detection image by analyzing the feature maps produced by the plurality of convolutional layers; and a learning engine that trains the convolutional neural network to calculate parameters optimized for hand gesture detection. The plurality of convolutional layers comprise: a first convolutional layer including a subsampling layer that reduces the size of the feature map obtained by a convolution operation on the detection image; a second convolutional layer, implemented as a non-subsampling layer, that repeats the convolution operation on the output of the first convolutional layer; and a third convolutional layer including a subsampling layer that reduces the size of the feature map obtained by a convolution operation on the output of the second convolutional layer. The type, number, and size of the kernel filters that perform the convolution operations are configured independently for each of the plurality of convolutional layers and are calculated individually for each layer through training by the learning engine.

Description

CNN For Recognizing Hand Gesture, and Device control system by hand Gesture

The present invention relates to a system for controlling a device using hand gestures. More specifically, it relates to a hand gesture-based device control system in which a convolutional neural network (hereinafter "CNN") structure optimized for extracting hand gesture features is designed, and a classifier having that CNN structure is used to classify hand gestures and control peripheral devices.

Recently, there has been active research on the natural user interface (NUI), which moves away from input devices such as the mouse and keyboard, recognizes gestures (natural human movements), and uses the recognition result to enable communication between a user and a computing device.

Gesture recognition technology can be broadly divided into rule-based recognition and learning-based recognition. Rule-based recognition sets a threshold measured from the center of the palm and recognizes a hand shape according to the number of fingertips that exceed the threshold. Learning-based recognition acquires a database of the hand shapes to be recognized and recognizes hand shapes through a model generated by training on that database.

With rule-based recognition, it is difficult to determine the optimal threshold (r) because hand size differs from person to person. When the environment changes, the threshold may have to be reset in order to find the optimal threshold (r), and if the chosen threshold (r) is not optimal, the recognition rate drops and performance degrades. In addition, rule-based recognition is more limited than learning-based recognition in the variety of hand shapes it can recognize.

Learning-based recognition is based on deep learning, which clusters or classifies a plurality of data by means of a learning structure designed to classify gestures accurately. In the field of object recognition in particular, a kind of deep learning called the convolutional neural network (hereinafter "CNN") is in the spotlight. A CNN is a model that mimics human brain function, built on the assumption that when a person recognizes an object, the brain first extracts the object's basic features, then performs complex computations, and recognizes the object based on the result. A CNN basically uses various filters that extract image features through convolution operations, together with pooling or non-linear activation functions that add non-linear characteristics.

However, even when such neural network technology is used, performance varies sharply depending on the types of functions applied and on how the structure of the operations is designed. Designing a CNN appropriately for its purpose is therefore a very important issue that is directly tied to performance.

Korean Patent Application Laid-Open No. 10-2010-0129629, "Method for controlling operation of electronic device by motion detection and device employing the same"

The present invention has been devised to solve the problems of the prior art described above.

An object of the present invention is to provide a hand gesture-based device control system that designs a CNN structure optimized for classifying various hand shapes and hand gestures, and trains the designed CNN to extract its various parameters automatically.

Another object of the present invention is to provide a hand gesture-based device control system that uses a classifier composed of a CNN optimized for classifying hand gestures and of parameters extracted by training, so that even remote, non-contact hand gestures are classified accurately and the performance of gesture-based device control is improved.

Yet another object of the present invention is to provide a hand gesture-based device control system in which the position, the control area, and the device to be controlled are not fixed or preset; instead, the user sets the desired position and control area and can also set the peripheral devices to be controlled and the control signals.

A gesture classifier according to one aspect learns the parameters of a hand gesture detection convolutional neural network and comprises: a convolutional neural network composed of a plurality of convolutional layers that perform convolution operations to produce feature maps, and a fully connected layer that classifies a detection image by analyzing the feature maps produced by the plurality of convolutional layers; and a learning engine that trains the convolutional neural network to calculate parameters optimized for hand gesture detection. The plurality of convolutional layers comprise: a first convolutional layer including a subsampling layer that reduces the size of the feature map obtained by a convolution operation on the detection image; a second convolutional layer, implemented as a non-subsampling layer, that repeats the convolution operation on the output of the first convolutional layer; and a third convolutional layer including a subsampling layer that reduces the size of the feature map obtained by a convolution operation on the output of the second convolutional layer. The type, number, and size of the kernel filters that perform the convolution operations are configured independently for each of the plurality of convolutional layers and are calculated individually for each layer through training by the learning engine.

The first convolutional layer and the third convolutional layer perform padding together with the convolution operation so that the size of the output remains equal to the size of the input. The padding parameter is configured independently for each of the plurality of convolutional layers, and the size of the padding parameter is calculated individually for each layer through training by the learning engine.

The first convolutional layer and the third convolutional layer apply normalization and then a non-linear function to the feature map produced by the convolution operation, and pass the resulting output to the subsampling layer.

The second convolutional layer includes a plurality of convolutional layers, each of which performs padding together with the convolution operation so that the size of the output remains equal to the size of the input, and applies normalization and then a non-linear function to the feature map produced by the convolution operation to generate its output. The plurality of convolutional layers constituting the second convolutional layer are connected in series.

The number of convolutional layers constituting the second convolutional layer is calculated through training by the learning engine.

The second convolutional layer comprises: a first parallel layer that produces a first feature map by normalizing the feature map obtained by a convolution operation on the output of the first convolutional layer; a second parallel layer comprising a first layer, which applies normalization and then a non-linear function to the feature map obtained by a convolution operation on the output of the first convolutional layer, and a second layer, which produces a second feature map by normalizing the feature map obtained by a convolution operation on the output of the first layer, the first layer and the second layer being connected in series; a fusion layer that performs a sum operation on the first feature map and the second feature map; and a noise reduction layer that applies a non-linear function to the output of the fusion layer.

The second layer performs padding together with the convolution operation so that the size of the output remains equal to the size of the input, and the size of the padding parameter is calculated individually for each of the plurality of convolutional layers through training by the learning engine.
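The parallel structure of the second convolutional layer described above (two branches, a sum in the fusion layer, then a non-linear function) can be sketched as follows. This is only an illustration of the data flow: the function names are hypothetical, ReLU is assumed as the non-linear function, and the convolution and normalization steps are passed in as callables, with trivial stand-ins used in the example rather than real trained filters.

```python
def relu(fmap):
    """Non-linear function (ReLU assumed): negative values become 0."""
    return [[max(0.0, v) for v in row] for row in fmap]


def add_maps(a, b):
    """Fusion layer: element-wise sum of two same-sized feature maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def second_layer_block(x, conv_a, conv_b1, conv_b2, norm):
    # First parallel layer: CONV -> NORM gives the first feature map.
    first_map = norm(conv_a(x))
    # Second parallel layer, first layer: CONV -> NORM -> non-linear function.
    hidden = relu(norm(conv_b1(x)))
    # Second parallel layer, second layer: CONV -> NORM gives the second map.
    second_map = norm(conv_b2(hidden))
    # Fusion layer (sum), then the noise reduction layer (non-linear function).
    return relu(add_maps(first_map, second_map))


def scale(factor):
    """Hypothetical stand-in for a size-preserving convolution."""
    return lambda fmap: [[factor * v for v in row] for row in fmap]


def identity(fmap):
    """Hypothetical stand-in for normalization."""
    return fmap


x = [[1.0, -2.0], [3.0, -4.0]]
out = second_layer_block(x, scale(1.0), scale(2.0), scale(0.5), identity)
```

Because both branches keep the feature map size unchanged, the sum in the fusion layer is well defined; this is why the second layer pads its convolution to preserve the input size.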

A gesture recognition device according to another aspect recognizes hand gestures to control peripheral devices and comprises: a gesture use verification unit that verifies a gesture display area at a user-designated remote position and computes the correspondence between the gesture display area and a monitor, so that the coordinates of a gesture movement in the gesture display area are displayed at the corresponding coordinates on the monitor; a gesture registration unit that registers, as designated by the user, a device to be controlled and a gesture to be used as a control signal; and a device control unit that detects and analyzes a gesture in the gesture image captured in the gesture display area and controls the device according to the control command corresponding to the gesture registered by the gesture registration unit. The gesture display area is a user-designated area in which the user conveys hand gesture information, and it is created by user definition at a position spaced a certain distance from the monitor.

The correspondence between the gesture display area and the monitor is calculated on the basis of monitor coordinates displayed on the monitor and reference coordinates extracted by analyzing an image of the hand-region coordinates that the user marks in the gesture display area by following the monitor coordinates.
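One simple way to picture such a correspondence is a per-axis linear mapping between two pairs of matching corner points. The sketch below is an assumption made purely for illustration: the patent text here does not state which mapping is used, and the function name, the two-corner calibration, and the sample coordinates are all hypothetical.

```python
def make_area_to_monitor_mapper(area_corners, monitor_corners):
    """Build a function mapping gesture-area coordinates to monitor coordinates.

    area_corners: ((x0, y0), (x1, y1)) reference coordinates extracted from
        the hand positions, for the top-left and bottom-right corners.
    monitor_corners: the matching monitor coordinates shown to the user.
    Assumes a plain linear (scale and offset) relation on each axis.
    """
    (ax0, ay0), (ax1, ay1) = area_corners
    (mx0, my0), (mx1, my1) = monitor_corners

    def to_monitor(x, y):
        mx = mx0 + (x - ax0) * (mx1 - mx0) / (ax1 - ax0)
        my = my0 + (y - ay0) * (my1 - my0) / (ay1 - ay0)
        return mx, my

    return to_monitor


# Hypothetical calibration: a 200 x 150 pixel hand area mapped to a
# 1920 x 1080 monitor.
to_monitor = make_area_to_monitor_mapper(
    ((100, 50), (300, 200)),
    ((0, 0), (1920, 1080)),
)
center = to_monitor(200, 125)  # the center of the area maps to the screen center
```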

The device control unit includes a gesture classifier that detects a gesture in a gesture image and analyzes the gesture type, and the gesture classifier is implemented using a gesture detection convolutional neural network that includes learned parameters.

The gesture classifier comprises: a convolutional neural network composed of a plurality of convolutional layers that perform convolution operations to produce feature maps, and a fully connected layer that classifies a detection image by analyzing the feature maps produced by the plurality of convolutional layers; and a learning engine that trains the convolutional neural network to calculate parameters optimized for hand gesture detection. The plurality of convolutional layers comprise: a first convolutional layer including a subsampling layer that reduces the size of the feature map calculated from the detection image; a second convolutional layer, including a non-subsampling layer, that repeats the convolution operation on the output of the first convolutional layer; and a third convolutional layer including a subsampling layer that reduces the size of the feature map calculated from the output of the second convolutional layer. The type, number, and size of the kernel filters that perform the convolution operations are configured independently for each of the plurality of convolutional layers and are calculated individually for each layer through training by the learning engine.

With the configuration described above, the present invention has the following effects.

The present invention provides a design structure for a customized CNN optimized for classifying hand shapes and hand gestures.

By training the CNN customized for hand gestures, the present invention can automatically extract the various parameters optimized for hand gesture classification, so that control signals can be generated from the hand shapes and gestures of many different people.

By providing a classifier built on a CNN optimized for classifying hand gestures, the present invention can be expected to classify even remote, non-contact hand gestures accurately and thereby improve the performance of gesture-based device control.

The present invention allows the user to set the desired position and control area, and also to freely set the peripheral devices to be controlled and the control signals.

FIG. 1 is a block diagram showing the configuration of a gesture classifier according to an embodiment.
FIG. 2 is a diagram illustrating the structure of a convolutional neural network (CNN) according to an embodiment.
FIG. 3 is a conceptual diagram illustrating the convolution operation performed by the convolutional layer of FIG. 2.
FIG. 4 is a conceptual diagram illustrating the pooling performed by the subsampling layer of FIG. 2.
FIG. 5 is an exemplary diagram in which the parameters and the number of convolutional layers have been calculated by training the CNN of FIG. 2.
FIG. 6 is a diagram illustrating the structure of a CNN according to another embodiment.
FIG. 7 is an exemplary diagram in which the parameters have been calculated by training the CNN of FIG. 6.
FIG. 8 is an exemplary diagram in which a plurality of second convolutional layers are connected in series in the CNN structure of FIG. 6.
FIG. 9 is an exemplary diagram of training images used to train the CNN according to an embodiment.
FIG. 10 is an exemplary diagram of the final classification result produced by a hand gesture classifier using the CNN structure according to an embodiment.
FIG. 11 is an overall conceptual diagram of a gesture-based device control system according to yet another embodiment.
FIG. 12 is a block diagram illustrating the gesture recognition device of FIG. 11.
FIG. 13 is a diagram showing an example of gesture detection according to an embodiment.

Hereinafter, embodiments of the present invention are described in more detail with reference to the accompanying drawings. The embodiments may be modified in various forms, and the scope of the present invention should not be construed as limited to the embodiments below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art. Although specific terms are used in the drawings and the specification, they are used only to describe the present invention and not to limit its meaning or the scope of the invention set forth in the claims. Those of ordinary skill in the art will therefore understand that various modifications and equivalent embodiments are possible. Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

The convolutional neural network for detecting hand gestures and the hand gesture-based device control system of the present invention are now described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a gesture classifier according to an embodiment.

Referring to FIG. 1, the gesture classifier 1 includes a convolutional neural network 11 and a learning engine 13. These components may be implemented in hardware, in software, or in a combination of hardware and software. The gesture classifier 1 may also include a memory and one or more processors, in which case the functions of the convolutional neural network 11 and the learning engine 13 may be implemented in the gesture classifier 1 as a program stored in the memory and executed by the one or more processors.

The convolutional neural network 11 is trained in depth by the learning engine 13 and, according to an embodiment, can recognize hand gesture images with high precision. The convolutional neural network 11 according to an embodiment is a kind of deep learning used in the field of object recognition and, in particular, may be designed as a CNN (convolutional neural network) structure optimized for recognizing hand gestures or hand shapes.

The learning engine 13 may train the convolutional neural network 11 to calculate its parameters. A hand can open and close, move quickly, rotate, and take on many finger configurations; its shape can change quickly and substantially, and several hand gestures may be used at the same time. Accordingly, a structure for the convolutional neural network 11 is presented according to the embodiment, and the learning engine 13 trains the convolutional neural network 11 to calculate parameters optimized for hand gesture detection, thereby completing a convolutional neural network 11 structure that accurately classifies various hand shapes or hand gestures. Here, the parameters include not only the type, number, and size of the filters (for example, the kernel filters that perform the convolution operations) but also the number of layers described below.

Below, the structure of the convolutional neural network 11 according to various embodiments is described with reference to FIGS. 2 to 8, together with examples of parameters optimized for that structure.

FIG. 2 is a diagram illustrating the structure of a convolutional neural network (CNN) according to an embodiment, FIG. 3 is a conceptual diagram illustrating the convolution operation performed by the convolutional layer of FIG. 2, FIG. 4 is a conceptual diagram illustrating the pooling performed by the subsampling layer of FIG. 2, and FIG. 5 is an exemplary diagram in which the parameters and the number of convolutional layers have been calculated by training the CNN of FIG. 2.

Referring to FIG. 2, the convolutional neural network includes a convolutional layer 21 and a fully connected layer 23.

The convolutional layer 21 performs a convolution operation on an input image using a convolution filter (also called a kernel or mask) and generates a feature map. Here, the convolution operation extracts every possible n × n subregion (or receptive field) from the whole input image, multiplies each value of the n × n subregion by the corresponding element of the convolution filter, which consists of n × n parameters matching the size of the subregion, and sums the products (that is, it takes the inner product of the filter and the subregion). A feature map is image data in which various features of the input image are expressed. The number of feature maps produced does not necessarily correspond to the number of convolution filters and, depending on how the convolution operation is performed, may differ from it.
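The operation described above can be sketched in plain Python. This is only an illustration, assuming a single-channel square image stored as a list of lists, a stride of 1, and no padding; the function name `convolve2d_valid` and the sample values are hypothetical and do not come from the patent.

```python
def convolve2d_valid(image, kernel):
    """Apply an n x n filter to every n x n subregion of an m x m image.

    Each output value is the sum of the element-wise products between the
    filter and one subregion (receptive field). No padding, stride 1.
    """
    m = len(image)
    n = len(kernel)
    out_size = m - (n - 1)
    output = []
    for top in range(out_size):
        row = []
        for left in range(out_size):
            total = 0
            for i in range(n):
                for j in range(n):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4 x 4 input and 3 x 3 filter, chosen only for illustration.
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
feature_map = convolve2d_valid(image, kernel)  # 2 x 2 feature map
```

With this diagonal filter, each output value is simply the sum along the main diagonal of the corresponding 3 × 3 subregion, which makes the result easy to check by hand.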

The convolutional layer 21 includes a plurality of convolutional layers (L_1, L_2, L_3, …, L_N), which can be divided by function into a first convolutional layer (211: L_1), a second convolutional layer (213: L_2, L_3, …, L_{N-1}), and a third convolutional layer (215: L_N).

Referring to FIG. 2, in the convolutional layer 21 according to an embodiment, the first convolutional layer 211 and the third convolutional layer 215 reduce the size of the feature map by subsampling or pooling (POOL), whereas the second convolutional layer 213 does not. Thus, after the convolution operation and pooling are performed in the first convolutional layer 211, the second convolutional layer 213 repeats only the convolution operation several times, without pooling, and is designed so that the number of output feature maps increases. This makes in-depth learning possible while preserving the individual features of the hand gesture images to be learned and classified.
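The effect of this arrangement on feature map size can be traced with a short sketch. The numbers are assumptions for illustration only: pooling is taken to halve each side, every convolution is taken to preserve size through padding, and the 64-pixel input and three repeated convolutions are hypothetical values, not figures from the patent.

```python
def trace_feature_map_sizes(input_size, num_middle_convs, pool=2):
    """Return (stage, side length) pairs for the three-part layer layout."""
    sizes = [("input", input_size)]

    # First convolutional layer: size-preserving CONV, then POOL.
    size = input_size // pool
    sizes.append(("layer 1 (CONV + POOL)", size))

    # Second convolutional layer: CONV repeated without POOL, size unchanged.
    for k in range(num_middle_convs):
        sizes.append((f"layer 2, conv {k + 1} (CONV only)", size))

    # Third convolutional layer: size-preserving CONV, then POOL.
    size = size // pool
    sizes.append(("layer 3 (CONV + POOL)", size))
    return sizes


sizes = trace_feature_map_sizes(64, 3)
for stage, side in sizes:
    print(f"{stage}: {side} x {side}")
```

The trace shows that the repeated convolutions in the middle add depth without shrinking the feature map any further.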

The first convolutional layer 211 includes a convolutional layer L1a, which receives the image to be classified (hereinafter, the detection image) as its input image and performs a convolution operation (CONV) to generate a feature map, and a subsampling layer L1b, which reduces the size of the feature map through sampling or pooling. According to an embodiment, the convolutional layer L1a performs padding together with the convolution operation so that the image size is the same before and after the convolution. The convolutional layer L1a then applies normalization (NORM) and a non-linear function (RELU or PRELU), in that order, to the feature map produced by the convolution operation, and passes the result to the subsampling layer L1b. In other words, the subsampling layer L1b receives a feature map to which normalization and the non-linear function have already been applied.
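The NORM, RELU, and POOL steps that follow the convolution can be sketched as below. The specific choices are assumptions made for illustration: the text does not state which normalization or pooling is used here, so the sketch standardizes each feature map to zero mean and unit variance and uses 2 × 2 max pooling. The function names and sample values are hypothetical.

```python
import math


def normalize(fmap, eps=1e-5):
    """NORM (assumed form): shift to zero mean, scale to unit variance."""
    values = [v for row in fmap for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = math.sqrt(var + eps)
    return [[(v - mean) / scale for v in row] for row in fmap]


def relu(fmap):
    """RELU: negative values become 0."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """POOL (assumed form): keep the maximum of each 2 x 2 block."""
    pooled = []
    for r in range(0, len(fmap) - 1, 2):
        row = []
        for c in range(0, len(fmap[0]) - 1, 2):
            block = (fmap[r][c], fmap[r][c + 1],
                     fmap[r + 1][c], fmap[r + 1][c + 1])
            row.append(max(block))
        pooled.append(row)
    return pooled


def first_layer_tail(conv_output):
    """NORM -> RELU inside L1a, then POOL in L1b, for one feature map."""
    return max_pool_2x2(relu(normalize(conv_output)))


# Hypothetical 4 x 4 feature map standing in for the CONV output.
conv_output = [
    [4.0, -2.0, 0.0, 1.0],
    [1.0, 3.0, -1.0, 2.0],
    [-3.0, 0.0, 5.0, -4.0],
    [2.0, -1.0, 0.0, 1.0],
]
pooled = first_layer_tail(conv_output)  # 2 x 2 feature map
```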

When the input image has size m × m and every n × n sub-region (receptive field) is extracted and convolved (CONV), each output image has size (m - (n - 1)) × (m - (n - 1)). The output of the convolution is therefore smaller than the input by n - 1 in both width and height. For example, applying a convolution over all 3 × 3 sub-regions of a 6 × 6 input yields an output of size (6 - (3 - 1)) × (6 - (3 - 1)) = 4 × 4. Accordingly, in an embodiment, the first convolutional layer 211 applies padding so that the output does not shrink and has the same size as the input. Padding here means using an odd n and adding a border of blank space of thickness [n / 2] to each of the top, bottom, left, and right sides of the input image, where the square brackets denote the Gauss symbol (floor function).
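The size arithmetic above can be checked with a minimal sketch. The function names `conv_output_size` and `same_padding` are hypothetical helpers introduced for illustration, and a stride of 1 is assumed.

```python
def conv_output_size(m, n, pad=0):
    """Side length of the output when an n x n kernel slides over an
    m x m input with stride 1 and `pad` blank pixels added on every side."""
    return m + 2 * pad - (n - 1)


def same_padding(n):
    """Border thickness [n / 2] (floor) that keeps the size unchanged for odd n."""
    if n % 2 == 0:
        raise ValueError("n is assumed to be odd")
    return n // 2


# 6 x 6 input, 3 x 3 kernel, no padding: the output shrinks by n - 1 = 2.
print(conv_output_size(6, 3))                   # 4
# Same input with a border of thickness [3 / 2] = 1: the size is preserved.
print(conv_output_size(6, 3, same_padding(3)))  # 6
```

With a 1 × 1 kernel the required border is [1 / 2] = 0, so no blank space is added.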

The spacing between adjacent sub-regions (receptive fields) is called the stride. When the stride is greater than 1, the width and height of the output are smaller than those of the input. For example, with a stride of 2, the width and height of the output are each half those of the input.
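The effects of kernel size, padding, and stride on the output size can be summarized by the standard output-size relation, given here as a supplement rather than as part of the claimed embodiment. The symbols p (padding thickness per side) and s (stride) are introduced for illustration; m and n are as defined above.

```latex
m_{\text{out}} = \left\lfloor \frac{m + 2p - n}{s} \right\rfloor + 1
```

With p = 0 and s = 1 this reduces to m - (n - 1). With odd n, p = [n / 2], and s = 1 it gives m. With p = [n / 2], s = 2, and even m it gives m / 2, which matches the halving described for a stride of 2.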

The second convolutional layer 213 contains no sub-sampling layer that reduces the size of the feature maps through sampling or pooling. It consists of a plurality of convolution layers (L₂, L₃, …, L_(N-1)), each of which performs a convolution operation (CONV) to generate feature maps. These layers are connected in series so that the output of each layer becomes the input of the next.

Each of the convolution layers (L₂, L₃, …, L_(N-1)) making up the second convolutional layer 213 performs padding together with the convolution operation (CONV) so that the output has the same size as the input, and then applies normalization (NORM) and a nonlinear function (RELU or PRELU), in that order, to the feature maps produced by the convolution to obtain its output feature maps.

The third convolutional layer 215 receives the output of the second convolutional layer 213 as its input and includes a convolution layer L_Na, which performs a convolution operation (CONV) to generate feature maps, and a sub-sampling layer L_Nb, which reduces the size of the feature maps through sampling or pooling. According to an embodiment, the convolution layer L_Na performs padding together with the convolution operation so that the output has the same size as the input, applies normalization (NORM) and a nonlinear function (RELU or PRELU), in that order, to the feature maps produced by the convolution, and passes the result to the sub-sampling layer L_Nb. The sub-sampling layer L_Nb then reduces the size of the image and passes it to the fully connected layer 23.

The fully connected layer 23 analyzes the image data received from the convolutional layer 21 and makes the final decision as to which category it belongs to. The fully connected layer 23 may be implemented as global average pooling or as a fully-connected layer. For example, the fully connected layer 23 may analyze the output of the convolutional layer 21, that is, the feature maps, and finally determine whether the hand gesture shown in the detection image corresponds to scissors, rock, paper, or some other hand shape.

FIG. 3 shows that, according to an embodiment, the first convolution layer (L₁) produces ℓ feature maps (W₁), the second convolution layer (L₂) produces m feature maps (W₂), the third convolution layer (L₃) produces n feature maps (W₃), …, and the N-th convolution layer (L_N) produces g feature maps (W_N). For one input 100, the first convolution layer (L₁) generates ℓ feature maps as the result of its convolution operation, and the second convolution layer (L₂) receives these ℓ feature maps (W₁) and performs a convolution operation to generate m feature maps (W₂). Likewise, the third convolution layer (L₃) receives the m feature maps (W₂) output by the second convolution layer (L₂) and performs a convolution operation to generate n feature maps (W₃), and the convolution operation is repeated in the same way through each subsequent convolution layer (L). When the convolution operation of the last, N-th convolution layer (L_N) is complete, the final g feature maps (W_N) are obtained. The input 100, which is the input image, may have one or more channels. For example, the input 100 has 1 channel when it is an 8-bit image and 3 channels when it is a 32-bit image. Here, the ℓ feature maps (W₁) are image data in which various features of the input 100 are expressed.

The type, number, and size (n×n) of the convolution filters (kernel filters) may differ from one convolution layer (L) to another, and the multiple convolution filters within a single convolution layer (L) may also be of different types. A convolution filter may be a color-related filter, such as red, green, or blue, or a filter with characteristics suited to finding various other features of the hand.

Referring to FIG. 4, according to an embodiment, the first convolutional layer (211: L₁) includes a convolution layer 211a and a sub-sampling layer 211b; the convolution layers from the second through the (N-1)-th (213: L₂, L₃, …, L_(N-1)) consist of convolution layers only, with no sub-sampling layer; and the last, N-th convolutional layer (215: L_N) includes a convolution layer 215a and a sub-sampling layer 215b.

For example, in the first convolutional layer (211: L₁), when the convolution layer 211a uses kernel filters to perform a convolution operation (CONV) on one input 100 and produces three feature maps 101, the sub-sampling layer 211b performs pooling or sampling on the three feature maps 101 to produce feature maps 102 of reduced size. The convolution operation is then repeated on the feature maps 102 in the convolution layers from the second through the (N-1)-th (213: L₂, L₃, …, L_(N-1)), and the number of feature maps increases. In the last convolutional layer (215: L_N), when the convolution layer 215a completes its convolution operation and produces the final g feature maps 100′, the sub-sampling layer 215b performs pooling or sampling on the feature maps 100′ to produce feature maps 100″ of reduced size.

FIG. 5 shows the learning data obtained by the learning engine 13 through a convolutional neural network designed with the structure shown in FIG. 2. The learning data includes not only the parameters optimized for hand gesture detection but also the number of layers (N) in the serially connected convolutional layer 21, that is, the number of layers in the second convolutional layer (213: L₂, L₃, …, L_(N-1)).

Referring to FIG. 5, the convolutional layer 21 consists of a total of 10 layers (N = 10), of which the second convolutional layer 213 consists of 8 layers connected in series.

According to an embodiment, in the first convolutional layer (211: L₁), the convolution layer L_1a performs a convolution operation (CONV) with a 3×3 kernel filter (ker 3), then applies normalization (NORM) and a nonlinear function (PRELU), in that order, to the resulting maps, producing 16 final feature maps (out 16). During the convolution, the convolution layer L_1a also performs padding (e.g., zero padding, n = 1), adding blank borders of thickness [n/2 = 1/2] to the top, bottom, left, and right of the feature map resulting from the convolution. The sub-sampling layer L_1b then reduces the size of the feature maps by max pooling with a 3×3 window (ker 3) and a spacing of 2 between adjacent receptive fields (stride 2). Thus, for one input (detection image), the first convolutional layer 211 outputs 16 feature maps.
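A minimal sketch of max pooling with a 3×3 window and stride 2, as described for the sub-sampling layer, may help make the size reduction concrete. The function name `max_pool2d` is hypothetical, the feature map is a plain nested list, and it is assumed that pooling uses no padding, so windows that do not fit entirely inside the map are dropped.

```python
def max_pool2d(fmap, ker=3, stride=2):
    """Max pooling over one 2-D feature map (list of rows).

    Assumption: no padding; only windows lying fully inside the map are used.
    """
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - ker + 1, stride):
        pooled_row = []
        for c in range(0, cols - ker + 1, stride):
            # Collect the ker x ker window whose top-left corner is (r, c).
            window = [fmap[r + i][c + j] for i in range(ker) for j in range(ker)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Toy 5 x 5 feature map holding the values 0..24 in row-major order.
fmap = [[r * 5 + c for c in range(5)] for r in range(5)]
print(max_pool2d(fmap))  # [[12, 14], [22, 24]]
```

The 5 × 5 map is reduced to 2 × 2 while the number of feature maps is unchanged, since pooling is applied to each map independently.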

The second convolutional layer (213: L₂, L₃, …, L₉) performs the convolution operation repeatedly. That is, the second through ninth convolution layers (213: L₂, L₃, …, L₉) are connected in series so that the output of each layer is fed as the input of the following layer, and each applies the convolution operation (CONV), normalization (NORM), and a nonlinear function (PRELU), in that order, to produce its output feature maps.

According to an embodiment, the second layer (L₂) performs a convolution operation (CONV) with a 1×1 kernel filter (ker 1) and padding (n = 0) to generate 16 feature maps (out 16). The third layer (L₃) uses a 1×1 kernel filter (ker 1) with padding (n = 0) to generate 32 feature maps (out 32). The fourth layer (L₄) uses a 1×1 kernel filter (ker 1) with padding (n = 0) to generate 64 feature maps (out 64). The fifth layer (L₅) uses a 3×3 kernel filter (ker 3) with padding (n = 1) to generate 64 feature maps (out 64). The sixth layer (L₆) uses a 1×1 kernel filter (ker 1) with padding (n = 0) to generate 64 feature maps (out 64). The seventh layer (L₇) uses a 3×3 kernel filter (ker 3) with padding (n = 1) to generate 128 feature maps (out 128). The eighth layer (L₈) uses a 1×1 kernel filter (ker 1) with padding (n = 0) to generate 128 feature maps (out 128). The ninth layer (L₉) uses a 3×3 kernel filter (ker 3) with padding (n = 1) to generate 256 feature maps (out 256). These parameters are learning data obtained by the learning engine 13 through a convolutional neural network designed with the structure shown in FIG. 2, and represent an embodiment optimized for hand gesture detection.
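To give a sense of scale, the per-layer figures above can be tabulated and the number of convolution weights estimated. This is a sketch under stated assumptions: each kernel filter is assumed to span all input feature maps (an ordinary dense convolution), and bias, normalization, and PRELU parameters are ignored. The names `SECOND_STAGE` and `count_weights` are hypothetical.

```python
# (layer name, kernel size, output feature maps) as listed for L2..L9.
SECOND_STAGE = [
    ("L2", 1, 16), ("L3", 1, 32), ("L4", 1, 64), ("L5", 3, 64),
    ("L6", 1, 64), ("L7", 3, 128), ("L8", 1, 128), ("L9", 3, 256),
]


def count_weights(layers, in_channels):
    """Return per-layer rows and the total number of convolution weights.

    Assumption: weights per layer = ker * ker * in_channels * out_channels.
    """
    rows, total = [], 0
    for name, ker, out_channels in layers:
        weights = ker * ker * in_channels * out_channels
        rows.append((name, in_channels, out_channels, weights))
        total += weights
        in_channels = out_channels  # serial connection: output feeds the next layer
    return rows, total


# The first convolutional layer outputs 16 feature maps, so L2 sees 16 inputs.
rows, total = count_weights(SECOND_STAGE, in_channels=16)
for name, cin, cout, weights in rows:
    print(f"{name}: {cin:>3} -> {cout:>3} maps, {weights} weights")
print("total:", total)  # total: 428800
```

Under these assumptions the 1×1 layers contribute comparatively few weights, and most of the total comes from the 3×3 layers L₅, L₇, and L₉.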

Accordingly, the convolution operation is performed repeatedly in the second convolutional layer (213: L₂, L₃, …, L₉). When each of the convolution layers (213: L₂, L₃, …, L₉) performs its convolution operation, it also performs padding (e.g., zero padding) that adds blank borders of thickness [n/2] to the top, bottom, left, and right of the feature map resulting from the convolution. According to the embodiment, the second convolutional layer 213 repeats only the convolution operation, with no pooling or sampling.

In the last, tenth convolutional layer (L₁₀), that is, the third convolutional layer 215, the convolution layer L_10a takes the output (feature maps) of the second convolutional layer 213 as its input and performs a convolution operation with 256 kernel filters (W₁₀), then applies normalization (NORM) and a nonlinear function (PRELU), in that order, to the output of the convolution operation (CONV) to produce 256 final output feature maps. During the convolution, the convolution layer L_10a also performs padding (e.g., zero padding, n = 1), adding blank borders of thickness [n/2 = 1/2] to the top, bottom, left, and right of the feature map resulting from the convolution. The sub-sampling layer L_10b then reduces the size of the feature maps by average pooling (POOL) with a 5×5 window (ker 5) and a spacing of 2 between adjacent receptive fields (stride 2).

For one detection image, the final output of the convolutional layer 21 is 256 feature maps. These feature maps are passed to the fully connected layer (23: FC), where they serve as the basis for determining which category the detection image belongs to.
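The spatial size of the feature maps through the ten-layer embodiment can be traced with a short sketch. The input side length of 64 pixels is a hypothetical value chosen only for illustration, since no input size is stated here. It is further assumed that every convolution preserves the size through padding, as described above, and that both pooling steps use no padding. The names `pool_output_size` and `trace_sizes` are hypothetical.

```python
def pool_output_size(m, ker, stride):
    """Side length after unpadded pooling with a ker x ker window."""
    return (m - ker) // stride + 1


def trace_sizes(input_size):
    """Return (size after L1 pooling, size after L10 pooling).

    Assumptions: convolutions in L1..L10 keep the size unchanged (padding),
    L1 pools with ker 3 / stride 2, L10 pools with ker 5 / stride 2.
    """
    after_l1 = pool_output_size(input_size, ker=3, stride=2)
    # L2..L9 and the convolution of L10 leave the size unchanged.
    after_l10 = pool_output_size(after_l1, ker=5, stride=2)
    return after_l1, after_l10


# Hypothetical 64 x 64 input image.
print(trace_sizes(64))  # (31, 14)
```

Under these assumptions, a 64 × 64 input would reach the fully connected layer as 256 feature maps of 14 × 14 each.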

The learning engine 13 learns through the convolutional neural network with the design structure shown in FIG. 2 and, as shown in FIG. 5, can determine the number of layers (N = 10) making up the convolutional layer 21 that is best suited to hand gesture classification. The type, number (out), and size (ker) of the kernel filters (W) that perform the convolution operation need not be the same across the convolutional layers (L₁, L₂, L₃, …, L₉, L₁₀) and can be configured independently for each layer. As reviewed above, these kernel filter parameters can be obtained individually for each of the convolutional layers (L₁, L₂, L₃, …, L₉, L₁₀) through the learning of the learning engine 13. The padding parameter (n) used during the convolution operation can likewise be obtained individually for each layer through that learning.

FIG. 6 shows the structure of a convolutional neural network (CNN) according to another embodiment, FIG. 7 shows an example of the parameters obtained by training the convolutional neural network (CNN) of FIG. 6, and FIG. 8 shows an example in which a plurality of second convolutional layers are connected in series in the structure of the convolutional neural network (CNN) of FIG. 6.

Referring to FIG. 6, the convolutional neural network according to another embodiment includes a convolutional layer (21: 211, 213, 215) and a fully connected layer 23.

The convolutional layer 21 includes five convolution layers (L₁, L₂, L₃, L₄, L₅), which can be divided by function into a first convolutional layer (211: L₁), a second convolutional layer (213: L₂, L₃, L₄), and a third convolutional layer (215: L₅).

The convolutional neural network shown in FIG. 6 may be designed so that only the second convolutional layer 213 differs from the network shown in FIG. 2, with the rest of the configuration the same. That is, the first convolutional layer (211: L₁) and the third convolutional layer (215: L₅) of FIG. 6 correspond to the first convolutional layer (211: L₁) and the third convolutional layer (215: L_N) of FIG. 2, respectively, although the parameters of those layers may come out differently through the learning of the learning engine 13. Thus, referring to FIG. 6, the first convolutional layer (211: L₁) and the third convolutional layer (215: L₅) perform a subsampling or pooling process that reduces the size of the feature maps, whereas the second convolutional layer (213: L₂, L₃, L₄) does not.

Compared with FIG. 2, the convolutional neural network designed as in FIG. 6 raises processing speed by reducing the number of convolution operations. It does not merely reduce that number, however: a feature map that has passed through one convolution operation and a feature map that has passed through two convolution operations are merged into a single map, and the merged feature map is used in the next stage, so that the accuracy of hand gesture recognition is also maintained. Since the first convolutional layer (211: L₁) and the third convolutional layer (215: L₅) of FIG. 6 correspond to the first convolutional layer (211: L₁) and the third convolutional layer (215: L_N) of FIG. 2, respectively, the following description focuses on the second convolutional layer (213: L₂, L₃, L₄).

The second convolutional layer (213: L₂, L₃, L₄) includes a first parallel layer 2131, a second parallel layer (2133: 2133a, 2133b), a fusion layer 2135, and a noise reduction layer 2137. The first parallel layer 2131 and the second parallel layer (2133: 2133a, 2133b) are connected in parallel.

The first parallel layer 2131 takes the output of the first convolutional layer (211: L₁) as its input, performs a convolution operation (CONV), and applies normalization (NORM) to the resulting feature map to produce a first feature map.

The second parallel layer (2133: 2133a, 2133b) includes a first layer 2133a and a second layer 2133b. The first layer 2133a takes the output of the first convolutional layer (211: L₁) as its input, performs a convolution operation (CONV), and applies normalization (NORM) and a nonlinear function (e.g., PRELU or RELU), in that order, to the resulting feature map. The second layer 2133b takes the output of the first layer 2133a as its input, performs a convolution operation (CONV), and applies normalization (NORM) to the resulting feature map to produce a second feature map. The first layer 2133a and the second layer 2133b are connected in series.

The fusion layer 2135 performs a fusion (concatenation) operation on the first feature map and the second feature map to generate a single map. According to an embodiment, the first and second feature maps have the same width and height, but when the number of convolution operations and the output size used to extract each of them differ, their three-dimensional sizes may differ. Fusing, or concatenating, the first and second feature maps, which are obtained through structures with different numbers of convolution operations, makes it possible to extract features of a single image from several aspects. In other words, because features are extracted with different convolution operations (different numbers of operations and different weights), multi-aspect feature extraction is possible for a single input, which in turn can improve hand gesture classification performance. In addition, because the first and second feature maps are each normalized (NORM) before being combined, the characteristics of each are preserved.
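The fusion step can be sketched as follows, assuming that it amounts to stacking the two groups of feature maps along the channel axis, which is one common reading of a fusion or concatenation layer. The function name `concat_channels` is hypothetical, and feature maps are represented as nested lists of shape channels × height × width.

```python
def concat_channels(first, second):
    """Stack two groups of feature maps along the channel axis.

    Assumption: both groups share the same height and width; only the
    number of channels (the third dimension) may differ.
    """
    shape_first = (len(first[0]), len(first[0][0]))
    shape_second = (len(second[0]), len(second[0][0]))
    if shape_first != shape_second:
        raise ValueError("feature maps must share height and width")
    return first + second


def zeros(channels, height, width):
    """Helper that builds a channels x height x width block of zeros."""
    return [[[0.0] * width for _ in range(height)] for _ in range(channels)]


first_map = zeros(2, 3, 3)    # e.g. output of the one-convolution branch
second_map = zeros(4, 3, 3)   # e.g. output of the two-convolution branch
fused = concat_channels(first_map, second_map)
print(len(fused), len(fused[0]), len(fused[0][0]))  # 6 3 3
```

The fused block keeps the shared width and height, and its channel count is the sum of the two branches, so both sets of features reach the next stage unchanged.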

The noise reduction layer 2137 applies a nonlinear function (e.g., PRELU) to the feature map obtained by fusing the first and second feature maps. The noise reduction layer 2137 passes its output to the third convolutional layer (215: L₅), which performs the convolution operation and pooling.

FIG. 7 shows the learning data obtained by the learning engine 13 through a convolutional neural network designed with the structure shown in FIG. 6; the learning data consists of parameters optimized for hand gesture detection.

As shown in FIGS. 2 and 5, repeating the convolution operation by chaining a series of convolution layers enables deep learning that raises accuracy, but the larger amount of computation may slow processing. Conversely, if the number of convolution operations (the learning depth), that is, the number of kernel filters and feature maps, is reduced somewhat in a learning structure such as that of FIGS. 2 and 5, processing becomes faster, but it may no longer be possible to learn to extract and classify feature values accurately, and the accuracy of hand gesture classification may fall. Accordingly, FIG. 6 shows, as another embodiment, a design structure that can improve processing speed without greatly lowering classification accuracy, and FIG. 7 shows parameters with which hand gestures can be classified with high accuracy when learning through this design structure. FIG. 8 shows, as yet another embodiment, a convolutional neural network designed by connecting the second convolutional layer 213 of the embodiment of FIG. 6 in series multiple times.

Referring to FIG. 8, the convolutional neural network according to yet another embodiment connects the structure of the second convolutional layer (213: L₂, L₃, L₄) of the embodiment shown in FIG. 6 in series multiple times, while the first convolutional layer (211: L₁) and the third convolutional layer (215: L₅) are configured in the same way.

Because the convolutional neural network shown in FIG. 8 repeats the second convolutional layer (213: L₂, L₃, L₄) multiple times, it is effective for classifying a large number of objects (e.g., hand gestures) with similar shapes or features. That is, a structure in which the second convolutional layer (213: L₂, L₃, L₄) is inserted several times may take longer for learning and classification than a structure in which it is inserted once, but it allows deeper learning and is therefore effective for classifying more kinds of objects.

FIG. 9 shows examples of training images used to train the convolutional neural network (CNN) according to an embodiment, and FIG. 10 shows an example of the final classification result produced by a hand gesture classifier that uses the convolutional neural network (CNN) structure according to an embodiment.

The images shown in FIG. 9 are examples of the inputs used for training when the learning engine 13 learns through the convolutional neural network designed according to the embodiment. The learning engine 13 can learn from a variety of images obtained by applying different kinds of processing to a single image, such as scaling, rotation, flipping, histogram equalization, blurring, gamma transformation, brightness change, and perspective distortion.
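A few of the listed transformations can be sketched on a grayscale image stored as a nested list of 0 to 255 pixel values. This is an illustrative sketch only: the function names and the specific parameter values (a brightness offset of 40 and a gamma of 0.5) are assumptions, not values taken from the embodiment.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def change_brightness(img, delta):
    """Add `delta` to every pixel, clamping the result to 0..255."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]


def gamma_transform(img, gamma):
    """Apply p -> 255 * (p / 255) ** gamma to every pixel."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in img]


def augment(img):
    """Return the original image plus three varied copies of it."""
    return [
        img,
        flip_horizontal(img),
        change_brightness(img, 40),   # hypothetical offset
        gamma_transform(img, 0.5),    # hypothetical gamma
    ]


image = [[0, 64, 255], [32, 128, 250]]
variants = augment(image)
print(len(variants))   # 4
print(variants[1])     # [[255, 64, 0], [250, 128, 32]]
```

Each variant keeps the gesture label of the source image, so one labeled image yields several training inputs.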

FIG. 10 shows the result of a hand gesture classifier, which uses the convolutional neural network (CNN) structure according to an embodiment, analyzing a hand image. Referring to FIG. 10(a), a gesture candidate region is detected and marked with a solid-line box, and the analysis table of FIG. 10(b) shows the final result of the classifier analyzing N feature maps for the candidate region. According to the analysis table of FIG. 10(b), the gesture that has the highest probability in the classifier's final classification result, provided that probability is 0.5 or more, can be taken as the final result. Because gesture 1 ranks first for the candidate region with a probability of 99.89% or more, the final result can be determined to be gesture 1. For example, gesture 1 may correspond to 'paper' among various hand gestures (e.g., scissors, rock, paper).
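The decision rule described for FIG. 10(b) can be sketched as below. The function name and the probabilities of the lower-ranked gestures are hypothetical; only the 0.5 threshold and the 99.89% value for gesture 1 come from the description.

```python
def decide_gesture(probabilities, threshold=0.5):
    """Return the most probable gesture if its probability reaches the threshold.

    Returns None when no gesture is probable enough to be accepted.
    """
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] >= threshold:
        return best
    return None


# Gesture 1 uses the value from FIG. 10(b); the other two are made up.
scores = {"gesture 1": 0.9989, "gesture 2": 0.0008, "gesture 3": 0.0003}
print(decide_gesture(scores))
```

Requiring the top score to reach 0.5 lets the classifier reject a candidate region when no gesture clearly dominates, instead of always returning the best of several weak guesses.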

FIG. 11 is an overall conceptual diagram of a device control system using gestures according to yet another embodiment.

Referring to FIG. 11, the device control system using gestures may include an image input device 1, a monitor 2, a gesture display area 3, and a gesture recognition device 4.

The image input device 1 acquires a detection image in order to recognize the user's hand shape or hand gesture. For example, the image input device 1 may be implemented as a depth camera, a stereo camera, or a color camera (e.g., a Kinect camera), and it may acquire video and still images as detection images. When the detection image is a video, it may consist of a plurality of consecutive frames. The detection image may also include a color image, a depth image, and a color-depth (RGB-C) image.

The monitor 2 displays an image corresponding to the shape, motion, and position of the hand that the user moves within the gesture display area 3. The user can therefore confirm through the monitor 2 whether the intended hand shape and hand gesture are delivered to the gesture recognition device 4 while staying within the range of the gesture display area 3. If a hand shape or hand gesture different from what the user intended is delivered to the gesture recognition device 4 and displayed on the monitor 2, the user can correct the hand shape or hand gesture and present it again in the gesture display area 3 so that it is delivered again.

The gesture display area 3 is the area through which information about the user's hand shape or hand gesture is conveyed, and it is created remotely by user definition. According to an embodiment, the user may create the gesture display area 3 by defining an arbitrary area for presenting control signals remotely, at a location T that the user designates rather than at a fixed location.

Referring to FIG. 11, when four monitor coordinates (M1, M2, M3, M4) are displayed on the monitor 2, the user, standing at the self-designated remote location T, marks hand region coordinates G at four points in the air that follow the four monitor coordinates (M1, M2, M3, M4). The image input device 1 acquires an image in which the hand region coordinates G are marked, and the gesture recognition device 4 analyzes that image to derive the reference coordinates (B1, B2, B3, B4). According to one embodiment, the gesture display area 3 may correspond to the area formed by connecting the reference coordinates (B1, B2, B3, B4). Because the reference coordinates (B1, B2, B3, B4) and the monitor coordinates (M1, M2, M3, M4) are mirrored left to right relative to each other, the reference coordinates (B1, B2, B3, B4) are generated by flipping horizontally the hand region coordinates G of the four points obtained from the image. The remote location T designated by the user is a point at a certain distance D from the image input device 1 and the monitor 2. It is defined as a location that is neither too near nor too far with respect to the threshold distance and that, at the same time, stays within the angle of view over which the image input device 1 can acquire images.
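The horizontal flip that turns the hand region coordinates G into the reference coordinates can be sketched as follows. The pixel coordinate convention (x increasing to the right, starting at 0), the 640-pixel image width, the sample points, and the function name are all assumptions made for illustration; the description only states that the coordinates are mirrored left to right.

```python
def mirror_horizontally(points, image_width):
    """Flip (x, y) pixel coordinates left to right within an image of the given width."""
    return [(image_width - 1 - x, y) for (x, y) in points]


# Hypothetical hand region coordinates G detected in a 640-pixel-wide image.
hand_points = [(100, 50), (300, 50), (300, 150), (100, 150)]
reference_points = mirror_horizontally(hand_points, image_width=640)
print(reference_points)
```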

The gesture recognition device 4 analyzes the user's hand shapes, hand gestures, and their various combinations in the image acquired by the image input device 1, and controls various devices according to the analysis result. According to an embodiment, the gesture recognition device 4 can accurately recognize hand shapes, hand gestures, and their various combinations using a pre-trained hand shape and hand gesture classifier, and the classifier may be implemented using the convolutional neural network (CNN) according to the various embodiments described above. The classifier can therefore be realized as a hand gesture classifier with high accuracy and fast analysis speed.

In the following, hand gestures, which involve one part of the user's body, are described as an example, but shapes or motions of the face, the arms, or various other body parts are not excluded. In addition, the term hand gesture as used below does not refer only to the hand motion itself; it is defined to include the hand shape as well.

FIG. 12 is a block diagram illustrating the gesture recognition device of FIG. 11, and FIG. 13 is a diagram showing an example of gesture detection according to an embodiment.

Referring to FIG. 12, the gesture recognition device 4 may include a gesture use verification unit 41, a gesture registration unit 43, and a device control unit 45. These components may be implemented in hardware, in software, or in a combination of hardware and software. The gesture recognition device 4 may also include a memory and one or more processors, and the functions of the gesture use verification unit 41, the gesture registration unit 43, and the device control unit 45 may be implemented in the gesture recognition device 4 as a program that is stored in the memory and executed by the one or more processors.

Referring to FIGS. 11 and 12, the gesture use verification unit 41 verifies whether peripheral devices can be controlled with gestures from the remote location T arbitrarily designated by the user. The gesture use verification unit 41 also computes a transformation matrix that represents the correspondence between the monitor 2 and the gesture display area 3.

The remote location T may be defined as a location within a certain area in front of the monitor 2 that does not go beyond the threshold distance. Accordingly, when the user's current location does not satisfy this definition of the remote location T, for example when the current location is beyond the threshold distance or outside the angle over which the image input device 1 can acquire images, the gesture use verification unit 41 may determine that control is not possible from the current location. The description below assumes that the user's current location corresponds to the defined remote location and that controllability has been verified.
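The check on the remote location can be sketched as a distance test combined with an angle-of-view test. All names and numbers here are assumptions: the description gives no threshold values, no camera angle, and no coordinate convention, so the sketch uses a lateral offset and forward distance in meters measured from the camera.

```python
import math


def can_control_from(position, max_distance, half_view_angle_deg, min_distance=0.0):
    """Return True if a position is close enough and inside the camera's view.

    position is (x, z): lateral offset and forward distance from the camera
    in meters (a hypothetical convention chosen for this sketch).
    """
    x, z = position
    if z <= 0:
        return False                      # behind or beside the camera plane
    distance = math.hypot(x, z)
    angle = math.degrees(math.atan2(abs(x), z))
    return min_distance <= distance <= max_distance and angle <= half_view_angle_deg


print(can_control_from((0.0, 2.0), max_distance=3.0, half_view_angle_deg=30.0))
print(can_control_from((0.0, 4.0), max_distance=3.0, half_view_angle_deg=30.0))
print(can_control_from((2.0, 1.0), max_distance=3.0, half_view_angle_deg=30.0))
```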

Referring to FIGS. 11 and 12, the gesture use verification unit 41 displays four monitor coordinates (M1, M2, M3, M4) on the monitor 2 to prompt the user's gesture. When the user, at the self-designated remote location T, marks coordinates G at four points in the air that follow the four monitor coordinates (M1, M2, M3, M4), the gesture use verification unit 41 analyzes the image of the hand region coordinates G acquired by the image input device 1 and derives the correspondence between the monitor 2 and the gesture display area 3. This correspondence may be realized as, but is not limited to, a transformation matrix T between the monitor coordinates (M1, M2, M3, M4) and the reference coordinates (B1, B2, B3, B4) of the gesture display area 3. The transformation matrix T allows the coordinates of a hand gesture moving at the remote location T to be expressed as coordinates on the monitor 2. For example, if M is the matrix of the four monitor coordinates (M1, M2, M3, M4) and B is the matrix of the detected reference coordinates (B1, B2, B3, B4), then M = T × B, and the transformation matrix follows arithmetically as T = M × B⁻¹.
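One way to compute such a matrix from four point pairs is sketched below. It is not the method of the embodiment: the description writes T = M × B⁻¹ without stating the shape of M and B, so this sketch assumes homogeneous coordinates, which makes M and B 3×4 matrices, and replaces the inverse with the right pseudo-inverse Bᵀ(B Bᵀ)⁻¹ because a 3×4 matrix has no ordinary inverse. The coordinate values and all function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inverse_3x3(m):
    """Invert a 3x3 matrix using the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("points are degenerate (for example, collinear)")
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]


def homogeneous(points):
    """Stack (x, y) points as the columns (x, y, 1) of a 3xN matrix."""
    return [[float(p[0]) for p in points],
            [float(p[1]) for p in points],
            [1.0] * len(points)]


def solve_transform(monitor_points, reference_points):
    """Least-squares T (3x3) such that M is approximately T x B."""
    m = homogeneous(monitor_points)
    b = homogeneous(reference_points)
    bt = transpose(b)
    pseudo_inverse = matmul(bt, inverse_3x3(matmul(b, bt)))   # 4x3
    return matmul(m, pseudo_inverse)


def to_monitor(t, point):
    """Map one point from the gesture display area to monitor coordinates."""
    col = matmul(t, [[float(point[0])], [float(point[1])], [1.0]])
    return (col[0][0], col[1][0])


# Hypothetical values: a 1920x1080 monitor and a 200x100 gesture display area.
monitor = [(0, 0), (1920, 0), (1920, 1080), (0, 1080)]
reference = [(100, 50), (300, 50), (300, 150), (100, 150)]
T = solve_transform(monitor, reference)
print(to_monitor(T, (200, 100)))   # center of the area, near the monitor center
```

The pseudo-inverse yields an affine map, which is exact when the gesture display area is a scaled and shifted copy of the monitor rectangle; a perspective (homography) fit would be the alternative if the four marked points form a skewed quadrilateral.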

The gesture registration unit 43 can register, based on the user's selection, the gestures to be used as control signals in the gesture display area 3 and the devices to be controlled by those gestures. The user can control a single device with a variety of control signals by selecting a different gesture type for each of them.
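The registration step can be pictured as a lookup table from gestures to device commands. The sketch below is purely illustrative: the class name, the gesture labels, and the device and command names are hypothetical, and the description does not specify how the gesture registration unit 43 stores its entries.

```python
class GestureRegistry:
    """Maps a gesture label to the (device, command) pair chosen by the user."""

    def __init__(self):
        self._bindings = {}

    def register(self, gesture, device, command):
        """Bind one gesture to one command on one device."""
        self._bindings[gesture] = (device, command)

    def lookup(self, gesture):
        """Return the (device, command) pair, or None for an unregistered gesture."""
        return self._bindings.get(gesture)


registry = GestureRegistry()
# Several gestures bound to the same device give it several control signals.
registry.register("paper", "speaker", "on")
registry.register("rock", "speaker", "off")
registry.register("scissors", "speaker", "volume up")
print(registry.lookup("rock"))
```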

The device control unit 45 receives, through the image input device 1, the gesture image presented by the user within the area of the reference coordinates (B1, B2, B3, B4) set by the gesture use verification unit 41, that is, within the gesture display area 3, and can detect and classify the gesture. The device control unit 45 can also control a device according to the classified gesture. The user can trigger an event by making a gesture, can generate control signals such as On, Off, sound playback, and volume control through combinations of gestures of various shapes, and can thereby control the target device remotely.

According to an embodiment, the device control unit 45 may implement a detector using the convolutional neural network (CNN) according to the various embodiments described above. Using this detector (classifier), the device control unit 45 can detect and classify hand gestures with high precision in the image acquired by the image input device 1.

Referring to FIG. 12, the device control unit 45 searches for moving parts in the portion of the image acquired by the image input device 1 that lies within the gesture display area 3, that is, within the area of the reference coordinates (B1, B2, B3, B4). The device control unit 45 can search for moving parts of the image based on dense optical flow, which uses the difference between the previous frame and the current frame, and it may use an algorithm that extracts optical-flow-based motion vectors (Lucas-Kanade or Gunnar Farneback).

More specifically, the device control unit 45 extracts the magnitude and angle of the motion vectors within a block. For any given block, it checks the magnitudes of the motion vectors and identifies the block as a moving block when the number of vectors whose magnitude exceeds a threshold is greater than a certain value. When a block is determined to be one in which motion proceeded and then stopped, the device control unit 45 regards it as an area that contains the target to be detected. According to an embodiment, when selecting the candidate area that contains the detection target, the device control unit 45 may perform detection by treating the topmost block among the blocks where motion stopped, or the area at the topmost motion vector coordinate, as the area that contains the detection target. This reflects the fact that, in a standing posture, the way the arm and hand move places the hand in the topmost block area. The device control unit 45 can recognize the gesture by extracting features from the block area judged to contain the detection target, or by detecting a candidate area within that block area with a sliding window method and then applying a classification method.
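The block-level logic can be sketched as follows, assuming the motion vectors have already been extracted by an optical flow algorithm. The magnitude threshold of 4.5 is the example value given for FIG. 13; the minimum count of 3, the (row, column) block indexing with row 0 at the top of the image, and the function names are assumptions made for this sketch.

```python
import math


def is_moving_block(vectors, magnitude_threshold=4.5, min_count=3):
    """A block is moving when enough of its motion vectors are large.

    vectors is a list of (dx, dy) motion vectors sampled inside one block.
    min_count is a hypothetical value; the description only says "a certain value".
    """
    large = sum(1 for dx, dy in vectors if math.hypot(dx, dy) >= magnitude_threshold)
    return large > min_count


def topmost_stopped_block(previously_moving, currently_moving):
    """Pick the topmost block that was moving in the last frame and has now stopped.

    Blocks are (row, col) tuples with row 0 at the top of the image.
    Returns None when no block has stopped.
    """
    stopped = [block for block in previously_moving if block not in currently_moving]
    return min(stopped) if stopped else None


block_vectors = [(5, 0), (0, 6), (3, 4), (4, 4), (1, 1)]
print(is_moving_block(block_vectors))
print(topmost_stopped_block({(2, 3), (5, 1)}, {(5, 1)}))
```

Choosing the block with the smallest row index mirrors the observation that a standing user's hand tends to be the highest moving part of the arm.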

Referring to FIG. 13, table 451 shows an example of motion block coordinates and optical flow motion vector values. For example, one feature point may be extracted per 10×10 area, and a point may be judged to be moving when the magnitude of its motion vector is 4.5 or more. Screen 453 shows an example of the part where the motion changes, the detection-target block, and the detected hand region.

Although this specification contains many features, those features should not be construed as limiting the scope of the invention or of the claims. Features described in separate embodiments in this specification may be implemented in combination in a single embodiment. Conversely, various features described in a single embodiment may be implemented separately in different embodiments, or in any suitable combination.

Although operations are described in a particular order in the drawings, this should not be understood to mean that the operations must be performed in the particular order shown or in sequential order, or that all of the described operations must be performed, in order to obtain the desired result. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, the separation of various system components in the embodiments described above should not be understood as being required in all embodiments. The program components and systems described above can generally be implemented as a single software product or packaged into multiple software products.

The method of the present invention as described above may be implemented as a program and stored in computer-readable form on a recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, etc.). Because a person of ordinary skill in the art to which the present invention pertains can readily carry out this process, it is not described in further detail.

Because a person of ordinary skill in the art to which the present invention pertains can make various substitutions, modifications, and changes without departing from the technical spirit of the present invention, the present invention described above is not limited by the foregoing embodiments or the accompanying drawings.

2: convolutional neural network 21: convolutional layer
211: first convolutional layer 213: second convolutional layer
215: third convolutional layer 23: fully connected layer

Claims

1. A gesture classifier for learning parameters of a convolutional neural network for hand gesture detection, comprising:
a convolutional neural network composed of a plurality of convolutional layers that compute feature maps by performing convolution operations, and a fully connected layer that classifies a detection image by analyzing the feature maps computed by the plurality of convolutional layers; and
a learning engine that trains the convolutional neural network to compute parameters optimized for hand gesture detection,
wherein the plurality of convolutional layers include: a first convolutional layer including a sub-sampling layer that reduces the size of the feature map computed as a result of a convolution operation based on the detection image; a second convolutional layer implemented as a non-sub-sampling layer so as to keep the size of the feature map the same, and repeating the convolution operation based on the output of the first convolutional layer; and a third convolutional layer including a sub-sampling layer that reduces the size of the feature map computed as a result of a convolution operation based on the output of the second convolutional layer, and
wherein the type, number, and size of the filters that perform the convolution operation are configured independently for each of the plurality of convolutional layers, and the type, number, and size of the filters included in the plurality of convolutional layers are computed individually through training by the learning engine.

2. The gesture classifier of claim 1,
wherein the first convolutional layer and the third convolutional layer also perform padding so that, when the convolution operation is performed, the size of the output remains the same as the size of the input, and
wherein the padding parameter is configured independently for each of the plurality of convolutional layers, and the size of the padding parameter is computed individually for each of the plurality of convolutional layers through training by the learning engine.

3. The gesture classifier of claim 2,
wherein the first convolutional layer and the third convolutional layer
pass to the sub-sampling layer the output obtained by sequentially applying normalization and a nonlinear function to the feature map computed by the convolution operation.

4. The gesture classifier of claim 3,
wherein the second convolutional layer
includes a plurality of convolutional layers, each of which performs padding so that the size of the output remains the same as the size of the input when the convolution operation is performed, and computes its output by sequentially applying normalization and a nonlinear function to the feature map computed by the convolution operation, and
wherein the plurality of convolutional layers constituting the second convolutional layer are combined in series.

5. The gesture classifier of claim 4,
wherein the number of the plurality of convolutional layers constituting the second convolutional layer
is computed through training by the learning engine.

6. The gesture classifier of claim 3,
wherein the second convolutional layer includes:
a first parallel layer that computes a first feature map by applying normalization to the feature map computed by a convolution operation on the output of the first convolutional layer;
a second parallel layer including a first layer that sequentially applies normalization and a nonlinear function to the feature map computed by a convolution operation on the output of the first convolutional layer, and a second layer that computes a second feature map by applying normalization to the feature map computed by a convolution operation on the output of the first layer, the first layer and the second layer being combined in series;
a fusion layer that performs a sum operation on the first feature map and the second feature map; and
a noise reduction layer that applies a nonlinear function to the output of the fusion layer.

7. The gesture classifier of claim 6,
wherein the second layer also performs padding so that the size of the output remains the same as the size of the input when the convolution operation is performed, and
wherein the size of the padding parameter is computed individually for each of the plurality of convolutional layers through training by the learning engine.

(Deleted)