KR20220153929A

KR20220153929A - Deep learning method and apparatus based on tsn for real-time hand gesture recognition in video

Info

Publication number: KR20220153929A
Application number: KR1020210061541A
Authority: KR
Inventors: 신병석; 조종경; 이연
Original assignee: 인하대학교 산학협력단
Priority date: 2021-05-12
Filing date: 2021-05-12
Publication date: 2022-11-21
Also published as: KR102587234B1

Abstract

Disclosed are a method and an apparatus for temporal segment network (TSN)-based deep learning for real-time hand gesture recognition in a video. According to an embodiment, a method for TSN-based deep learning for real-time hand gesture recognition in video is performed by a computer device. The method comprises the steps of: segmenting the entire video according to a time series image or a spatial image in a TSN for hand gesture recognition; performing deep learning on the time series image and deep learning on the spatial image, respectively; integrating the time series images on which the deep learning is performed and the spatial images on which the deep learning is performed; and performing final classification using the integrated spatial images and the integrated time series images. The present invention recognizes motions in which several 2D actions are connected in the 3D video and recognizes hand gestures in real time based on the deep learning.

Description

TSN-based deep learning method and apparatus for real-time hand gesture recognition in video

The following embodiments relate to a hand gesture recognition method and apparatus, and more specifically to a temporal segment network (TSN)-based deep learning method and apparatus for real-time hand gesture recognition in video.

Mobile phones are a large part of daily life, and the share of Internet traffic generated by mobile phones now exceeds that of desktops. With this shift, the way people interact with these devices has evolved from keyboard and mouse to gestures, voice, and brain-computer interfaces. A voice interface, for example, is inaccurate or hard to use in noisy outdoor environments and in settings where quiet is required, so hand gestural interfaces are currently used in combination with voice.

In addition, there is growing demand to use the recognition of various hand gestures as program input in place of keyboard or touch-panel input. For example, if recognized hand gestures can serve as commands while an autonomous vehicle is operating in a complex environment, actions such as pressing a button can be avoided, which reduces the risk of an accident.

The existing temporal segment network (TSN) technique takes an entire video as input and outputs the subject of that video, which makes it unsuitable for applications that recognize hand gestures in real time and use them as commands. A technique that recognizes and outputs hand gestures for each time interval is therefore needed.

Korean Patent Registration No. 10-2121654 describes a deep learning-based method and system for automatic gesture recognition.

Korean Patent Registration No. 10-2121654

The embodiments describe a TSN-based deep learning method and apparatus for real-time hand gesture recognition in video. More specifically, they provide a real-time deep learning technique based on temporal segment networks (TSN) that extends hand gesture recognition from simple 2D images to 3D video, in which a motion is recognized as a sequence of several connected 2D actions.

The embodiments aim to provide a TSN-based deep learning method and apparatus for real-time hand gesture recognition in video that can recognize a motion composed of several connected 2D actions in a 3D video and recognize hand gestures in real time based on deep learning.

According to an embodiment, a temporal segment network (TSN)-based deep learning method for real-time hand gesture recognition in video, performed by a computer device, may include: zoning an entire video according to time-series images or spatial images in a TSN for hand gesture recognition; performing deep learning on the time-series images and deep learning on the spatial images, respectively; integrating the time-series images on which deep learning has been performed and integrating the spatial images on which deep learning has been performed; and performing final classification using the integrated time-series images and the integrated spatial images.

The zoning may include: dividing the video into a plurality of short snippets; and zoning each of the short snippets according to time-series images or spatial images.

In the final classification, whether the video contains a hand gesture may be classified using the integrated time-series images, the integrated spatial images, and pre-stored information.

According to another embodiment, a temporal segment network (TSN)-based deep learning apparatus for real-time hand gesture recognition in video may include: an image zoning unit that zones an entire video according to time-series images or spatial images in a TSN for hand gesture recognition; an image deep learning unit that performs deep learning on the time-series images and deep learning on the spatial images, respectively; an image integration unit that integrates the time-series images on which deep learning has been performed and integrates the spatial images on which deep learning has been performed; and a hand gesture recognition unit that performs final classification using the integrated time-series images and the integrated spatial images.

The image zoning unit may include: a first zoning unit that divides the video into a plurality of short snippets; and a second zoning unit that zones each of the short snippets according to time-series images or spatial images.

According to the embodiments, it is possible to provide a TSN-based deep learning method and apparatus for real-time hand gesture recognition in video that can recognize a motion composed of several connected 2D actions in a 3D video and recognize hand gestures in real time based on deep learning.

FIG. 1A is a diagram schematically illustrating a TSN structure for real-time hand gesture recognition according to an embodiment.
FIG. 1B is a diagram illustrating an example of a two-stream TSN structure for real-time hand gesture recognition according to an embodiment.
FIG. 2 is a diagram for explaining a 2D module according to an embodiment.
FIG. 3 is a diagram for explaining a 3D module according to an embodiment.
FIG. 4 is a flowchart illustrating a TSN-based deep learning method for hand gesture recognition according to an embodiment.
FIG. 5 is a block diagram illustrating a TSN-based deep learning apparatus for hand gesture recognition according to an embodiment.

Hereinafter, embodiments are described with reference to the accompanying drawings. The described embodiments may, however, be modified in many different forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art. The shapes and sizes of elements in the drawings may be exaggerated for clarity.

The following embodiments relate to a TSN-based deep learning method and apparatus for real-time hand gesture recognition in video. They propose a real-time deep learning technique based on temporal segment networks (TSN) that extends hand gesture recognition from simple 2D images to 3D video, in which a motion is recognized as a sequence of several connected 2D actions.

FIG. 1A is a diagram schematically illustrating a TSN structure for real-time hand gesture recognition according to an embodiment.

Referring to FIG. 1A, the TSN 130 for hand gesture (hand motion) recognition zones the entire video 110, runs a CNN on the time-series images and a CNN on the same-space images separately, integrates (140) the results of each, and then performs final classification (150). Here, the entire video 110 may be divided into a plurality of short snippets 120 and then zoned into time-series images and same-space images.
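As a minimal sketch of the zoning described above, the Python below splits a video's frame indices into m contiguous zones and draws one short snippet from each. The function names, the near-equal split, and the snippet length are illustrative assumptions; the source does not specify how zones or snippets are sized.

```python
import random

def split_into_zones(num_frames, m):
    """Split frame indices 0..num_frames-1 into m contiguous, near-equal zones."""
    if m <= 0 or num_frames < m:
        raise ValueError("need at least one frame per zone")
    bounds = [i * num_frames // m for i in range(m + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(m)]

def sample_snippets(zones, snippet_len, rng):
    """Pick one short snippet (a run of consecutive frame indices) per zone."""
    snippets = []
    for zone in zones:
        length = min(snippet_len, len(zone))
        start = rng.randrange(len(zone) - length + 1)
        snippets.append(zone[start:start + length])
    return snippets

# Hypothetical example: a 48-frame video, 3 zones, 4-frame snippets.
zones = split_into_zones(num_frames=48, m=3)
snippets = sample_snippets(zones, snippet_len=4, rng=random.Random(0))
```

Each snippet would then be turned into a spatial input and a time-series input before being passed to the corresponding CNN.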

For example, when a video V is divided into m zones, it may be partitioned into {S_1, S_2, ..., S_m}. The TSN can then be expressed as in the following equation.

[Equation 1]

$$\mathrm{TSN}(T_1, T_2, \ldots, T_m) = H\big(G\big(F(T_1; W), F(T_2; W), \ldots, F(T_m; W)\big)\big)$$

Here, T_m is a zone randomly sampled from the m zones, and F(T_m; W) is the CNN deep learning function that takes T_m as input, has W as its parameters, and outputs a classification result. G is the zone integration function, which combines the results of all zones. H is the function that classifies which hand gesture the video contains.
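The roles of F, G, and H can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration: G is taken to be a per-class average, H a softmax, and `toy_F` a hypothetical linear scorer standing in for the CNN; the source only states that G integrates the per-zone results and H classifies the gesture.

```python
import math

def softmax(scores):
    """H (assumed): turn integrated class scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def average_consensus(per_zone_scores):
    """G (assumed): average the class scores over all zones."""
    m = len(per_zone_scores)
    return [sum(col) / m for col in zip(*per_zone_scores)]

def tsn(snippets, F, W, G=average_consensus, H=softmax):
    """Compose H(G(F(T_1; W), ..., F(T_m; W))) with shared parameters W."""
    return H(G([F(T, W) for T in snippets]))

def toy_F(T, W):
    """Hypothetical stand-in for the CNN: one linear score per class."""
    return [sum(w * x for w, x in zip(row, T)) for row in W]

W = [[1.0, 0.0], [0.0, 1.0]]                      # shared parameters
snippets = [[2.0, 0.0], [1.0, 1.0], [3.0, 0.0]]   # toy feature vectors
probs = tsn(snippets, toy_F, W)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Because the same W is passed to every call of F, all zones share one set of network parameters, and only the integrated result is classified.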

FIG. 1B is a diagram illustrating an example of a two-stream TSN structure for real-time hand gesture recognition according to an embodiment.

FIG. 1B shows the framework of the two-stream hand gesture recognition network (top) and the structure of the inflated 3D ResNet50 (bottom). A two-stream hand gesture recognition network based on 3D ResNets may be proposed as an example of the extended 3D module. In the network framework, the input hand gesture video is divided into segments, and a short snippet may be selected from each segment for each frame. Here, = 2, = 24 can be set. The class scores of the different short snippets may be fused to produce the final prediction. The network contains two inflated 3D ResNets, which may share parameters. During training, for each inflated 3D ResNet, a softmax operation may be applied to the output of the fully connected layer, and the cross-entropy loss may then be used to compute the error between the model output and the label.
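The training loss and the score fusion described above can be sketched in plain Python. The logits below are made-up numbers standing in for the fully connected layer outputs of the two inflated 3D ResNets, and fusing by averaging is an assumption; the source only says that class scores are fused.

```python
import math

def log_softmax(logits):
    """Numerically stable log of the softmax over the class logits."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_total for z in logits]

def cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    return -log_softmax(logits)[label]

def fuse_scores(score_lists):
    """Fuse class scores from several snippets by averaging (assumed)."""
    return [sum(col) / len(score_lists) for col in zip(*score_lists)]

# Hypothetical fully connected outputs for two snippets, three gesture classes.
logits_a = [2.0, 0.5, -1.0]
logits_b = [1.0, 1.5, 0.0]
label = 0

# One loss term per network output, summed for the training step.
loss = cross_entropy(logits_a, label) + cross_entropy(logits_b, label)
fused = fuse_scores([logits_a, logits_b])
prediction = max(range(len(fused)), key=fused.__getitem__)
```

In a real training loop a deep learning framework would compute the same loss and its gradients; the sketch only shows what is being measured.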

FIG. 2 is a diagram for explaining a 2D module according to an embodiment, and FIG. 3 is a diagram for explaining a 3D module according to an embodiment.

FIG. 2 shows an example of a 2D module (residual module), and FIG. 3 shows an example of an extended 3D module. Internally, each frame is a 2D image, and the deep learning process may convert 2D to 3D as follows. All filters and pooling kernels of the 2D process may be extended to give them a temporal dimension. A filter is typically N*N and may be extended to N*N*N. A temporal dimension can thus be added to the 2D module, and the result can be used as input to the learning module.
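One common way to extend an N*N filter to N*N*N is to repeat it N times along the new temporal axis and divide by N, so that a video of identical frames produces the same response as the original 2D filter did on a single frame. The source does not say how the new weights are set, so the rescaling in this sketch is an assumption, and the Sobel kernel is just an example filter.

```python
def inflate_kernel(kernel_2d):
    """Inflate an N x N kernel to N x N x N (time, height, width).

    Assumption: the 2D weights are repeated along time and divided by N.
    """
    n = len(kernel_2d)
    if any(len(row) != n for row in kernel_2d):
        raise ValueError("expected a square N x N kernel")
    return [[[w / n for w in row] for row in kernel_2d] for _ in range(n)]

def temporal_sum(kernel_3d):
    """Collapse the temporal axis by summation, giving an N x N kernel."""
    n = len(kernel_3d)
    return [[sum(kernel_3d[t][i][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

# Example 3 x 3 filter (a Sobel edge filter) inflated to 3 x 3 x 3.
sobel_x = [[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]]
inflated = inflate_kernel(sobel_x)
collapsed = temporal_sum(inflated)  # approximately equal to sobel_x
```

Pooling kernels can be extended the same way by giving them a temporal extent, although they carry no weights to rescale.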

FIG. 4 is a block diagram for explaining an example of the internal configuration of a computer system according to an embodiment.

For example, the TSN-based deep learning apparatus for hand gesture recognition according to embodiments of the present invention may be implemented through the computer system (device) 400 of FIG. 4. As shown in FIG. 4, the computer system 400 may include, as components for executing the TSN-based deep learning method for hand gesture recognition, a processor 410, a memory 420, a persistent storage device 430, a bus 440, an input/output interface 450, and a network interface 460.

The processor 410 may include, or be part of, any device capable of processing a sequence of instructions. The processor 410 may include, for example, a computer processor, a processor in a mobile device or other electronic device, and/or a digital processor. The processor 410 may be included in, for example, a server computing device, a server computer, a series of server computers, a server farm, a cloud computer, a content platform, a mobile computing device, a smartphone, a tablet, a set-top box, or a media player. The processor 410 may be connected to the memory 420 through the bus 440.

The memory 420 may include volatile, persistent, virtual, or other memory for storing information used by or output by the computer system 400. The memory 420 may include, for example, random access memory (RAM) and/or dynamic RAM (DRAM). The memory 420 may be used to store any information, such as state information of the computer system 400. The memory 420 may also be used to store instructions of the computer system 400, including, for example, instructions for TSN-based deep learning for hand gesture recognition. The computer system 400 may include one or more processors 410 as needed or appropriate.

The bus 440 may include a communication infrastructure that enables interaction among the various components of the computer system 400. The bus 440 may carry data between components of the computer system 400, for example between the processor 410 and the memory 420. The bus 440 may include wireless and/or wired communication media between components of the computer system 400, and may include parallel, serial, or other topological arrangements.

The persistent storage device 430 may include components, such as memory or another persistent storage device, used by the computer system 400 to store data for an extended period (for example, compared to the memory 420). The persistent storage device 430 may include non-volatile main memory as used by the processor 410 in the computer system 400. The persistent storage device 430 may include, for example, flash memory, a hard disk, an optical disk, or another computer-readable medium.

The input/output interface 450 may include interfaces to a keyboard, a mouse, voice command input, a display, or other input or output devices. Configuration commands and/or information for TSN-based deep learning for hand gesture recognition may be received through the input/output interface 450.

The network interface 460 may include one or more interfaces to networks such as a local area network or the Internet. The network interface 460 may include interfaces for wired or wireless connections. Configuration commands and/or information for TSN-based deep learning for hand gesture recognition may be received through the network interface 460.

In other embodiments, the computer system 400 may include more components than those shown in FIG. 4. However, most prior-art components need not be shown explicitly. For example, the computer system 400 may be implemented to include at least some of the input/output devices connected to the input/output interface 450 described above, or may further include other components such as a transceiver, a Global Positioning System (GPS) module, a camera, various sensors, and a database.

FIG. 5 is a block diagram illustrating a TSN-based deep learning apparatus for hand gesture recognition according to an embodiment.

FIG. 5 shows an example of components that the processor 410 of the computer system 400 of FIG. 4 may include according to an embodiment. Here, the processor 410 of the computer system 400 may include the TSN-based deep learning apparatus 500 for hand gesture recognition according to an embodiment. The TSN-based deep learning apparatus 500 for hand gesture recognition according to an embodiment may include an image zoning unit 510, an image deep learning unit 520, an image integration unit 530, and a hand gesture recognition unit 540. Here, the image zoning unit 510 may include a first zoning unit 511 and a second zoning unit 512.

The processor 410 and its components may perform steps S110 to S140 of the TSN-based deep learning method for hand gesture recognition of FIG. 6. For example, the processor 410 and its components may be implemented to execute instructions according to the code of the operating system included in the memory 420 and the at least one program code described above. Here, the at least one program code may correspond to the code of a program implemented to process the TSN-based deep learning method for hand gesture recognition.

The steps of the TSN-based deep learning method for hand gesture recognition need not occur in the order shown, and some steps may be omitted or additional steps may be included.

FIG. 6 is a flowchart illustrating a TSN-based deep learning method for hand gesture recognition according to an embodiment.

Referring to FIG. 6, a temporal segment network (TSN)-based deep learning method for real-time hand gesture recognition in video, performed by a computer device according to an embodiment, may include: zoning an entire video according to time-series images or spatial images in a TSN for hand gesture recognition (S110); performing deep learning on the time-series images and deep learning on the spatial images, respectively (S120); integrating the time-series images on which deep learning has been performed and integrating the spatial images on which deep learning has been performed (S130); and performing final classification using the integrated time-series images and the integrated spatial images (S140).

Here, the zoning (S110) may include dividing the video into a plurality of short snippets, and zoning each of the short snippets according to time-series images or spatial images.

Each step of the TSN-based deep learning method for hand gesture recognition according to an embodiment is described below.

The TSN-based deep learning method for hand gesture recognition according to an embodiment is described in more detail using the TSN-based deep learning apparatus for hand gesture recognition according to an embodiment as an example.

In step S110, the image zoning unit 510 may zone the entire video according to time-series images or spatial images in the TSN for hand gesture recognition.

Here, the image zoning unit 510 may include a first zoning unit 511 and a second zoning unit 512. More specifically, the first zoning unit 511 may divide the video into a plurality of short snippets. The second zoning unit 512 may then zone each of the short snippets according to time-series images or spatial images.

In step S120, the image deep learning unit 520 may perform deep learning on the time-series images and deep learning on the spatial images, respectively.

In step S130, the image integration unit 530 may integrate the time-series images on which deep learning has been performed and integrate the spatial images on which deep learning has been performed.

In step S140, the hand gesture recognition unit 540 may perform final classification using the integrated time-series images and the integrated spatial images. Here, the hand gesture recognition unit 540 may classify whether the video contains a hand gesture using the integrated time-series images, the integrated spatial images, and pre-stored information.
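Steps S110 to S140 can be strung together in a toy Python pipeline. This is only a sketch under loud assumptions: frames are flat lists of pixel values, the spatial input is the first frame of each snippet, the time-series input is the frame-to-frame difference, `spatial_net` and `temporal_net` are hypothetical hand-written scorers standing in for the two CNNs, integration is a per-class average, and `GESTURES` is a made-up stand-in for the pre-stored information.

```python
GESTURES = ["no_gesture", "swipe"]  # hypothetical pre-stored labels

def zone(video, m):
    """S110: split into m snippets, then build a spatial and a time-series input."""
    size = len(video) // m
    zoned = []
    for k in range(m):
        snippet = video[k * size:(k + 1) * size]
        spatial = snippet[0]
        temporal = [[b - a for a, b in zip(f0, f1)]
                    for f0, f1 in zip(snippet, snippet[1:])]
        zoned.append((spatial, temporal))
    return zoned

def spatial_net(frame):
    """Hypothetical spatial scorer: mean brightness as gesture evidence."""
    brightness = sum(frame) / len(frame)
    return [1.0 - brightness, brightness]

def temporal_net(diffs):
    """Hypothetical time-series scorer: mean absolute frame difference."""
    count = max(1, sum(len(d) for d in diffs))
    motion = sum(abs(v) for d in diffs for v in d) / count
    return [1.0 - motion, motion]

def integrate(score_lists):
    """S130: integrate one stream's scores over all snippets (assumed average)."""
    return [sum(col) / len(score_lists) for col in zip(*score_lists)]

def recognize(video, m=2):
    zoned = zone(video, m)                                     # S110
    spatial_scores = [spatial_net(s) for s, _ in zoned]        # S120
    temporal_scores = [temporal_net(t) for _, t in zoned]      # S120
    spatial = integrate(spatial_scores)                        # S130
    temporal = integrate(temporal_scores)                      # S130
    fused = [a + b for a, b in zip(spatial, temporal)]         # S140
    return GESTURES[max(range(len(fused)), key=fused.__getitem__)]

still = [[0.1, 0.1]] * 4
moving = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
still_result = recognize(still)
moving_result = recognize(moving)
```

The four commented stages correspond to the image zoning unit 510, the image deep learning unit 520, the image integration unit 530, and the hand gesture recognition unit 540.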

As described above, according to the embodiments, hand gestures can be recognized in real time by recognizing a motion composed of several connected 2D actions in a 3D video and performing deep learning after temporal or spatial zoning.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but those of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may command the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described with reference to a limited set of examples and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

A temporal segment network (TSN)-based deep learning method for real-time hand gesture recognition in video, performed by a computer device, the method comprising:
zoning an entire video according to time-series images or spatial images in a TSN for hand gesture recognition;
performing deep learning on the time-series images and deep learning on the spatial images, respectively;
integrating the time-series images on which deep learning has been performed and integrating the spatial images on which deep learning has been performed; and
performing final classification using the integrated time-series images and the integrated spatial images.

The TSN-based deep learning method for hand gesture recognition of claim 1,
wherein the zoning comprises:
dividing the video into a plurality of short snippets; and
zoning each of the short snippets according to time-series images or spatial images.

The TSN-based deep learning method for hand gesture recognition of claim 1,
wherein the final classification comprises
classifying whether the video contains a hand gesture using the integrated time-series images, the integrated spatial images, and pre-stored information.

A temporal segment network (TSN)-based deep learning apparatus for real-time hand gesture recognition in video, the apparatus comprising:
an image zoning unit that zones an entire video according to time-series images or spatial images in a TSN for hand gesture recognition;
an image deep learning unit that performs deep learning on the time-series images and deep learning on the spatial images, respectively;
an image integration unit that integrates the time-series images on which deep learning has been performed and integrates the spatial images on which deep learning has been performed; and
a hand gesture recognition unit that performs final classification using the integrated time-series images and the integrated spatial images.

The TSN-based deep learning apparatus for hand gesture recognition of claim 4,
wherein the image zoning unit comprises:
a first zoning unit that divides the video into a plurality of short snippets; and
a second zoning unit that zones each of the short snippets according to time-series images or spatial images.