KR102587234B1

KR102587234B1 - Deep learning method and apparatus based on tsn for real-time hand gesture recognition in video

Info

Publication number: KR102587234B1
Application number: KR1020210061541A
Authority: KR
Inventors: 신병석; 조종경; 이연
Original assignee: 인하대학교 산학협력단
Priority date: 2021-05-12
Filing date: 2021-05-12
Publication date: 2023-10-10
Also published as: KR20220153929A

Abstract

A TSN-based deep learning method and apparatus for real-time hand gesture recognition in video are presented. According to an embodiment, a deep learning method based on TSN (Temporal Segment Networks) for real-time hand gesture recognition in video, performed by a computer device, may include: zoning an entire video according to time-series images or spatial images in a TSN for hand gesture recognition; performing deep learning on the time-series images and deep learning on the spatial images, respectively; integrating the time-series images on which deep learning has been performed and integrating the spatial images on which deep learning has been performed; and performing final classification using the integrated time-series images and the integrated spatial images.

Description

TSN-based deep learning method and device for real-time hand gesture recognition in videos {DEEP LEARNING METHOD AND APPARATUS BASED ON TSN FOR REAL-TIME HAND GESTURE RECOGNITION IN VIDEO}

The following embodiments relate to a hand gesture recognition method and apparatus, and more specifically to a TSN-based deep learning method and apparatus for real-time hand gesture recognition in video.

Mobile phones are a large part of daily life, and the share of Internet traffic generated by mobile phones now exceeds that of desktops. With this shift, the way people interact with such devices has evolved from keyboard and mouse to gesture, voice, and brain-computer interfaces. A voice interface, for example, is inaccurate or hard to use in noisy outdoor environments and in environments that require silence, so a combination of a hand gestural interface and voice is currently used.

In addition, there is growing demand to replace keyboard or touch-panel input by using the recognition of various hand gestures as program input. For example, if hand gesture recognition can serve as a command input while operating a self-driving car in a complex environment, actions such as pressing buttons can be avoided, which reduces the risk of accidents.

The existing TSN (Temporal Segment Networks) technique takes an entire video as input and outputs the subject of that video. This is not suitable for applications that recognize hand gestures in real time and use them as commands. A technique that can recognize and output hand gestures for each time interval is therefore needed.

Korean Patent No. 10-2121654 describes technology relating to such a deep learning-based automatic gesture recognition method and system.

Korean Patent No. 10-2121654

The embodiments describe a TSN-based deep learning method and apparatus for real-time hand gesture recognition in video. More specifically, they provide a real-time deep learning technique based on TSN (Temporal Segment Networks), which extends hand gesture recognition from simple two-dimensional images to three-dimensional video, where a motion is recognized as a sequence of connected 2D actions.

An object of the embodiments is to provide a TSN-based deep learning method and apparatus for real-time hand gesture recognition in video that can recognize a motion formed by multiple connected 2D actions in a three-dimensional video and recognize hand gestures in real time based on deep learning.

According to an embodiment, a deep learning method based on TSN (Temporal Segment Networks) for real-time hand gesture recognition in video, performed by a computer device, may include: zoning an entire video according to time-series images or spatial images in a TSN for hand gesture recognition; performing deep learning on the time-series images and deep learning on the spatial images, respectively; integrating the time-series images on which deep learning has been performed and integrating the spatial images on which deep learning has been performed; and performing final classification using the integrated time-series images and the integrated spatial images.

The zoning step may include: dividing the video into a plurality of short snippets; and zoning each of the short snippets according to time-series images or spatial images.

In the final classification step, the integrated time-series images, the integrated spatial images, and pre-stored information may be used to classify whether the video contains a hand gesture.

According to another embodiment, a deep learning apparatus based on TSN (Temporal Segment Networks) for real-time hand gesture recognition in video may include: an image zoning unit that zones an entire video according to time-series images or spatial images in a TSN for hand gesture recognition; an image deep learning unit that performs deep learning on the time-series images and deep learning on the spatial images, respectively; an image integration unit that integrates the time-series images on which deep learning has been performed and integrates the spatial images on which deep learning has been performed; and a hand gesture recognition unit that performs final classification using the integrated time-series images and the integrated spatial images.

The image zoning unit may include: a first zoning unit that divides the video into a plurality of short snippets; and a second zoning unit that zones each of the short snippets according to time-series images or spatial images.

According to the embodiments, a TSN-based deep learning method and apparatus for real-time hand gesture recognition in video can be provided that recognize a motion formed by multiple connected 2D actions in a three-dimensional video and recognize hand gestures in real time based on deep learning.

FIG. 1A is a diagram schematically showing a TSN structure for real-time hand gesture recognition according to an embodiment.
FIG. 1B is a diagram showing an example of a two-stream TSN structure for real-time hand gesture recognition according to an embodiment.
FIG. 2 is a diagram for explaining a two-dimensional module according to an embodiment.
FIG. 3 is a diagram for explaining a three-dimensional module according to an embodiment.
FIG. 4 is a flowchart showing a TSN-based deep learning method for hand gesture recognition according to an embodiment.
FIG. 5 is a block diagram showing a TSN-based deep learning apparatus for hand gesture recognition according to an embodiment.

Hereinafter, embodiments are described with reference to the accompanying drawings. The described embodiments may be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those with ordinary knowledge in the art. The shapes and sizes of elements in the drawings may be exaggerated for clarity.

The following embodiments relate to a TSN-based deep learning method and apparatus for real-time hand gesture recognition in video. They propose a real-time deep learning technique based on TSN (Temporal Segment Networks), which extends hand gesture recognition from simple two-dimensional images to three-dimensional video, where a motion is recognized as a sequence of connected 2D actions.

FIG. 1A is a diagram schematically showing a TSN structure for real-time hand gesture recognition according to an embodiment.

Referring to FIG. 1A, the TSN 130 for hand gesture (hand motion) recognition zones the entire video 110, runs a CNN on the time-series images and a CNN on the same-space images separately, integrates each result (140), and then performs final classification (150). Here, the entire video 110 may be divided into a plurality of short snippets 120 and then zoned into time-series images and same-space images.

For example, when a video V is divided into m sections, it can be written as {S₁, S₂, ···, S_m}. The TSN can then be expressed by the following equation.

[Equation 1]

$$\mathrm{TSN}(T_1, T_2, \dots, T_m) = H\big(G\big(F(T_1; W), F(T_2; W), \dots, F(T_m; W)\big)\big)$$

Here, T_m is a snippet randomly sampled from the m sections, and F(T_m; W) is the CNN deep learning function that takes T_m as input, has W as its parameters, and outputs a classification result. G is the section integration function, which integrates the results of all sections. H is the function that classifies which hand gesture the video contains.
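The composition of F, G, and H described above can be sketched in a few lines of Python. This is a minimal illustration and not the patented implementation: averaging for G and softmax for H are assumptions (the text only says that G integrates and H classifies), and `toy_f` is a hypothetical stand-in for the CNN F(T; W).

```python
import math
from typing import Callable, List, Sequence

Scores = List[float]


def consensus_average(per_snippet: Sequence[Scores]) -> Scores:
    """G (assumed form): average the class scores over all snippets."""
    count = len(per_snippet)
    return [sum(column) / count for column in zip(*per_snippet)]


def softmax(scores: Scores) -> Scores:
    """H (assumed form): map consensus scores to class probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tsn(snippets: Sequence[Sequence[float]],
        f: Callable[[Sequence[float]], Scores]) -> Scores:
    """Evaluate H(G(F(T_1; W), ..., F(T_m; W))) for the given snippets."""
    return softmax(consensus_average([f(t) for t in snippets]))


def toy_f(snippet: Sequence[float]) -> Scores:
    """Hypothetical stand-in for the CNN: returns two class scores."""
    mean = sum(snippet) / len(snippet)
    return [mean, 1.0 - mean]


# Three toy snippets, each a short list of numbers instead of real frames.
probs = tsn([[0.9, 0.8], [0.7, 0.9], [0.8, 0.8]], toy_f)
```

Because G operates only on per-snippet scores, the same F (with shared parameters W) can be applied to every snippet independently.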

FIG. 1B is a diagram showing an example of a two-stream TSN structure for real-time hand gesture recognition according to an embodiment.

FIG. 1B shows the framework of a two-stream hand gesture recognition network (top) and the structure of an inflated 3D ResNet50 (bottom). As an example of the extended three-dimensional module, a two-stream hand gesture recognition network based on 3D ResNets may be proposed. In the network framework, the input hand gesture video is divided into segments, and a short snippet may be selected from each segment every frame. Here, = 2 and = 24 may be set. The class scores of the different short snippets may be fused to produce the final prediction. The network contains two inflated 3D ResNets, which may share parameters. During training, for each inflated 3D ResNet, a softmax operation is applied to the output of the fully connected layer, and the cross-entropy loss is then used to compute the error between the model output and the labels.
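The training loss mentioned above, softmax over the fully connected layer output followed by cross-entropy against the label, can be written out as a short sketch. The function name and the example logits are illustrative assumptions. The log-sum-exp form is a standard numerically stable way to compute the same quantity and is not taken from the patent.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Cross-entropy of softmax(logits) against the integer class `label`.

    Equivalent to -log(softmax(logits)[label]), computed as
    logsumexp(logits) - logits[label] to avoid overflow.
    """
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[label]


# Hypothetical fully connected layer output for three gesture classes.
example_logits = [2.0, 0.5, -1.0]
loss_if_class_0 = cross_entropy(example_logits, 0)
loss_if_class_2 = cross_entropy(example_logits, 2)
```

The loss is small when the largest logit belongs to the labeled class and grows as the labeled class receives a lower score.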

FIG. 2 is a diagram for explaining a two-dimensional module according to an embodiment, and FIG. 3 is a diagram for explaining a three-dimensional module according to an embodiment.

FIG. 2 shows an example of a two-dimensional module (residual module), and FIG. 3 shows an example of the extended three-dimensional module. Internally, each frame is a 2D image, and the 2D image may be converted to 3D during deep learning as follows. All filters and pooling kernels of the 2D process are extended to give them a time dimension. A filter is typically N*N and may be extended to N*N*N. In this way, a time dimension is added to the two-dimensional module, and the result may be used as input to the learning module.
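The extension of an N*N filter to N*N*N can be illustrated with plain nested lists. The sketch below repeats the 2D weights along the time axis and divides them by the temporal depth. That rescaling is an assumption borrowed from common kernel-inflation practice and is not stated in the text. With it, the inflated filter gives the same response on a clip of identical frames as the 2D filter gives on a single frame.

```python
from typing import List

Kernel2D = List[List[float]]
Kernel3D = List[List[List[float]]]


def inflate_kernel(kernel_2d: Kernel2D, depth: int) -> Kernel3D:
    """Copy an N x N kernel `depth` times along time, scaled by 1/depth."""
    return [[[w / depth for w in row] for row in kernel_2d]
            for _ in range(depth)]


def respond_2d(kernel: Kernel2D, frame: Kernel2D) -> float:
    """Response of a 2D kernel at one position (element-wise product sum)."""
    return sum(w * p
               for k_row, f_row in zip(kernel, frame)
               for w, p in zip(k_row, f_row))


def respond_3d(kernel: Kernel3D, clip: List[Kernel2D]) -> float:
    """Response of a 3D kernel on a clip: sum of per-frame 2D responses."""
    return sum(respond_2d(k_slice, frame)
               for k_slice, frame in zip(kernel, clip))


# A 3 x 3 edge-like filter inflated to 3 x 3 x 3 (depth equal to N).
kernel_2d = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
kernel_3d = inflate_kernel(kernel_2d, depth=3)
```

Pooling kernels can be extended along the time axis in the same way.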

FIG. 4 is a block diagram for explaining an example of the internal configuration of a computer system according to an embodiment.

For example, a TSN-based deep learning apparatus for hand gesture recognition according to embodiments of the present invention may be implemented through the computer system (device) 400 of FIG. 4. As shown in FIG. 4, the computer system 400 may include, as components for executing the TSN-based deep learning method for hand gesture recognition, a processor 410, a memory 420, a persistent storage device 430, a bus 440, an input/output interface 450, and a network interface 460.

The processor 410 may include, or be part of, any device capable of processing any sequence of instructions. The processor 410 may include, for example, a computer processor, a processor in a mobile device or other electronic device, and/or a digital processor. The processor 410 may be included in, for example, a server computing device, a server computer, a series of server computers, a server farm, a cloud computer, a content platform, a mobile computing device, a smartphone, a tablet, a set-top box, or a media player. The processor 410 may be connected to the memory 420 through the bus 440.

The memory 420 may include volatile, persistent, virtual, or other memory for storing information used by or output by the computer system 400. The memory 420 may include, for example, random access memory (RAM) and/or dynamic RAM (DRAM). The memory 420 may be used to store any information, such as state information of the computer system 400. The memory 420 may also be used to store instructions of the computer system 400, including, for example, instructions for TSN-based deep learning for hand gesture recognition. The computer system 400 may include one or more processors 410 as needed or appropriate.

The bus 440 may include a communication infrastructure that enables interaction between the various components of the computer system 400. The bus 440 may carry data between components of the computer system 400, for example between the processor 410 and the memory 420. The bus 440 may include wireless and/or wired communication media between components of the computer system 400 and may include parallel, serial, or other topological arrangements.

The persistent storage device 430 may include components, such as memory or another persistent storage device, used by the computer system 400 to store data for an extended period (for example, compared to the memory 420). The persistent storage device 430 may include non-volatile main memory as used by the processor 410 in the computer system 400. The persistent storage device 430 may include, for example, flash memory, a hard disk, an optical disk, or another computer-readable medium.

The input/output interface 450 may include interfaces for a keyboard, a mouse, voice command input, a display, or another input or output device. Configuration commands and/or information for TSN-based deep learning for hand gesture recognition may be received through the input/output interface 450.

The network interface 460 may include one or more interfaces to networks such as a local area network or the Internet. The network interface 460 may include interfaces for wired or wireless connections. Configuration commands and/or information for TSN-based deep learning for hand gesture recognition may be received through the network interface 460.

In other embodiments, the computer system 400 may include more components than those shown in FIG. 4. However, most conventional components need not be shown explicitly. For example, the computer system 400 may be implemented to include at least some of the input/output devices connected to the input/output interface 450 described above, or may further include other components such as a transceiver, a GPS (Global Positioning System) module, a camera, various sensors, and a database.

FIG. 5 is a block diagram showing a TSN-based deep learning apparatus for hand gesture recognition according to an embodiment.

FIG. 5 shows an example of the components that the processor 410 of the computer system 400 of FIG. 4 may include. Here, the processor 410 of the computer system 400 may include a TSN-based deep learning apparatus 500 for hand gesture recognition according to an embodiment. The TSN-based deep learning apparatus 500 may include an image zoning unit 510, an image deep learning unit 520, an image integration unit 530, and a hand gesture recognition unit 540. The image zoning unit 510 may include a first zoning unit 511 and a second zoning unit 512.

The processor 410 and its components may perform the steps (S110 to S140) of the TSN-based deep learning method for hand gesture recognition shown in FIG. 6. For example, the processor 410 and its components may be implemented to execute instructions according to the code of the operating system contained in the memory 420 and at least one program code. Here, the at least one program code may correspond to the code of a program implemented to carry out the TSN-based deep learning method for hand gesture recognition.

The steps of the TSN-based deep learning method for hand gesture recognition need not occur in the order shown; some steps may be omitted, and additional steps may be included.

FIG. 6 is a flowchart showing a TSN-based deep learning method for hand gesture recognition according to an embodiment.

Referring to FIG. 6, a deep learning method based on TSN (Temporal Segment Networks) for real-time hand gesture recognition in video, performed by a computer device according to an embodiment, may include: zoning an entire video according to time-series images or spatial images in a TSN for hand gesture recognition (S110); performing deep learning on the time-series images and deep learning on the spatial images, respectively (S120); integrating the time-series images on which deep learning has been performed and integrating the spatial images on which deep learning has been performed (S130); and performing final classification using the integrated time-series images and the integrated spatial images (S140).

Here, the zoning step (S110) may include dividing the video into a plurality of short snippets, and zoning each of the short snippets according to time-series images or spatial images.

Each step of the TSN-based deep learning method for hand gesture recognition according to an embodiment is described below.

The TSN-based deep learning method for hand gesture recognition according to an embodiment is described in more detail using the TSN-based deep learning apparatus for hand gesture recognition according to an embodiment as an example.

In step S110, the image zoning unit 510 may zone the entire video according to time-series images or spatial images in the TSN for hand gesture recognition.

Here, the image zoning unit 510 may include a first zoning unit 511 and a second zoning unit 512. More specifically, the first zoning unit 511 may divide the video into a plurality of short snippets. The second zoning unit 512 may then zone each of the short snippets according to time-series images or spatial images.
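The work of the first zoning unit, dividing a video into sections and taking a short snippet from each, might look like the following sketch. It operates on frame indices only. The function names, the equal-length split, and the uniform random start position are assumptions made for illustration. The second zoning step (time-series versus spatial images) is not modeled here.

```python
import random
from typing import List


def split_into_segments(num_frames: int, num_segments: int) -> List[range]:
    """Split frame indices 0..num_frames-1 into near-equal contiguous ranges."""
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def sample_snippets(num_frames: int, num_segments: int, snippet_len: int,
                    rng: random.Random) -> List[List[int]]:
    """Pick one run of consecutive frame indices at random inside each segment."""
    snippets = []
    for segment in split_into_segments(num_frames, num_segments):
        # Latest start that still keeps the snippet inside the segment.
        last_start = max(segment.start, segment.stop - snippet_len)
        start = rng.randint(segment.start, last_start)
        stop = min(start + snippet_len, segment.stop)
        snippets.append(list(range(start, stop)))
    return snippets


# Example: a 48-frame clip, two segments, one 8-frame snippet per segment.
snippets = sample_snippets(48, 2, 8, random.Random(0))
```

Passing a seeded `random.Random` keeps the sampling reproducible, which helps when comparing runs.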

In step S120, the image deep learning unit 520 may perform deep learning on the time-series images and deep learning on the spatial images, respectively.

In step S130, the image integration unit 530 may integrate the time-series images on which deep learning has been performed and integrate the spatial images on which deep learning has been performed.

In step S140, the hand gesture recognition unit 540 may perform final classification using the integrated time-series images and the integrated spatial images. In doing so, the hand gesture recognition unit 540 may use the integrated time-series images, the integrated spatial images, and pre-stored information to classify whether the video contains a hand gesture.
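One way to picture the final classification in step S140 is as a fusion of the two integrated score vectors followed by a lookup in a stored label list. The sketch below is only an assumption about how such a step could be arranged. The weighted average, the `temporal_weight` parameter, and the label names are all hypothetical and do not come from the patent.

```python
from typing import List, Sequence


def fuse_streams(spatial: Sequence[float], temporal: Sequence[float],
                 temporal_weight: float = 0.5) -> List[float]:
    """Weighted average of the integrated spatial and time-series scores."""
    w = temporal_weight
    return [(1.0 - w) * s + w * t for s, t in zip(spatial, temporal)]


def classify(scores: Sequence[float], labels: Sequence[str]) -> str:
    """Return the stored label whose fused score is highest."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]


# Hypothetical pre-stored label set and integrated per-stream scores.
labels = ["no_gesture", "swipe_left", "swipe_right"]
fused = fuse_streams([0.2, 0.7, 0.1], [0.1, 0.3, 0.6])
prediction = classify(fused, labels)
```

Including an explicit "no gesture" class is one simple way to express whether the video contains a hand gesture at all.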

As described above, according to the embodiments, hand gestures can be recognized in real time by recognizing a motion formed by multiple connected 2D actions in a three-dimensional video and performing deep learning after temporal or spatial zoning.
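Recognizing gestures per time interval, rather than once per whole video, is often arranged as a sliding window over the incoming frames. The following sketch shows that pattern under stated assumptions: the window and stride values are arbitrary, and `classify_window` is a hypothetical callback standing in for the full zoning, deep learning, integration, and classification pipeline.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Tuple, TypeVar

Frame = TypeVar("Frame")
Result = TypeVar("Result")


def recognize_stream(frames: Iterable[Frame],
                     classify_window: Callable[[List[Frame]], Result],
                     window: int = 16,
                     stride: int = 8) -> Iterator[Tuple[int, Result]]:
    """Yield (last_frame_index, result) once per `stride` frames after warm-up."""
    buffer: Deque[Frame] = deque(maxlen=window)  # keeps only the newest frames
    for index, frame in enumerate(frames):
        buffer.append(frame)
        window_is_full = len(buffer) == window
        if window_is_full and (index + 1 - window) % stride == 0:
            yield index, classify_window(list(buffer))


# Toy run: integers stand in for frames, and the "classifier" just reports
# the first frame of each window.
results = list(recognize_stream(range(32), lambda w: w[0]))
```

A smaller stride gives more frequent outputs at the cost of more classifier calls per second.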

The apparatus described above may be implemented with hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an ALU (arithmetic logic unit), a digital signal processor, a microcomputer, an FPA (field programmable array), a PLU (programmable logic unit), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described as being used, but those with ordinary knowledge in the art will recognize that a processing device may include multiple processing elements and/or multiple types of processing elements. For example, a processing device may include multiple processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited number of embodiments and drawings, those skilled in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims set out below.

Claims

A TSN (Temporal Segment Networks)-based deep learning method for real-time hand gesture recognition in video, performed by a computer device, the method comprising:
zoning an entire video according to time series images or spatial images in a TSN for hand gesture recognition;
performing deep learning on the time series images and deep learning on the spatial images, respectively;
integrating the time series images on which deep learning has been performed, and integrating the spatial images on which deep learning has been performed; and
performing final classification using the integrated time series images and the integrated spatial images,
wherein the zoning step comprises:
dividing the video into a plurality of short snippets; and
zoning each of the short snippets according to a time series image or a spatial image,
wherein, in the step of performing deep learning on the time series images and deep learning on the spatial images, respectively,
all filters and pooling kernels are expanded to give them a time dimension, so that each frame, which is a two-dimensional image, is converted into a three-dimensional image in the deep learning process, and
hand gestures are recognized through a two-stream hand gesture recognition network based on two expanded 3D ResNets, wherein a short snippet is selected from each segment in each frame, and the class scores of the different short snippets are fused to generate and integrate the time series images and the spatial images.
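
For illustration only, and not as part of the claims, the snippet selection and class-score fusion recited above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the use of a plain average as the way snippet scores are fused, and the weighted sum that combines the spatial and time series streams are all hypothetical choices, since the claim only states that the class scores of different short snippets are fused.

```python
import math
import random


def sample_snippet_indices(num_frames, num_segments, rng=None):
    """Split [0, num_frames) into equal segments and pick one frame index per segment."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))
        if rng is not None:
            # Random position inside the segment (typical during training).
            idx = rng.randrange(start, end)
        else:
            # Deterministic centre of the segment (typical at inference).
            idx = (start + end - 1) // 2
        indices.append(idx)
    return indices


def average_scores(snippet_scores):
    """Assumed fusion rule: average the class scores over all snippets."""
    n = len(snippet_scores)
    return [sum(col) / n for col in zip(*snippet_scores)]


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_streams(spatial_scores, temporal_scores, temporal_weight=0.5):
    """Fuse per-snippet scores of both streams and return (label, probabilities)."""
    spatial = average_scores(spatial_scores)
    temporal = average_scores(temporal_scores)
    # Assumed stream combination: a weighted sum of the two fused score vectors.
    fused = [(1.0 - temporal_weight) * s + temporal_weight * t
             for s, t in zip(spatial, temporal)]
    probs = softmax(fused)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs
```

With 16 frames and 4 segments, `sample_snippet_indices(16, 4)` picks one index from each group of four consecutive frames, and `fuse_two_streams` then turns the per-snippet scores of the two streams into a single predicted class.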

(Deleted)

The TSN-based deep learning method for hand gesture recognition according to claim 1,
wherein the final classification step comprises
classifying, by means of pre-stored information, whether the video contains a hand gesture, using the integrated time series images and the integrated spatial images.

A TSN (Temporal Segment Networks)-based deep learning device for real-time hand gesture recognition in video, the device comprising:
an image zoning unit that zones an entire video according to time series images or spatial images in a TSN for hand gesture recognition;
an image deep learning unit that performs deep learning on the time series images and deep learning on the spatial images, respectively;
an image integration unit that integrates the time series images on which deep learning has been performed and integrates the spatial images on which deep learning has been performed; and
a hand gesture recognition unit that performs final classification using the integrated time series images and the integrated spatial images,
wherein the image zoning unit comprises:
a first zoning unit that divides the video into a plurality of short snippets; and
a second zoning unit that zones each of the short snippets according to a time series image or a spatial image,
wherein, in the image deep learning unit,
all filters and pooling kernels are expanded to give them a time dimension, so that each frame, which is a two-dimensional image, is converted into a three-dimensional image in the deep learning process, and
hand gestures are recognized through a two-stream hand gesture recognition network based on two expanded 3D ResNets, wherein a short snippet is selected from each segment in each frame, and the class scores of the different short snippets are fused to generate and integrate the time series images and the spatial images.
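
For illustration only, and not as part of the claims, the expansion of filters and pooling kernels with a time dimension can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, kernels are plain nested lists rather than tensors, and dividing each copy of the weights by the temporal depth is a common initialisation borrowed from inflated 3D convolutional networks that the claims themselves do not state.

```python
def inflate_kernel_2d(kernel_2d, time_depth):
    """Repeat an H x W kernel along a new leading time axis of length time_depth.

    Assumption: each copy is scaled by 1 / time_depth, so a clip made of one
    repeated frame yields the same response as the 2D kernel on that frame.
    """
    if time_depth < 1:
        raise ValueError("time_depth must be at least 1")
    return [[[w / time_depth for w in row] for row in kernel_2d]
            for _ in range(time_depth)]


def inflate_pool_size(pool_hw, time_depth):
    """Turn a 2D pooling window (H, W) into a 3D window (T, H, W)."""
    return (time_depth, pool_hw[0], pool_hw[1])


def response_2d(patch, kernel_2d):
    """Response of a 2D kernel on a single patch of the same size."""
    return sum(p * w
               for patch_row, kernel_row in zip(patch, kernel_2d)
               for p, w in zip(patch_row, kernel_row))


def response_3d(clip, kernel_3d):
    """Response of a 3D kernel on a clip given as a list of same-sized patches."""
    return sum(response_2d(frame, k2d) for frame, k2d in zip(clip, kernel_3d))
```

Under the assumed scaling, a clip built by repeating one frame `time_depth` times produces the same value from `response_3d` as `response_2d` does on the single frame, which is why weights trained on 2D images can serve as a starting point for the expanded 3D network.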

(Deleted)