KR20210157271A

KR20210157271A - Dynamic projection mapping interactive gesture recognition media content production method and apparatus

Info

Publication number: KR20210157271A
Application number: KR1020200097369A
Authority: KR
Inventors: 최유주; 김태원; 고유진; 윤현주
Original assignee: 서울미디어대학원대학교 산학협력단; 윤현주
Priority date: 2020-06-19
Filing date: 2020-08-04
Publication date: 2021-12-28
Also published as: KR102349002B1

Abstract

According to an embodiment of the present invention, a method for creating gesture recognition interactive media content in connection with dynamic projection mapping, which is performed by a device, comprises the steps of: (a) reading a configuration file matched with a predetermined gesture with respect to a media particle effect, and generating a motion history image based on a gesture image collected from a camera that photographs the gesture of a user; (b) inputting the motion history image to a gesture inference model, and determining a gesture type of the current state of the user based on the gesture type information output from the gesture inference model; and (c) performing projection mapping on the media particle effect corresponding to each gesture type determined according to changes in the gesture of the user after tracking the location of the user, based on the spatial mapping information between the camera and a projector. The gesture inference model is trained through a convolutional neural network (CNN). Therefore, the method enables a user with no programming experience to easily create interactive media content.

Description

DYNAMIC PROJECTION MAPPING INTERACTIVE GESTURE RECOGNITION MEDIA CONTENT PRODUCTION METHOD AND APPARATUS

The present invention relates to a method and apparatus for producing gesture recognition interactive media content linked to dynamic projection mapping.

Because projection mapping can create a virtual stage space on the surface of the projection target, it allows a story to unfold through rapid scene changes. In traditional theater performances and in the staging of various large events, modern technology has become a key element in producing such content. Compared with conventional stage production, projection mapping lets the creator direct the stage freely as intended and is subject to fewer physical and economic constraints. Staging that can switch quickly with the story, without intermissions, is one of the factors that strengthens the audience's interest and immersion. Accordingly, convergence content using mapping is spreading across fields such as public-space displays, exhibition spaces, performance stages, and fashion shows, as well as marketing applications such as digital signage.

In particular, as a representative type of interactive media content, there have been continued attempts to produce “gesture-based dynamic projection mapping content,” which tracks the position and motion of objects or people in real time and generates matching effects.

However, producing dynamic projection mapping content requires extensive advance preparation and is difficult to implement. In addition, existing paid commercial projection mapping tools do not support gesture-based dynamic projection mapping, and developers cannot modify those tools themselves. There is therefore a demand for an “interactive content creation framework” that allows media effects to be processed efficiently in response to a performer's motion without a complicated programming process.

To solve the above problems, an object of the present invention is to provide a method and apparatus for producing gesture recognition interactive media content linked to dynamic projection mapping, which allow a user with no programming experience to easily produce interactive media content that responds to a user's gestures.

However, the technical problems to be solved by the present embodiment are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical object, a method for producing gesture recognition interactive media content linked to dynamic projection mapping, performed by an apparatus according to the present invention, comprises the steps of: (a) reading a configuration file in which predetermined gestures are matched in advance with media particle effects, and generating a motion history image based on gesture images collected from a camera that photographs a user's gesture; (b) inputting the motion history image to a gesture inference model, and determining the gesture type of the user's current state based on gesture type information output from the gesture inference model; and (c) based on spatial mapping information between the camera and a projector, tracking the user's location and projection mapping the media particle effect corresponding to each gesture type determined as the user's gesture changes, wherein the gesture inference model is trained through a convolutional neural network (CNN).

In step (a), the motion history image is generated by sequentially accumulating consecutive frame images of the gesture image, from a predetermined number of frames before the current frame up to the current frame.

When the motion history image is generated, the user's silhouette region is extracted from each frame image and the extracted silhouette information is stored in a circular queue. The silhouette information is then read sequentially from the circular queue, from the past to the present, and compressed into a single grayscale image. Each frame image consists of a silhouette region filled with a pixel value matched to its time stamp and a remaining background region filled with black.

In step (b), a gesture queue is built from the gesture type information output from the gesture inference model, and per-frame noise is removed using the gesture queue.

An apparatus for producing gesture recognition interactive media content linked to dynamic projection mapping includes: a memory storing a program implementing the method for producing gesture recognition interactive media content linked to dynamic projection mapping; and a processor that executes the program stored in the memory. By executing the program, the processor reads a configuration file in which predetermined gestures are matched in advance with media particle effects, generates a motion history image based on gesture images collected from a camera that photographs a user's gesture, inputs the motion history image to a gesture inference model, determines the gesture type of the user's current state based on gesture type information output from the gesture inference model, and, based on spatial mapping information between the camera and a projector, tracks the user's location and projection maps the media particle effect corresponding to each gesture type determined as the user's gesture changes. The gesture inference model is trained through a convolutional neural network (CNN).

The processor generates the motion history image by sequentially accumulating consecutive frame images of the gesture image, from a predetermined number of frames before the current frame up to the current frame.

When generating the motion history image, the processor extracts the user's silhouette region from each frame image and stores the extracted silhouette information in a circular queue. It then reads the silhouette information sequentially from the circular queue, from the past to the present, and compresses it into a single grayscale image. Each frame image consists of a silhouette region filled with a pixel value matched to its time stamp and a remaining background region filled with black.

When determining the gesture type, the processor builds a gesture queue from the gesture type information output from the gesture inference model and removes per-frame noise using the gesture queue.

A computer-readable recording medium stores a computer program for performing the method for producing gesture recognition interactive media content linked to dynamic projection mapping.

The present invention is intended to solve the problems of the prior art described above, and can provide a content creation framework that allows a user with no programming experience to easily produce interactive media content that responds to a user's gestures.

In addition, the user represents gestures and media effects as numbers and defines each gesture and its corresponding media effect in a text-based configuration file, and can thereby produce interactive media content that responds to user gestures without a separate programming process.
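The text-based configuration file described above could be read along the following lines. This is only a minimal sketch: the one-pair-per-line layout, the `#` comment syntax, the numeric IDs, and the name `parse_gesture_config` are all assumptions made for illustration, since the specification states only that gestures and effects are expressed as numbers in a text file.

```python
from typing import Dict

def parse_gesture_config(text: str) -> Dict[int, int]:
    """Parse lines of the (hypothetical) form '<gesture_id> <effect_id>'."""
    mapping: Dict[int, int] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # skip blank and comment-only lines
        gesture_id, effect_id = (int(tok) for tok in line.split())
        mapping[gesture_id] = effect_id
    return mapping

# Hypothetical file content: the numbering scheme is illustrative only.
EXAMPLE_CONFIG = """
# gesture_id effect_id
0 3   # gesture 0 triggers particle effect 3
1 1
2 2
"""

if __name__ == "__main__":
    print(parse_gesture_config(EXAMPLE_CONFIG))
```

With a mapping of this kind, changing which effect a gesture triggers only requires editing one line of text, which is the property the description emphasizes.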

FIG. 1 illustrates the configuration of a system for producing gesture recognition interactive media content linked to dynamic projection mapping according to an embodiment of the present invention.
FIG. 2 is a detailed structural diagram of the proposed interactive media content production system according to an embodiment of the present invention.
FIG. 3 illustrates an algorithm that removes noise using a gesture queue and finally determines the gesture type according to an embodiment of the present invention.
FIG. 4 illustrates a sequence of motion history images for the 'kick - left' motion according to an embodiment of the present invention.
FIG. 5 illustrates a 3-layer convolutional neural network including two fully connected layers according to an embodiment of the present invention.
FIG. 6 illustrates a 5-layer convolutional neural network including one fully connected layer according to an embodiment of the present invention.
FIG. 7 illustrates motion history images for each target gesture according to an embodiment of the present invention.
FIG. 8 illustrates the interaction particle effect for each target gesture according to an embodiment of the present invention.
FIG. 9 illustrates motion history images of erroneous motions that were classified into three types and used for training according to an embodiment of the present invention.
FIG. 10 is a graph of the experimental results for [Table 4] - 8 filters.
FIG. 11 is a graph of the experimental results for [Table 5] - condition3.
FIG. 12 illustrates a screen in which media effects are applied according to gestures according to an embodiment of the present invention.
FIG. 13 is a flowchart illustrating a method for producing gesture recognition interactive media content linked to dynamic projection mapping according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily implement them. However, the present invention may be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the present invention clearly, and similar reference numerals denote similar parts throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed between them.

Throughout this specification, when a member is said to be located “on” another member, this includes not only the case where the member is in contact with the other member but also the case where a further member is present between the two.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. The terms of degree "about", "substantially", and the like, as used throughout this specification, are used to mean at or near a stated numerical value when manufacturing and material tolerances inherent to the stated meaning are presented, and are used to prevent an unscrupulous infringer from unfairly exploiting disclosure in which exact or absolute figures are stated to aid understanding of the present invention. The terms "step of (doing)" and "step of", as used throughout this specification, do not mean "step for".

The foregoing description of the present invention is illustrative, and those of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

Turning to existing deep-learning-based approaches to gesture recognition, gesture recognition is one of the motion recognition technologies that interact with a computer by recognizing the movement of the user's body through various sensors, such as infrared or image sensors. Gesture recognition requires a sensor that can capture the gesture performed by the user, and sensing falls into two types: a contact method, in which data are obtained while the user's body directly touches a sensor or device, and a non-contact method, in which data are obtained using long-range and short-range sensors. Recently, as collecting large-scale data sets has become easier and high-performance GPU computing has become commonplace, research on recognizing gestures with deep learning has continued.

[Table 1] Deep learning approaches for gesture and motion recognition

For example, deep learning approaches to gesture recognition fall broadly into three categories, as shown in [Table 1]. The first approach uses three-dimensional filters in a convolutional neural network to extract discriminative features in both the spatial and temporal dimensions. The second approach precomputes motion features, such as a two-dimensional optical flow map, and uses them as input to the network. The third approach consists of temporal methods that exploit sequence data, for example by combining a recurrent neural network (RNN) with long short-term memory (LSTM).

Turning to existing interactive media art based on dynamic projection mapping, projection mapping is an immersive-interface media art technique that projects content imagery fitted to the surface of a real building or object. In the field of media art, projection-based spatial augmented reality is referred to as projection mapping. Projection mapping is used to map colorful imagery onto large buildings, as in a media façade, or as a device for raising the interest and immersion of audiences or customers in exhibitions, performances, and marketing. The projection mapping typically seen in the performing arts expresses visual augmentation by mapping the indoor environment of the venue or the surfaces of objects in use, such as stage sets or props used in the production. Recently, the technology has advanced to the point where, beyond 3D and 4D, even a 360-degree dome can be used as a screen, and its potential uses are growing further as it is combined with VR/AR. Among these, dynamic projection mapping is a representative type of interactive media content. It provides a variety of interaction effects by applying projection to dynamic objects whose position or shape changes on stage. The strength of dynamic projection mapping is that it can heighten the audience's immersion, because the mapped imagery changes with the movement of the object.

FIG. 1 illustrates the configuration of a system for producing gesture recognition interactive media content linked to dynamic projection mapping according to an embodiment of the present invention.

Referring to FIG. 1, the present invention includes an apparatus 100 for producing gesture recognition interactive media content linked to dynamic projection mapping, a camera 10, and a projector 20.

Hereinafter, for convenience of description, the apparatus 100 for producing gesture recognition interactive media content linked to dynamic projection mapping according to an embodiment of the present invention is referred to simply as the 'media content production apparatus 100'.

The media content production apparatus 100 may include a memory 110, a communication module 120, a processor 130 that executes a program (or application), and a display unit 140. The processor 130 may perform various functions by executing the program stored in the memory 110, and may include detailed modules corresponding to each function.

The communication module 120 handles data communication with the camera 10 and with the projector 20. Working with a communication network, the communication module 120 provides the communication interface needed to deliver signals exchanged with the camera 10 and the projector 20 in the form of packet data. The communication module 120 may be a device that includes the hardware and software needed to transmit and receive signals, such as control signals or data signals, over a wired or wireless connection with other network devices.

The memory 110 stores a program implementing the method for producing gesture recognition interactive media content linked to dynamic projection mapping, which is used to provide the corresponding content production service, and the program stored in the memory 110 may be run by the processor 130.

The memory 110 also temporarily or permanently stores data processed by the processor 130. The memory 110 may include volatile storage media or non-volatile storage media, but the scope of the present invention is not limited thereto.

The memory 110 may store separate programs, such as an operating system for the processing and control performed by the processor 130, and may also temporarily store input or output data.

The memory 110 may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card-type memory (for example, SD or XD memory), RAM, and ROM.

The processor 130 executes the program stored in the memory 110 and performs the processing corresponding to each operation of the method program described below, working with the camera 10 and the projector 20.

To this end, the processor 130 may be implemented with at least one processing unit (CPU, microprocessor, DSP, etc.), random access memory (RAM), read-only memory (ROM), and the like, and may load the program stored in the memory 110 into the RAM and execute it through the at least one processing unit. Depending on the embodiment, the term 'processor' may be interpreted as having the same meaning as terms such as 'controller', 'arithmetic unit', and 'control unit'.

The processor 130 reads a configuration file in which predetermined gestures are matched in advance with media particle effects, generates a motion history image based on gesture images collected from the camera 10 that photographs the user's gesture, inputs the motion history image to a gesture inference model, determines the gesture type of the user's current state based on gesture type information output from the gesture inference model, and, based on spatial mapping information between the camera 10 and the projector 20, tracks the user's location and projection maps the media particle effect corresponding to each gesture type determined as the user's gesture changes. The gesture inference model may be trained through a convolutional neural network (CNN).

By way of example, the predetermined gestures include 'gathering energy', 'palm blast - left', 'palm blast - right', 'kick - left', and 'kick - right'. Each gesture is described in detail below with reference to FIG. 7.

The processor 130 may generate the motion history image by sequentially accumulating consecutive frame images of the gesture image, from a predetermined number of frames before the current frame up to the current frame.

When generating the motion history image, the processor 130 may extract the user's silhouette region from each frame image, store the extracted silhouette information in a circular queue, and then read the silhouette information sequentially from the circular queue, from the past to the present, and compress it into a single grayscale image. Each frame image may consist of a silhouette region filled with a pixel value matched to its time stamp and a remaining background region filled with black.
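The accumulation described above can be sketched roughly as follows. This is an illustrative sketch, not the patented implementation: it assumes silhouettes arrive as boolean masks, uses `collections.deque` with `maxlen` as the circular queue, and assumes a linear mapping from frame age to gray level in which newer silhouettes overwrite older ones. The class name `MotionHistory` and the default window of 30 frames (the value the description uses elsewhere) are choices made for this example.

```python
from collections import deque

import numpy as np

class MotionHistory:
    """Accumulates binary silhouettes into one grayscale motion history image."""

    def __init__(self, window: int = 30):
        # A deque with maxlen drops the oldest entry automatically,
        # so it behaves like a fixed-size circular queue.
        self.silhouettes = deque(maxlen=window)

    def push(self, silhouette) -> None:
        """Store the silhouette mask (True = user, False = background)."""
        self.silhouettes.append(np.asarray(silhouette, dtype=bool))

    def image(self) -> np.ndarray:
        """Compress the queued silhouettes, oldest to newest, into one image."""
        if not self.silhouettes:
            raise ValueError("no silhouettes have been pushed yet")
        mhi = np.zeros(self.silhouettes[0].shape, dtype=np.uint8)  # black background
        n = len(self.silhouettes)
        for i, mask in enumerate(self.silhouettes, start=1):
            # Assumed time-stamp mapping: the newest frame gets 255,
            # older frames get proportionally darker gray levels.
            mhi[mask] = (255 * i) // n
        return mhi

if __name__ == "__main__":
    history = MotionHistory(window=3)
    for col in range(4):  # a one-pixel "user" moving left to right
        frame = np.zeros((1, 4), dtype=bool)
        frame[0, col] = True
        history.push(frame)
    print(history.image())
```

Because the queue holds only the most recent frames, the oldest silhouette in the demo has already been evicted and its pixel stays black, while the remaining positions brighten toward the present.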

When determining the gesture type, the processor 130 may build a gesture queue from the gesture type information output from the gesture inference model and remove per-frame noise using the gesture queue.
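One plausible way to use such a gesture queue is a sliding majority vote over the most recent per-frame predictions. The sketch below is an assumption for illustration only: the actual decision rule is the one shown in FIG. 3, which is not reproduced in this text, and the queue size, the vote threshold, and the `idle` label are hypothetical parameters.

```python
from collections import Counter, deque

class GestureQueue:
    """Smooths per-frame gesture labels with a sliding majority vote."""

    def __init__(self, size: int = 5, min_votes: int = 3, idle: int = 0):
        self.labels = deque(maxlen=size)  # most recent per-frame predictions
        self.min_votes = min_votes        # votes needed to switch gesture
        self.current = idle               # gesture reported until a vote wins

    def update(self, label: int) -> int:
        """Add one per-frame prediction and return the smoothed gesture type."""
        self.labels.append(label)
        winner, votes = Counter(self.labels).most_common(1)[0]
        if votes >= self.min_votes:
            self.current = winner  # change state only on a clear majority
        return self.current

if __name__ == "__main__":
    queue = GestureQueue(size=5, min_votes=3, idle=0)
    raw = [1, 1, 4, 1, 1, 2, 2, 2]  # the single 4 is a spurious prediction
    print([queue.update(label) for label in raw])
```

In the demo sequence the isolated misprediction never reaches the output, at the cost of a few frames of delay before a genuine gesture change is reported.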

Hereinafter, a method for producing gesture recognition interactive media content linked to dynamic projection mapping according to an embodiment of the present invention is described with reference to FIG. 2.

FIG. 2 is a detailed structural diagram of the proposed interactive media content production system according to an embodiment of the present invention.

By way of example, referring to FIG. 2, the media content production apparatus 100 may be implemented in C++ and Python-based TensorFlow.

The processor 130 may include a first module 210, a second module 220, a third module 230, a fourth module 240, and a fifth module 250, each performing its own function. The first module 210 reads the associations between designated gestures and media content effects from the configuration file (configuration file reader) and generates a motion history image (MHI) from the camera images (user gesture image data) input in real time. It may then act as the main control class (ofApp), which manages state transitions according to the recognized user gesture type and passes them to the third module 230.

The second module 220 may serve as a class (Recognizer) that receives the motion history image from the first module 210 and passes it to the TensorFlow gesture inference model, receives the gesture type inferred for the current frame from the model, builds a gesture queue, and makes the final decision on the current gesture type according to the contents of the queue.

The third module 230 may serve as a particle system class (Particle System) that receives from the first module 210 the current state resulting from gesture changes and the camera-projector spatial mapping information needed to projection-map media effects dynamically onto the user's location, and renders the appropriate particle effect.

The fourth module 240 is a convolutional neural network model (Gesture Training) used in a preprocessing stage before real-time gesture recognition: it reads the motion history image dataset collected for each gesture, performs training, and generates a PB (Protobuf) file for gesture recognition.

Finally, the fifth module 250 is a TensorFlow module (Gesture Inference) that, based on the generated PB file, receives motion history images in real time and infers the matching gesture type.

The processor 130 may generate a single motion history image (MHI) by processing the frame images from 30 frames before the current frame up to the current frame.

Here, the motion history image may be generated by extracting the user's silhouette from each frame image (user gesture image data) (body tracking unit), filling the silhouette region with a pixel value matched to the current timestamp and the background region with black, and accumulating the result onto the motion history image completed up to the current frame. However, gesture recognition performed by a convolutional neural network on a motion history condensed into a single image does not achieve a 100% recognition rate.

Accordingly, to perform stable gesture recognition on real-time video input, the processor 130 does not immediately adopt the result obtained from the TensorFlow inference model as the current gesture type. Instead, it builds a gesture queue and filters out frames recognized as noise (error cases). The gesture queue processing may be repeated each time the session returns a result.

FIG. 3 illustrates an algorithm that removes noise using a gesture queue and makes the final decision on the gesture type, according to an embodiment of the present invention.

Algorithm 1 shown in FIG. 3 removes noise using the gesture queue and makes the final decision on the gesture type.

First, in line 1, the gesture type returned by the TensorFlow module is pushed onto the gesture queue. As lines 2 and 11 show, the return value depends on the size of the queue: while the queue holds fewer than 4 elements, a value unrelated to any gesture is returned and no gesture type is provided; once the queue size reaches 4, the algorithm searches for the gesture type that was returned most often. An int array arr whose length equals the number of target gesture types is declared and all of its values are initialized to 0. Lines 4 and 5 count the occurrences of each gesture type in the gesture queue. The variable i used in the for statement in lines 6 to 9 is the index of a target gesture. Line 7 tests whether arr[i] accounts for a majority of the gesture queue. In that case, lines 8 and 9 pop one element from the gesture queue and return gesture type i, the type that holds the majority. Reaching line 10 means that no target gesture type held a majority in the gesture queue, so one element is popped from the gesture queue and a value meaning 'no gesture could be determined for the current frame' is returned.
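The queue-based filtering described above can be sketched in Python as follows. This is an illustrative reconstruction from the prose, not the listing of FIG. 3: the class name `GestureQueue`, the sentinel `NO_GESTURE`, and the strict "more than half" test for a majority are assumptions.

```python
from collections import deque

NO_GESTURE = -1   # assumed sentinel meaning "no gesture determined for this frame"
QUEUE_SIZE = 4    # queue length at which voting starts, per the description


class GestureQueue:
    """Majority vote over the most recent per-frame inference results."""

    def __init__(self, num_types, size=QUEUE_SIZE):
        self.num_types = num_types
        self.size = size
        self.queue = deque()

    def update(self, inferred):
        # Push the gesture type inferred for the current frame.
        self.queue.append(inferred)
        # Until the queue is full, report that no gesture is available.
        if len(self.queue) < self.size:
            return NO_GESTURE
        # Count how often each gesture type occurs in the queue.
        counts = [0] * self.num_types
        for gesture in self.queue:
            counts[gesture] += 1
        # Look for a type that holds a strict majority of the queue.
        winner = NO_GESTURE
        for i, count in enumerate(counts):
            if 2 * count > len(self.queue):
                winner = i
                break
        # Drop the oldest element whether or not a majority was found.
        self.queue.popleft()
        return winner


recognizer = GestureQueue(num_types=8)
results = [recognizer.update(g) for g in [2, 2, 5, 2, 2, 1, 3, 4]]
print(results)
```

In this sketch a single stray result (the 5 in the input above) does not change the reported gesture, while a run of disagreeing results yields `NO_GESTURE` until one type dominates the queue again.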

The interactive media content that links gestures to media effects may be defined in a plain-text configuration file.

For example, the dynamic media effect matched to each motion may be built on openFrameworks. openFrameworks is an open-source C++ toolkit that supports content creation through an intuitive framework. [Table 2] shows the rules the configuration file uses to define the matching between each gesture and its particle effect, including the total number of gestures, the number of particle effects, and the directory in which the results of the determined gestures are stored.

[Table 2] Example of configuration file rules for matching gestures and particle effects

There are four particle effects in total. Particle effect 1 is not matched to a gesture; it is mapped according to whether a user is present in front of the camera.
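As an illustration of how such a text configuration file might be read, the sketch below parses a hypothetical line-oriented format. The keys `num_gestures`, `num_effects`, `output_dir`, and `map`, and the gesture-to-effect numbers in the sample, are assumptions made for this example; the actual rules are those of [Table 2], which are not reproduced here.

```python
def parse_config(text):
    """Parse a hypothetical 'key value...' configuration format."""
    config = {"mapping": {}}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        key, *values = line.split()
        if key == "map":
            # 'map <gesture index> <particle effect number>'
            gesture, effect = values
            config["mapping"][int(gesture)] = int(effect)
        elif key in ("num_gestures", "num_effects"):
            config[key] = int(values[0])
        else:
            config[key] = " ".join(values)
    return config


# Hypothetical sample; effect 1 is left unmapped because the description
# reserves it for user presence rather than for a gesture.
SAMPLE = """\
# illustrative only, not the format of Table 2
num_gestures 5
num_effects 4
output_dir results
map 0 2
map 1 3
map 2 3
map 3 4
map 4 4
"""

config = parse_config(SAMPLE)
print(config["mapping"])
```

Keeping the mapping in a text file of this kind is what lets a creator change which effect follows which gesture without touching the program code.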

FIG. 4 shows a sequence of motion history images for the 'kick - left' motion according to an embodiment of the present invention.

Referring to FIG. 4, the present invention uses a motion history image as input data. A motion history image is a static image template in which the path of a movement or gesture can be seen at a glance as a silhouette image. Gesture recognition generally uses sequences of consecutive frame images, so it requires more memory and computation than object recognition on 2D images. FIG. 4 lists in order the motion history images of the 'kick' motion used in the training experiments of the present invention. A motion history image is generated in two steps. First, the user's silhouette is extracted from each frame, and the silhouette image is stored in a circular queue that holds the silhouette information of 30 frames. Second, the silhouette images are read from the circular queue in order from the oldest to the current one, the pixel values of the silhouette region are read, and they replace the values of the corresponding pixels in the motion history image. A motion sequence expressed with user silhouettes in this way compresses the motion information into a small grayscale image.
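The two-step procedure can be sketched in plain Python as below. The function name `make_mhi`, the representation of a silhouette as rows of 0/1 values, and the linear mapping from frame order to gray level are assumptions for illustration; the description only states that silhouette pixels take a value matched to the timestamp and that the background is black.

```python
from collections import deque

HISTORY = 30  # number of silhouette frames kept, per the description


def make_mhi(silhouettes, height, width):
    """Build one grayscale motion history image from silhouettes, oldest first."""
    mhi = [[0] * width for _ in range(height)]       # black background
    frames = list(silhouettes)
    total = len(frames)
    for k, mask in enumerate(frames):
        # Assumed timestamp-to-gray mapping: newer frames are brighter.
        value = 255 * (k + 1) // total
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    mhi[y][x] = value                # newer silhouettes overwrite older ones
    return mhi


# A deque with maxlen acts as the circular queue: the oldest frame is
# discarded automatically once 30 silhouettes are stored.
silhouette_queue = deque(maxlen=HISTORY)
for mask in ([[1, 0, 0, 0]], [[0, 1, 0, 0]], [[0, 1, 1, 0]]):
    silhouette_queue.append(mask)

print(make_mhi(silhouette_queue, height=1, width=4))
```

A real implementation would operate on camera-sized arrays with an array library rather than nested lists; the nested loops here only make the per-pixel rule explicit.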

FIG. 5 illustrates a convolutional neural network with three convolutional layers and two fully connected layers according to an embodiment of the present invention.

FIG. 6 illustrates a convolutional neural network with five convolutional layers and one fully connected layer according to an embodiment of the present invention.

For example, in designing the convolutional neural network for gesture recognition according to an embodiment of the present invention, gesture recognition training was carried out with a gesture training module implemented in Python-based TensorFlow. The experiments were run on Windows 10 with 64.0 GB of RAM and an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory, using Python 3.7. Two convolutional neural networks were tested: the model of FIG. 5, consisting of three convolutional layers and two fully connected layers, and the model of FIG. 6, consisting of five convolutional layers and one fully connected layer. For the 3-layer convolutional neural network, the number of filters in the convolutional layers is denoted 'X', and training was carried out in three versions that apply 32, 16, or 8 filters in every layer. A 5x5 filter was applied in all layers.

For the 5-layer convolutional neural network, a 2x2 stride was applied to every convolutional layer and the number of filters per layer was set to 16, 32, 64, 128, and 256 in order. FIG. 6 shows the 5-layer convolutional neural network structure applied to the proposed platform.
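To see why a 2x2 stride in every layer shortens runtime, the feature-map sizes can be worked out with a few lines of Python. The sketch assumes a 320x240 motion history image as input and 'same' padding, so that each layer's output size is the input size divided by the stride and rounded up; the padding mode is not stated in the description and is an assumption here.

```python
import math


def conv_output_shapes(height, width, filters, stride=2):
    """Return (height, width, channels) after each strided convolution."""
    shapes = []
    for channels in filters:
        height = math.ceil(height / stride)   # 'same' padding assumed
        width = math.ceil(width / stride)
        shapes.append((height, width, channels))
    return shapes


shapes = conv_output_shapes(240, 320, [16, 32, 64, 128, 256])
for layer, shape in enumerate(shapes, start=1):
    print(f"conv{layer}: {shape}")

last_h, last_w, last_c = shapes[-1]
print("inputs to the fully connected layer:", last_h * last_w * last_c)
```

Under these assumptions the spatial size is halved at every layer, so the fully connected layer sees an 8x10x256 volume (20,480 values) rather than anything close to the original 76,800 pixels per channel.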

The media content production apparatus 100 may include a display unit 140 that displays an interface for receiving video from the camera 10, generating the motion history image, and rendering the media particle effect.

As an example, in the media content production apparatus 100 according to an embodiment of the present invention, the module that receives video from the camera 10 and generates the motion history image and the overall interface that renders the media particle effects may be implemented in C++. The module that recognizes the gesture type matching a single motion history image is implemented in TensorFlow, and the Boost.Python library ('boost python') may be used to connect the C++ module and the TensorFlow module. Through this library, the media content production apparatus 100 may load the gesture training result stored in a pb file from C++ as a node graph and connect to it. The implementation may use Boost 1.69, Conda 4.6.15, Python 3.7.3, and openFrameworks 0.9.8.

Hereinafter, an application experiment of the media content production apparatus 100 according to an embodiment of the present invention is described.

For the application experiment, the target gestures were chosen from intuitive gestures that appear frequently in actual projection mapping content. The first is a gesture of gathering the stress hidden inside oneself, the second is a gesture of releasing the gathered stress outward, and the third is a gesture of kicking away the remaining emotional residue. Because the direction of movement of the energy blast and kick gestures differs from user to user, left and right were distinguished and image data were collected for a total of five gestures.

FIG. 7 illustrates the motion history image of each target gesture according to an embodiment of the present invention.

Referring to FIG. 7, the motion flow of each target gesture to be recognized is expressed as a motion history image, together with a description of the gesture.

As an example, the target gestures are: 'gather energy', slowly drawing a circle while raising both arms; 'energy blast - left', extending the arms to the left as if pushing something out from the palms; 'energy blast - right', extending the arms to the right as if pushing something out from the palms; 'kick - left', extending a foot to the left; and 'kick - right', extending a foot to the right.

FIG. 8 shows the interaction particle effect for each target gesture according to an embodiment of the present invention. As an example, the particle effects were built on openFrameworks.

Meanwhile, the interaction effect for each gesture is as important as having intuitive, easily understood gestures. A suitable interaction effect emphasizes the meaning of the gesture itself and makes the performer's intended message clear.

As an example, when the user stands at the designated position, the camera 10 recognizes the person and the projector 20 projects the particle effect labeled 'user recognition' in FIG. 8 around the user. When the 'gather energy' gesture begins, an effect is mapped in which the light that had been radiating around the user converges on the user's center. For 'energy blast', a mass of particles is launched from the center of the performer's two hands in the direction the hands extend, giving the impression that a beam is actually fired from the user's hands. For 'kick', flame-shaped particles are launched and burst in the direction the toes point at the end of the motion. The intuitive effect of something bursting from the toes helps the audience understand quickly.

As an example, regarding the construction of the training and test datasets according to an embodiment of the present invention, a motion history image-based gesture recognition experiment was carried out by applying a self-built dataset to the 5-layer CNN. The motion history images were captured at 30 fps and stored as 3-channel images of 320x240 pixels. Image data for each gesture were obtained from a total of 7 participants. Eight kinds of gesture image data were collected in total: 5 target gestures and 3 error motions. Table 3 shows the composition of the dataset used in the experiment. The ratio of the training dataset to the validation dataset is about 7:3.
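A split of roughly 7:3 between training and validation data can be produced as in the following sketch. The function name `split_dataset`, the fixed random seed, and the use of integers as stand-ins for image files are assumptions for illustration; the description does not say how the split was actually made.

```python
import random


def split_dataset(samples, train_ratio=0.7, seed=0):
    """Shuffle the samples reproducibly and cut them at train_ratio."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = round(len(items) * train_ratio)
    return items[:cut], items[cut:]


# Integers stand in for the motion history images of one gesture class.
train, validation = split_dataset(range(100))
print(len(train), len(validation))
```

Applying the split separately to each of the eight classes would keep the class proportions the same in both sets, which matters when some error classes are much larger than the target gesture classes.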

[Table 3] Composition of the training and experimental datasets

The error gestures, which fall outside the five target gestures, are data on the varied, unintended motions that users make while using the application.

FIG. 9 shows motion history images of the error motions, which were classified into three types for training, according to an embodiment of the present invention.

The first error gesture, 'Black', collects black screens in which almost no motion history was captured. It was added to rule out the possibility of the system reacting before a user has appeared. The second error gesture, 'etc', contains motion history images of large movements unrelated to the target gestures. It is the broadest class and includes most of the meaningless gestures that occur naturally when a user uses the platform proposed in the present invention. Finally, the third error gesture is 'ready'. It collects small movements unrelated to the target gestures, mainly the slight gestures that precede a target gesture. Because these movements are likely to lead into a meaningful target motion, they are error data that require care when the data are labeled.

Regarding the experimental environment and results, the gesture recognition training and the proposed-platform experiments used the PC environment and the CNN design described above. The distance between the participant and the camera was fixed at 2.25 m and the distance between the camera and the projector at 1.35 m. The camera height was fixed at 0.8 m to accommodate as wide a range of body types as possible. Participants stood directly facing the camera. The gesture recognition and effect verification experiments on the proposed platform used a Kinect V2 depth camera and a high-brightness beam projector rated at 5000 ANSI lumens.

Regarding the results for each CNN structure, [Table 4] shows the best-performing result for each filter count in the experiments with the 3-layer CNN. High test accuracy was obtained with 8 filters, but to reduce both the gap from the average train accuracy and the runtime, the experiment was repeated with the 5-layer CNN.

[Table 4] Recognition rate and runtime results of the 3-layer CNN

[Table 5] below shows the results of training experiments in which conditions (con) such as stride size and filter size were varied to reduce runtime in the 5-layer CNN structure.

[Table 5] Recognition rate and runtime results of the 5-layer CNN by stride size, filter size, learning rate, and batch size

The experiments confirmed that runtime dropped noticeably when the stride size in the 4th and 5th convolution layers (conv) was changed from 1x1 to 2x2. When the experiment was then repeated with full stride applied and part of the filter sizes changed, runtime decreased slightly further and the test and train accuracy were more stable than before.

FIG. 10 is a graph of the experimental results for [Table 4], 8 filters.

FIG. 11 is a graph of the experimental results for [Table 5], condition 3.

FIGS. 10 and 11 are graphs of the training results for the 3-layer CNN with 8 filters in [Table 4] and for the 5-layer CNN with the hyperparameters of condition 3 (con.3) in [Table 5], respectively. The accuracy of the 3-layer CNN varies widely, so even though its test accuracy is high, errors are likely when it is applied in an actual application. The present invention therefore adopts the hyperparameters of condition 3 of the 5-layer CNN, whose accuracy varies relatively little.

FIG. 12 illustrates screens in which media effects are applied according to gestures, according to an embodiment of the present invention.

Regarding the results of applying media effects according to gestures, FIG. 12 shows the interactive media content implemented with the proposed framework, dataset, and CNN structure described above. When the camera succeeds in recognizing a user who has just entered an empty space, the effect labeled 'user recognition' in FIG. 8 is mapped. When the user then performs the gather energy, energy blast, and kick motions, the particle effect corresponding to each gesture number is mapped onto the user's body, as shown in FIG. 12. For the CNN used for gesture training, the configuration that gave the most stable performance and the fastest speed among the 5-layer CNN variants, considering accuracy variation and run time, was applied to the dynamic projection mapping content experiment. The experiment confirmed that the appropriate particle effect is mapped in real time according to the user's gesture.

These experimental results show that the media content production apparatus 100 according to an embodiment of the present invention can provide a simple and intuitive framework with which creators who find production difficult can implement gesture recognition-based interactive content. Accordingly, even first-time users of the media content production apparatus 100 can link a dataset of easily understood gestures to the particle effect suited to each gesture and write the links as a numbered configuration file that the program reads and processes quickly.

Hereinafter, descriptions of components that perform the same functions as those shown in FIGS. 1 to 12 above are omitted.

FIG. 13 is a flowchart illustrating a method for producing dynamic projection mapping-linked gesture recognition interactive media content according to an embodiment of the present invention.

Referring to FIG. 13, the method for producing dynamic projection mapping-linked gesture recognition interactive media content of the present invention includes: reading a configuration file in which predetermined gestures and media particle effects are matched in advance, and generating a motion history image based on gesture images collected from a camera that captures the user's gestures (S110); inputting the motion history image to a gesture inference model and determining the gesture type of the user's current state based on the gesture type information output from the gesture inference model (S120); and, based on spatial mapping information between the camera 10 and the projector 20, projection-mapping the media particle effect corresponding to each gesture type determined as the user's gesture changes, while tracking the user's location (S130). The gesture inference model may be trained through a convolutional neural network (CNN).

In step S110, the motion history image may be generated by sequentially accumulating the consecutive frame images of the gesture video from a predetermined number of frames before the current frame up to the current frame.

When the motion history image is generated, the user's silhouette region is extracted from each frame image, the extracted silhouette information is stored in a circular queue, and the silhouette information is read from the circular queue in order from the oldest to the current frame and compressed into a grayscale image. Each frame image may consist of a silhouette region filled with a pixel value matched to its timestamp and a remaining background region filled with black.

In step S120, a gesture queue may be built from the gesture type information output by the gesture inference model, and per-frame noise may be removed using the gesture queue.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

10: camera
20: projector
100: media content creation device
110: memory
120: communication module
130: processor
140: display unit

Claims

1. A method for producing dynamic projection mapping-linked gesture recognition interactive media content performed by a device, the method comprising:
(a) reading a configuration file in which a predetermined gesture and media particle effect are matched in advance, and generating a motion history image based on a gesture image collected from a camera that captures a user's gesture;
(b) inputting the motion history image to a gesture inference model, and determining a gesture type of the user's current state based on gesture type information output from the gesture inference model; and
(c) based on spatial mapping information between the camera and a projector, projection-mapping the media particle effect corresponding to each gesture type determined according to changes in the user's gesture while tracking the position of the user,
wherein the gesture inference model is trained through a convolutional neural network (CNN),
A method for creating interactive media content with gesture recognition in conjunction with dynamic projection mapping.

2. The method of claim 1,
The step (a) is
Generating the motion history image by sequentially accumulating continuous frame images from before a predetermined frame to the present with respect to the gesture image based on the current frame time point,
A method for creating interactive media content with gesture recognition in conjunction with dynamic projection mapping.

3. The method of claim 2,
wherein, in generating the motion history image,
the user's silhouette area is extracted from each frame image and the extracted silhouette information is stored in a circular queue,
the silhouette information is read sequentially from the circular queue, from the past to the present, and compressed into a single grayscale image, and
each frame image consists of the silhouette area, filled with a pixel value corresponding to its time stamp, and the remaining background area, filled with black pixel values.
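
The accumulation described in claims 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes silhouettes are already available as boolean masks, uses `collections.deque` with `maxlen` as the circular queue, and assumes a linear gray ramp (older frames darker, the newest frame at 255) as the "pixel value corresponding to the time stamp". The function name `motion_history_image` is hypothetical.

```python
from collections import deque
from typing import Deque, List

Mask = List[List[bool]]  # True where the user's silhouette is


def motion_history_image(silhouettes: Deque[Mask]) -> List[List[int]]:
    """Compress queued silhouettes (oldest first) into one grayscale image."""
    n = len(silhouettes)
    if n == 0:
        raise ValueError("the silhouette queue is empty")
    height, width = len(silhouettes[0]), len(silhouettes[0][0])
    # The background stays black (0).
    mhi = [[0] * width for _ in range(height)]
    for i, mask in enumerate(silhouettes):
        # Assumed time-stamp encoding: a linear ramp ending at 255.
        value = 255 * (i + 1) // n
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    # Newer silhouettes overwrite older ones.
                    mhi[r][c] = value
    return mhi


# A circular queue holding the three most recent silhouettes.
queue: Deque[Mask] = deque(maxlen=3)

# Toy 1x4 frames: a one-pixel "silhouette" moving to the right.
for position in range(3):
    queue.append([[col == position for col in range(4)]])
first = motion_history_image(queue)

# A fourth frame pushes the oldest one out of the queue.
queue.append([[False, False, False, True]])
second = motion_history_image(queue)
```

Because the queue has a fixed length, appending a new silhouette automatically discards the oldest one, so the image always covers the same number of recent frames.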

4. The method of claim 3,
wherein step (b) comprises constructing a gesture queue from the gesture type information output from the gesture inference model, and removing noise for each frame image by using the gesture queue.
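
One plausible way to remove per-frame noise with a gesture queue is a majority vote over the most recent predictions, sketched below. The claim does not state which filtering rule is used, so the majority vote, the class name `GestureSmoother`, the queue size, and the gesture labels are all assumptions made for illustration.

```python
from collections import Counter, deque


class GestureSmoother:
    """Keep the last `size` predicted gesture types and report the most common one."""

    def __init__(self, size: int = 5) -> None:
        # A bounded deque serves as the gesture queue.
        self.queue = deque(maxlen=size)

    def update(self, label: str) -> str:
        """Add the newest per-frame prediction and return the smoothed gesture type."""
        self.queue.append(label)
        return Counter(self.queue).most_common(1)[0][0]


smoother = GestureSmoother(size=5)

# Hypothetical per-frame model outputs with one spurious "jump".
raw = ["wave", "wave", "jump", "wave", "wave"]
smoothed = [smoother.update(label) for label in raw]
```

With this rule a single misclassified frame does not switch the projected particle effect, at the cost of a delay of a few frames before a genuine gesture change is accepted.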

5. An apparatus for producing gesture recognition interactive media content linked to dynamic projection mapping, the apparatus comprising:
a memory storing a program for a method for producing gesture recognition interactive media content linked to dynamic projection mapping; and
a processor configured to execute the program stored in the memory,
wherein the processor, by executing the program, reads a configuration file in which a predetermined gesture and a media particle effect are matched in advance, and generates a motion history image based on a gesture image collected from a camera that captures a user's gesture,
inputs the motion history image to a gesture inference model, and determines a gesture type of the user's current state based on gesture type information output from the gesture inference model, and
tracks the location of the user and, based on spatial mapping information between the camera and a projector, projection-maps the media particle effect corresponding to each gesture type determined as the user's gesture changes, and
wherein the gesture inference model is trained through a convolutional neural network (CNN).

6. The apparatus of claim 5,
wherein the processor generates the motion history image by sequentially accumulating, for the gesture image, consecutive frame images from a predetermined number of frames before the current frame up to the current frame.

7. The apparatus of claim 6,
wherein the processor, in generating the motion history image,
extracts the user's silhouette area from each frame image and stores the extracted silhouette information in a circular queue, and
reads the silhouette information sequentially from the circular queue, from the past to the present, and compresses it into a single grayscale image, and
wherein each frame image consists of the silhouette area, filled with a pixel value corresponding to its time stamp, and the remaining background area, filled with black pixel values.

8. The apparatus of claim 7,
wherein the processor, in determining the gesture type, constructs a gesture queue from the gesture type information output from the gesture inference model, and removes noise for each frame image by using the gesture queue.

9. A non-transitory computer-readable recording medium having recorded thereon a computer program for performing the method for producing gesture recognition interactive media content linked to dynamic projection mapping according to claim 1.