KR20210040604A

KR20210040604A - Action recognition method and device

Info

Publication number: KR20210040604A
Application number: KR1020190123041A
Authority: KR
Inventors: 전문구; 유종민; 윤용상; 이윤관
Original assignee: 광주과학기술원
Priority date: 2019-10-04
Filing date: 2019-10-04
Publication date: 2021-04-14
Also published as: KR102334338B1

Abstract

The present invention relates to an action recognition method and device using an unsupervised learning method. An action recognition device according to one embodiment of the present invention may comprise: an input module for receiving an analysis target video and a reference video; a feature extraction module that uses an artificial neural network to extract features of the received analysis target video and of the received reference video frame by frame, and extracts from each set of extracted features the core features that are a subset of them; and a verification module that calculates the similarity between the extracted core features and outputs, from among the reference videos for which the similarity has been calculated, the reference video having the greatest similarity. The present invention can therefore classify an input video by analyzing an action that was not previously learned.

Description

Action recognition method and device

The present invention relates to an action recognition method and device, and more particularly to an action recognition method and device using an unsupervised learning method.

Surveillance systems have become an essential element of modern society for minimizing harm to people and property from crimes such as assault and theft and from unexpected incidents. However, because it is in practice impossible for a human operator to watch every camera feed, there has been a demand for surveillance systems that analyze the acquired video in real time and detect objects within the monitored area. In response to this demand, deep learning-based action recognition methods, which derive from algorithms modeled on human neural networks and can classify data, have recently attracted attention.

Recognizing human actions is a key technology for developing surveillance systems. Over the past few years, many action recognition methods have been proposed. Conventional action recognition technology is based on supervised learning, one of the deep learning approaches: it is given a number of training videos labeled with actions, extracts features from those training videos through supervised learning, and classifies an input video into one of the categories of the training videos.

However, these methods can classify only predefined actions, so the actions they can detect are very limited. Because human behavior is diverse and complex, applying such recognition models is inevitably restricted. In addition, adding an action to be classified requires a large amount of additional training video data for that action, so each new action entails retraining the model and changing its structure, which makes the process complicated.

Accordingly, there is a need for a new kind of action recognition method that can recognize actions not learned in advance, that simplifies retraining when new actions are added, and that requires no change to the structure of the model.

H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3551-3558, 2013.
K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568-576, 2014.
L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4597-4605, 2015.
L. Wang, Y. Qiao, and X. Tang. MoFAP: A multi-level representation for action recognition. International Journal of Computer Vision, 119(3):254-271, 2016.
Z. Lan, Y. Zhu, A. G. Hauptmann, and S. Newsam. Deep local video feature for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1-7, 2017.
A. Diba, V. Sharma, and L. Van Gool. Deep temporal linear encoding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2329-2338, 2017.
J. Zhu, Z. Zhu, and W. Zou. End-to-end video-level representation learning for action recognition. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 645-650. IEEE, 2018.

The present invention is intended to solve the problems described above, and an object of the present invention is to provide an action recognition method and device using an unsupervised learning method.

Another object of the present invention is to provide an action recognition method and device capable of classifying actions that have not previously been learned.

The objects of the present invention are not limited to those described above, and other objects not mentioned will be clearly understood from the description below.

An action recognition method according to an embodiment of the present invention may include: (a) receiving an analysis target video and a first reference video from among a plurality of reference videos; (b) extracting features of the analysis target video and features of the first reference video frame by frame using an artificial neural network; (c) extracting, from the extracted features of the analysis target video and of the first reference video respectively, core features that are a subset of those features; and (d) calculating a similarity between the respectively extracted core features.

After step (d), the method may further include: (e) receiving a second reference video, different from the first reference video, from among the plurality of reference videos; (f) calculating a similarity between the features of the analysis target video and the features of the second reference video; and (g) outputting whichever of the first reference video and the second reference video has the greater calculated similarity.

After step (g), the method may further include calculating a similarity between the features of the analysis target video and one of the remaining reference videos among the plurality of reference videos, and outputting the reference video having the greatest similarity.

After step (d), the method may further include repeating steps (b), (c), and (d) for the reference videos other than the first reference video, and outputting the reference video having the greatest of the calculated similarities.
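The comparison loop in steps (d) through (g) can be sketched as follows. This is only a minimal illustration: the text above does not say which similarity measure is used, so cosine similarity and the best-match averaging in `video_similarity` are assumptions, and all function names and the toy two-dimensional features are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def video_similarity(query_features, reference_features):
    """Assumed set-to-set score: for each query core feature, take its best
    match among the reference core features, then average over the query."""
    best_matches = [
        max(cosine_similarity(q, r) for r in reference_features)
        for q in query_features
    ]
    return sum(best_matches) / len(best_matches)

def most_similar_reference(query_features, references):
    """Score the query against every reference and keep the highest score."""
    best_label, best_score = None, -math.inf
    for label, reference_features in references.items():
        score = video_similarity(query_features, reference_features)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

# Toy core features standing in for the output of steps (b) and (c).
references = {
    "walking": [[1.0, 0.0], [0.9, 0.1]],
    "falling": [[0.0, 1.0], [0.1, 0.9]],
}
query = [[0.2, 1.0], [0.0, 0.8]]
label, score = most_similar_reference(query, references)
print(label, round(score, 3))
```

Because the output is the best-scoring reference rather than a class index from a fixed output layer, the number of reference videos can change without touching the feature extractor.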

The method may further include a training step before step (a), and the training step may include: receiving an anchor video containing action information, a positive video containing the same action information as the anchor video, and a negative video containing action information different from that of the anchor video; calculating a similarity between the anchor video and the positive video and a similarity between the anchor video and the negative video; and calculating a loss function using the calculated similarities.
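A loss built from the anchor-positive and anchor-negative similarities might look like the sketch below. The text above does not give the form of the loss, so the margin-based (triplet-style) hinge, the cosine similarity, and the `margin` value are all assumptions chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the anchor is more similar to the
    positive than to the negative by at least `margin` (hypothetical form)."""
    sim_pos = cosine_similarity(anchor, positive)
    sim_neg = cosine_similarity(anchor, negative)
    return max(0.0, margin - sim_pos + sim_neg)

anchor = [1.0, 0.0]
positive = [0.9, 0.1]   # same action as the anchor
negative = [0.0, 1.0]   # different action

well_separated = triplet_loss(anchor, positive, negative)
# Swapping the roles simulates an embedding that confuses the two actions.
confused = triplet_loss(anchor, negative, positive)
print(well_separated, round(confused, 3))
```

Minimizing a loss of this kind pushes videos of the same action toward high similarity and videos of different actions toward low similarity, which is what the later nearest-reference comparison relies on.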

In step (a), the analysis target video may be received from outside, and the reference video may be received from a database unit in which the reference video is stored in advance.

Step (b) may include: generating a plurality of feature maps for each frame; grouping feature maps of the same size among the feature maps into blocks; and extracting features such that the feature map of each layer within a block receives as input the output values of the feature maps of all layers placed before it.

Step (c) may include extracting the core features using a K-means clustering algorithm.

Step (d) may include calculating the similarity after selecting, from among the extracted core features, only those core features that belong to the interval in which an action occurs in the analysis target video and in the first reference video.

A computer-readable recording medium according to an embodiment of the present invention may store a program for performing the above-described method on a computer.

An action recognition device according to an embodiment of the present invention may include: an input module that receives an analysis target video and a reference video; a feature extraction module that uses an artificial neural network to extract features of the received analysis target video and features of the received reference video frame by frame, and extracts from each the core features that are a subset of the extracted features; and a verification module that calculates the similarity between the extracted core features and outputs, from among the reference videos for which the similarity has been calculated, the reference video having the greatest similarity.

The device may further include a database unit that stores the reference video.

The device may further include a training module that receives the similarity between an anchor video and a positive video and the similarity between the anchor video and a negative video, calculated from an anchor video containing action information, a positive video containing the same action information as the anchor video, and a negative video containing action information different from that of the anchor video, and that trains the action recognition method of claim 1 by calculating a loss function using the received similarities.

The input module may receive the analysis target video from outside and may receive the reference video from the database unit.

The feature extraction module may generate a plurality of feature maps for each frame, group feature maps of the same size among the feature maps into blocks, and extract features such that the feature map of each layer within a block receives as input the output values of the feature maps of all layers placed before it.

The feature extraction module may extract the core features using a K-means clustering algorithm.

The verification module may calculate the similarity after selecting, from among the extracted core features, only those core features that belong to the interval in which an action occurs in the analysis target video and in the reference video.

An action recognition method according to an embodiment of the present invention can classify an input video by analyzing even an action that was not previously learned.

In addition, because the method allows the action recognition device to learn the actions of objects by itself, an action recognition method according to an embodiment of the present invention makes unmanned operation of a surveillance system possible.

In addition, an action recognition method according to an embodiment of the present invention requires no change to the structure of the model when new actions are added, which minimizes cost.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a diagram illustrating an action recognition device according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating, by way of example, the feature extraction module of an action recognition device according to an embodiment of the present invention extracting features of a video frame by frame.
FIG. 3 is a diagram illustrating, by way of example, the feature extraction module of an action recognition device according to an embodiment of the present invention extracting core features.
FIG. 4 is a diagram illustrating, by way of example, the verification module of an action recognition device according to an embodiment of the present invention selecting, from among the extracted core features, only those belonging to the interval of the video in which an action occurs.
FIG. 5 is a diagram illustrating, by way of example, the training process of an action recognition device according to an embodiment of the present invention.
FIG. 6 is a flowchart of an action recognition method according to an embodiment of the present invention.

The specific structural and functional descriptions of the embodiments of the present invention disclosed in this specification or application are given only for the purpose of describing embodiments according to the present invention. Embodiments according to the present invention may be implemented in various forms and should not be construed as limited to the embodiments described in this specification or application.

Because embodiments according to the present invention can be modified in various ways and can take various forms, specific embodiments are illustrated in the drawings and described in detail in this specification or application. This is not intended to limit embodiments according to the concept of the present invention to a particular disclosed form, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

In this specification, terms such as "first" and/or "second" are used only to distinguish one component from another. That is, they are not intended to limit the components.

Components, features, and steps referred to in this specification with the expression "comprise" mean that those components, features, and steps are present, and are not intended to exclude one or more other components, features, steps, or their equivalents.

Unless specifically stated to be singular, expressions in this specification include the plural. That is, a component or the like mentioned in this specification may imply the presence or addition of one or more other components or the like.

Unless otherwise defined, all terms used in this specification, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs.

That is, terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in this specification.

Hereinafter, the present invention is described in detail by describing preferred embodiments of the present invention with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating an action recognition device according to an embodiment of the present invention.

Referring to FIG. 1, an action recognition device 10 according to an embodiment of the present invention may include an input module 100, a database unit 200, a feature extraction module 300, a verification module 400, and a training module 500.

The action recognition device 10 may use a convolutional neural network (CNN) to recognize the actions of an object in various kinds of video, including real-time video. The convolutional neural network may be a set of algorithms that extracts various information from a video and recognizes the actions of an object in the video based on the extracted information.

The action recognition device 10 may be an imaging device such as a camera, a CCTV unit, or a dashboard camera, but is not limited thereto; it may be a computing device that includes an imaging device or that communicates with an imaging device to receive video. For example, the action recognition device 10 may be a smartphone, a tablet PC, a PC, a smart TV, a mobile phone, a navigation device, an IoT device, a home appliance, or the like.

The input module 100 may receive an analysis target video and a reference video.

The analysis target video is a video input from outside the action recognition device 10, and is the video in which the action recognition device 10 is to recognize the actions of an object. The analysis target video may, for example, be input to the input module 100 in real time through a camera of the action recognition device 10, or be input through the input module 100 of the action recognition device 10 as a pre-made video clip.

The reference video is compared with the analysis target video so that the action recognition device 10 can recognize the actions of an object in the analysis target video. The reference video may be input to the input module 100 from the database unit 200, in which reference videos of various categories are stored.

As described above, the database unit 200 is a storage space in which reference videos of various categories can be stored.

The database unit 200 may transmit reference videos to the input module 100 and may receive reference videos of new categories from outside. A user can add a new reference video by inputting it into the database unit 200.

Conventionally, adding a new target action class to an action recognition device required a training process in which a large amount of video data for that class was input. The action recognition device 10 according to an embodiment of the present invention, by contrast, can classify into a new target action class simply by adding a reference video of the new category to the database unit 200, extracting the features of that reference video, and calculating its similarity to the analysis target video. In other words, the action recognition method needs neither retraining nor structural change, and classification is possible even for actions that were not previously learned.
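The point that a new category is added by storing a reference rather than by retraining can be illustrated with a small sketch. The `ReferenceDatabase` class, its method names, and the similarity score (negative squared distance between mean core features) are hypothetical stand-ins, not the database unit 200 or the similarity measure of the specification.

```python
def mean_vector(features):
    """Average a list of equal-length feature vectors."""
    return [sum(column) / len(features) for column in zip(*features)]

def similarity(query_features, reference_features):
    """Assumed score: negative squared distance between mean core features."""
    q = mean_vector(query_features)
    r = mean_vector(reference_features)
    return -sum((a - b) ** 2 for a, b in zip(q, r))

class ReferenceDatabase:
    """Hypothetical store mapping a category name to reference core features."""

    def __init__(self):
        self._entries = {}

    def add_reference(self, category, core_features):
        # Adding a category only stores features; no model parameter changes.
        self._entries[category] = [list(f) for f in core_features]

    def classify(self, query_features):
        # Return the category whose reference is most similar to the query.
        return max(
            self._entries,
            key=lambda c: similarity(query_features, self._entries[c]),
        )

database = ReferenceDatabase()
database.add_reference("walking", [[1.0, 0.0], [0.9, 0.1]])
database.add_reference("sitting", [[0.5, 0.5], [0.4, 0.6]])

query = [[0.1, 0.9], [0.0, 1.0]]
before = database.classify(query)    # best match among the known categories

database.add_reference("falling", [[0.0, 1.0], [0.1, 0.8]])
after = database.classify(query)     # the new category is usable immediately
print(before, after)
```

The same query is reassigned as soon as a closer reference exists, with the feature extractor untouched throughout.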

The feature extraction module 300 may extract the features of the received analysis target video and the features of the reference video frame by frame, and may extract core features from among the extracted features.

The feature extraction module 300 may receive from the input module 100 the analysis target video and the reference video that the input module 100 has received. The feature extraction module 300 may use an artificial neural network model to extract features from the analysis target video and the reference video.

For example, the feature extraction module 300 may use a deep neural network model such as DenseNet. The deep neural network model may have a structure including an input layer, convolution layers, pooling layers, and an output layer. Using a deep neural network model such as DenseNet alleviates the vanishing-gradient problem that arises during feature extraction, preserves low-level features better because the features extracted along the way are accumulated in a chain, and reduces the number of parameters, which lowers the amount of computation and allows good training results even with relatively little data.

FIG. 2 is a diagram illustrating, by way of example, the process by which an action recognition device according to an embodiment of the present invention extracts features.

Referring to FIG. 2, the action recognition device 10 uses a deep neural network model to extract the features of an input video frame by frame.

The feature extraction module 300 may extract features from the received video frame by frame. That is, it extracts as many features from each video as that video has frames. Because the number of frames differs from one input video to another, the number of extracted features also differs from video to video.
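The one-feature-per-frame behavior described above can be shown with a toy extractor. The linear-projection-plus-ReLU function below is only a placeholder for the deep neural network; `WEIGHTS`, the four-value "frames", and the function names are hypothetical.

```python
def extract_frame_feature(frame, weights):
    """Toy stand-in for the network: a linear projection followed by ReLU."""
    return [
        max(0.0, sum(w * p for w, p in zip(row, frame)))
        for row in weights
    ]

def extract_video_features(frames, weights):
    """Apply the same extractor, with the same parameters, to every frame."""
    return [extract_frame_feature(frame, weights) for frame in frames]

# Fixed parameters shared by all frames (2 output dimensions, 4 inputs).
WEIGHTS = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, -0.5],
]

# Two "videos" of different lengths; each frame is a flat list of 4 values.
short_clip = [[0.1 * t, 0.2, 0.3, 0.4] for t in range(3)]
long_clip = [[0.4, 0.3, 0.2, 0.1 * t] for t in range(5)]

short_features = extract_video_features(short_clip, WEIGHTS)
long_features = extract_video_features(long_clip, WEIGHTS)
print(len(short_features), len(long_features))
```

The feature count tracks the frame count, which is why a later step that reduces each video to a fixed number of core features makes videos of different lengths easier to compare.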

One frame of the video is input to the deep neural network model, which may generate a plurality of feature maps through its convolution and pooling layers. Feature maps of the same size are grouped into blocks, and the feature map of each layer within a block may receive as input the output values of the feature maps of all layers placed before it.

The deep neural network model may be used to extract features from the input video data. The deep neural network model may include a plurality of layers and blocks. Each layer may receive input data and process it to generate output data. That is, the deep neural network model may generate feature maps by convolving a frame of the input video in the convolution layers and subsample them in the pooling layers, and it may refine the features of the video by having the feature map of each layer within a block receive the output values of the feature maps of all earlier layers, so that features accumulate.

Feature maps in the lower layers are output early in the feature extraction process, and low-level features such as gradients may be extracted from them. Feature maps in the upper layers accumulate information from the lower-layer feature maps, so features more complex than those of the lower layers, such as human facial expressions and silhouettes, may be extracted from them.

Conventionally, a feature map received as input only the output of the feature map in the adjacent lower layer. In the action recognition device 10 according to an embodiment of the present invention, the feature map of each layer can also receive the outputs of the feature maps of all layers placed before it, so the features become more refined as they pass through the layers and blocks of the deep neural network model, while the loss of information that would otherwise accompany this refinement is prevented. More precise feature extraction than in the prior art is therefore possible.
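The dense connectivity described above, in which each layer sees the outputs of all earlier layers in its block, can be sketched on plain vectors. Real DenseNet layers operate on two-dimensional feature maps with convolutions; the fully connected layers, the random weights, and the names `dense_block` and `make_weights` are simplifying assumptions.

```python
import random

def dense_block(block_input, layer_weights):
    """Each layer receives the concatenation of the block input and the
    outputs of every earlier layer, and contributes its own new output."""
    outputs = [block_input]
    for weights in layer_weights:
        concatenated = [value for out in outputs for value in out]
        new_output = [
            max(0.0, sum(w * c for w, c in zip(row, concatenated)))
            for row in weights
        ]
        outputs.append(new_output)
    # The block output keeps everything, so early features are not lost.
    return [value for out in outputs for value in out]

def make_weights(input_size, growth, num_layers, seed=0):
    """Random weights; layer l has input_size + l * growth inputs."""
    rng = random.Random(seed)
    layers = []
    for layer_index in range(num_layers):
        fan_in = input_size + layer_index * growth
        layers.append(
            [[rng.uniform(-1.0, 1.0) for _ in range(fan_in)]
             for _ in range(growth)]
        )
    return layers

x = [0.2, 0.4, 0.6, 0.8]
weights = make_weights(input_size=len(x), growth=3, num_layers=2)
block_output = dense_block(x, weights)
print(len(block_output))
```

The input vector reappears unchanged at the front of the block output, which is the mechanism by which low-level features survive to the later layers.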

The algorithm for extracting features from a frame of the input image may be expressed as Equation 1 below.

[Equation 1]

Here, the symbols of Equation 1 denote, in order, the feature extraction function, the i-th frame of the image, the parameters of the feature extraction function, and the feature extracted from the i-th frame of the image.

The feature extraction module 300 may extract core features from the features extracted from the frames of the image. To reduce the amount of computation and improve the memory efficiency of the action recognition device 10, only the core features among the features extracted from all frames of the input image are retained. The feature extraction module 300 may use a K-means clustering algorithm to extract the core features, which are a subset of the extracted features.

FIG. 3 is a diagram illustrating, by way of example, how the action recognition device according to an embodiment of the present invention extracts core features, which are a subset of the features extracted frame by frame.

Referring to FIG. 3, core features are extracted from among the features extracted from each frame of the image.

The K-means clustering algorithm that the feature extraction module 300 uses to extract the core features may be expressed as Equation 2 below.

[Equation 2]

Here, argmin denotes the argmin function, and the remaining symbols denote, in order, the feature extracted from the j-th frame of the image and a core feature belonging to the set of K-means clustering centroids.

That is, the features finally extracted from the input image reduce to the set of core features.
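As an illustration of reducing per-frame features to a small set of core features, the following sketch runs plain Lloyd-style K-means and returns the centroids. The function name `core_features`, the evenly spaced initialization, and the toy two-dimensional data are assumptions made for this example; the patent does not specify them.

```python
from typing import List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> Vector:
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def core_features(frame_features: List[Vector], k: int, iterations: int = 20) -> List[Vector]:
    """Return k K-means centroids of the per-frame features."""
    # Assumption: deterministic initialization from k evenly spaced frames.
    step = len(frame_features) / k
    centroids = [list(frame_features[int(i * step)]) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each frame feature to its nearest centroid.
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for f in frame_features:
            nearest = min(range(k), key=lambda c: squared_distance(f, centroids[c]))
            clusters[nearest].append(f)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [mean_vector(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Six frame features forming two obvious groups (toy data).
frames = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]]
keys = core_features(frames, k=2)
```

Only `k` vectors are kept regardless of the number of frames, which is where the reduction in computation and memory comes from. Note that centroids are cluster means, so they need not coincide with any single frame's feature.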

The core features extracted with the K-means clustering algorithm may be stored in a first storage unit (not shown) of the feature extraction module 300. The first storage unit may be, for example, a storage device such as a hard disk drive (HDD), but is not limited thereto. Because the extracted core features are stored, the stored data can be reused whenever needed later in the action recognition process.

The verification module 400 may calculate the similarity between the core features extracted from the analysis target image and those extracted from each reference image, and may output the reference image having the maximum similarity to the analysis target image among the reference images for which the similarity was calculated.

The verification module 400 may receive the core features extracted by the feature extraction module 300 and use them to calculate the similarity between the analysis target image and a reference image. When calculating the similarity between the core features of the analysis target image and those of the reference image, the verification module 400 may select, from among the extracted core features, only those belonging to the section of each image in which an action occurs.

FIG. 4 is a diagram illustrating, by way of example, how the verification module of the action recognition device according to an embodiment of the present invention selects, from among the extracted core features, only those belonging to the section of an image in which an action occurs.

Referring to FIG. 4, among the core features of the two images, only those belonging to the sections in which an action occurs are considered when calculating the similarity.

Temporal information of the image may be lost when the feature extraction module 300 extracts the core features. In addition, because the number of frames and the section in which the action occurs differ from image to image, it is difficult to compare the images directly. Therefore, the similarity between images may be calculated by selecting, from among the extracted core features, only those belonging to the section of the input image in which the action occurs.

In the matrix shown in FIG. 4, the entries marked '1' indicate the sections to be considered when calculating the similarity. That is, when the similarity between images with different playback times is calculated, only the core features belonging to the section of each image in which the action occurs are selected and used.

The algorithm used by the verification module 400 to measure the similarity may be expressed as Equation 3 below.

[Equation 3]

Here, S is the similarity function, and the remaining symbols denote, in order, a function for calculating the magnitude of a vector, the Heaviside function, and a hyperparameter for the similarity level.
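The exact form of Equation 3 is not recoverable from the text, so the following is only one plausible reading built from the listed ingredients: a vector magnitude (used here inside a cosine similarity), a Heaviside step, and a threshold hyperparameter. The names `tau`, `selection_mask`, and `similarity`, the choice H(0) = 1, and the averaging over all pairs are assumptions of this sketch. The binary mask is assumed to play the role of the '1' entries in the matrix of FIG. 4.

```python
import math
from typing import List

Vector = List[float]


def norm(v: Vector) -> float:
    """Euclidean magnitude of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero magnitude."""
    denom = norm(a) * norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0


def heaviside(x: float) -> float:
    """Unit step function; H(0) = 1 is a convention chosen for this sketch."""
    return 1.0 if x >= 0.0 else 0.0


def selection_mask(target: List[Vector], reference: List[Vector], tau: float) -> List[List[float]]:
    """Binary matrix: 1 where a pair of core features is similar enough to count."""
    return [[heaviside(cosine(t, r) - tau) for r in reference] for t in target]


def similarity(target: List[Vector], reference: List[Vector], tau: float = 0.5) -> float:
    """Average of the pairwise cosine similarities that pass the threshold."""
    mask = selection_mask(target, reference, tau)
    total = sum(
        mask[i][j] * cosine(t, r)
        for i, t in enumerate(target)
        for j, r in enumerate(reference)
    )
    return total / (len(target) * len(reference))


target_core = [[1.0, 0.0], [0.0, 1.0]]
reference_core = [[1.0, 0.0], [1.0, 1.0]]
mask = selection_mask(target_core, reference_core, tau=0.5)
score = similarity(target_core, reference_core, tau=0.5)
```

In the toy data, the orthogonal pair is masked out and contributes nothing, so the score depends only on pairs judged to belong to a common action.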

The verification module 400 may output the reference image having the maximum similarity. The action recognition device 10 according to an embodiment of the present invention may calculate the similarity between an externally input analysis target image and each of the plurality of reference images contained in the database unit 200. That is, after calculating the similarity for one reference image, it may repeat the calculation for each of the other reference images stored in the database unit 200.

The calculated similarities may be stored in a second storage unit (not shown) of the verification module 400. The second storage unit may be, for example, a storage device such as a hard disk drive, but is not limited thereto. Because the calculated similarities are stored in the second storage unit, the stored data can be reused whenever needed later in the action recognition process.

The verification module 400 may calculate the similarity for each reference image, store it in the second storage unit (not shown), and output the reference image having the maximum similarity among the stored similarities. That is, the action of the object in the analysis target image may be classified into the category of the reference image having the maximum similarity.

The algorithm by which the verification module 400 outputs the reference image having the maximum similarity may be expressed as Equation 4 below.

[Equation 4]

Here, argmax denotes the argmax function, i is the index of a reference image stored in advance in the database unit, S is the similarity function of Equation 3, and the remaining symbols denote, in order, the features extracted from the analysis target image and the features extracted from the i-th reference image of the database unit.
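The selection step of Equation 4 amounts to an argmax over the per-reference similarities. A minimal sketch, assuming the similarities have already been computed and that each reference image carries a category label (`labels`, `scores`, and `argmax_reference` are hypothetical names; the values are made up for illustration):

```python
from typing import List


def argmax_reference(similarities: List[float]) -> int:
    """Return the index i of the reference image with the largest similarity.

    Ties resolve to the lowest index, which is how Python's max() behaves.
    """
    if not similarities:
        raise ValueError("at least one reference image is required")
    return max(range(len(similarities)), key=similarities.__getitem__)


# Toy values: one similarity per stored reference image.
labels = ["tennis", "dumbbell", "running"]
scores = [0.62, 0.17, 0.35]

i_star = argmax_reference(scores)
predicted = labels[i_star]
```

The category of the selected reference image is then taken as the classification of the analysis target image.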

As described above, the action recognition device 10 according to an embodiment of the present invention can classify even an action it has not previously learned, without retraining the action recognition method: a new reference image containing that action is input to the action recognition device 10, features are extracted from the reference image, and the similarity between images is calculated. In addition, because the features are extracted using a deep neural network model, more precise feature extraction is possible.

FIG. 5 is a diagram illustrating, by way of example, a training process of the action recognition device according to an embodiment of the present invention.

Referring to FIG. 5, in the training process of the action recognition device 10 according to an embodiment of the present invention, training samples are received and the action recognition method is then trained through a specific algorithm.

The training module 500 may train the action recognition method based on the training samples received by the action recognition device 10 according to an embodiment of the present invention.

The training samples received in advance by the action recognition device 10 are an anchor image (anchor video clip), a positive image (positive video clip), and a negative image (negative video clip). The training module 500 may train the action recognition method by extracting features from each of these images and calculating the similarities between the extracted features.

A positive image is an image belonging to the same category as the anchor image, and a negative image is an image belonging to a different category. For example, if the anchor image contains the action of playing tennis, the positive image may also contain the action of playing tennis. The negative image, by contrast, may contain an action from a different category, such as lifting a dumbbell.

After receiving the training samples, the action recognition device 10 according to an embodiment of the present invention may extract the features of the images included in the training samples and calculate the similarities between the images. That is, the action recognition method may be trained by calculating the similarity between the anchor image and the positive image and the similarity between the anchor image and the negative image, and then calculating a loss function from the two similarities. The training sample input, feature extraction, and similarity calculation may be performed by the input module 100, the feature extraction module 300, and the verification module 400, respectively.

The algorithm used in the training process performed by the training module 500 may be expressed as Equation 5 below.

[Equation 5]

Here, L is the loss function, max is the max function, S is the similarity function of Equation 3, and the remaining symbols denote, in order, the anchor image, the positive image, the negative image, and the minimum dissimilarity.
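The ingredients listed for Equation 5 (a max function, two similarities, and a minimum dissimilarity) match the usual shape of a triplet loss. The sketch below assumes the common hinge form with the minimum dissimilarity acting as a margin; the exact arrangement of terms in the patent's equation is not recoverable from the text, and `triplet_loss`, `batch_loss`, and `margin` are names chosen for this example.

```python
from typing import List, Tuple


def triplet_loss(sim_anchor_positive: float, sim_anchor_negative: float, margin: float) -> float:
    """Hinge-style triplet loss on similarities (assumed form).

    The loss is zero once the anchor is more similar to the positive than
    to the negative by at least `margin`; otherwise it grows linearly.
    """
    return max(0.0, sim_anchor_negative - sim_anchor_positive + margin)


def batch_loss(similarity_pairs: List[Tuple[float, float]], margin: float) -> float:
    """Mean triplet loss over (S(anchor, positive), S(anchor, negative)) pairs."""
    losses = [triplet_loss(sp, sn, margin) for sp, sn in similarity_pairs]
    return sum(losses) / len(losses)


# Well-separated triplet: no penalty.
easy = triplet_loss(sim_anchor_positive=0.9, sim_anchor_negative=0.2, margin=0.3)
# Poorly separated triplet: positive penalty.
hard = triplet_loss(sim_anchor_positive=0.5, sim_anchor_negative=0.4, margin=0.3)
average = batch_loss([(0.9, 0.2), (0.5, 0.4)], margin=0.3)
```

Minimizing such a loss pushes the feature extractor to make same-category images more similar than different-category images, which is what allows classification by similarity alone at inference time.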

FIG. 6 is a flowchart of an action recognition method according to an embodiment of the present invention.

Referring to FIG. 6, the action recognition method according to an embodiment of the present invention may include: receiving an analysis target image and a first reference image from among a plurality of reference images (S602); extracting the features of the analysis target image and the features of the first reference image frame by frame (S603); extracting, for each image, core features that are a subset of the extracted features (S604); and calculating the similarity between the extracted core features (S605). The method may further include repeating steps S603 to S605 for the remaining reference images other than the first reference image and outputting the reference image having the greatest similarity among the calculated similarities (S606). The method may further include training the action recognition method (S601).

In the training step (S601), training samples are received, namely an anchor image containing action information, a positive image containing the same action information as the anchor image, and a negative image containing action information different from that of the anchor image. The similarity between the anchor image and the positive image and the similarity between the anchor image and the negative image are calculated, and a loss function is calculated from these similarities to perform the training.

As described above, the anchor, positive, and negative images serving as training samples are received, their features are extracted, the similarities are calculated, and the training module 500 may train the action recognition method using the algorithm of Equation 5. Unlike the prior art, once training is complete, adding a new target action to classify requires neither retraining of the action recognition method nor a change to its structure.

In the step of receiving an analysis target image and a first reference image from among a plurality of reference images (S602), the input module 100 of the action recognition device 10 receives the analysis target image and the first reference image.

The analysis target image may be input from the outside and is the image in which the action recognition device 10 is to recognize the action of an object. The first reference image may be input from the database unit 200, in which a plurality of reference images are stored in advance, and is compared with the analysis target image so that the action recognition device 10 can recognize the action of the object in the analysis target image.

In the step of extracting the features of the analysis target image and the features of the first reference image frame by frame (S603), features are extracted from each frame of the analysis target image and of the first reference image received by the input module 100.

The feature extraction module 300 may receive the input images from the input module 100. The feature extraction module 300 may extract features from each frame of each image using an artificial neural network, for example a deep neural network model such as DenseNet. The features may be extracted by generating feature maps from the input frame using the deep neural network model.

In the step of extracting, for each image, core features that are a subset of the extracted features (S604), the core features of each image are extracted from the features obtained through the artificial neural network using a K-means clustering algorithm.

After extracting the features of each image using the deep neural network model, the feature extraction module 300 may use the K-means clustering algorithm to extract the core features, which are a subset of the extracted features.

In the step of calculating the similarity between the extracted core features (S605), the similarity between the core features extracted by the feature extraction module 300 is calculated.

The verification module 400 may receive information on the core features extracted by the feature extraction module 300 and calculate the similarity between the core features of the received images. Because the section in which the action occurs differs from image to image, and to compensate for the temporal information lost during core feature extraction, the similarity may be calculated using only the core features belonging to the section of each image in which the action occurs.

In step S606, the input module 100 receives the remaining reference images stored in advance in the database unit 200, other than the first reference image, steps S603 to S605 are repeated for them, and the reference image having the greatest similarity among the calculated similarities is output.

To recognize the action of an object in the analysis target image, the action recognition device 10 according to an embodiment of the present invention may calculate the similarity between the analysis target image and each of the plurality of reference images stored in the database unit 200. Accordingly, after the similarity between the analysis target image and the first reference image has been calculated, the input module 100 receives the remaining reference images and the above steps are repeated for each of them. The verification module 400 may then output the reference image having the greatest similarity among the calculated similarities, thereby classifying the action of the object in the analysis target image.
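The flow of steps S602 to S606 can be sketched as a composition of three pluggable stages. Everything below is illustrative: `recognize` and the `toy_*` stand-ins are hypothetical, the toy stages replace the neural network, K-means, and Equation 3 respectively, and the target features are computed once rather than on every iteration.

```python
from typing import Callable, Dict, List

Vector = List[float]


def recognize(
    target_frames: List[Vector],
    reference_db: Dict[str, List[Vector]],
    extract_features: Callable[[Vector], Vector],
    extract_core: Callable[[List[Vector]], List[Vector]],
    similarity: Callable[[List[Vector], List[Vector]], float],
) -> str:
    """Return the name of the reference image most similar to the target."""
    def core_of(frames: List[Vector]) -> List[Vector]:
        # S603: per-frame features, then S604: core features.
        return extract_core([extract_features(f) for f in frames])

    target_core = core_of(target_frames)
    # S602 and S605, repeated for every stored reference image.
    scores = {name: similarity(target_core, core_of(frames))
              for name, frames in reference_db.items()}
    # S606: output the reference with the greatest similarity.
    return max(scores, key=scores.get)


# Toy stand-ins (assumptions, not the patented components).
def toy_features(frame: Vector) -> Vector:
    return [2.0 * v for v in frame]


def toy_core(features: List[Vector]) -> List[Vector]:
    return features[::2]  # keep every other feature


def toy_similarity(a: List[Vector], b: List[Vector]) -> float:
    mean_a = [sum(c) / len(a) for c in zip(*a)]
    mean_b = [sum(c) / len(b) for c in zip(*b)]
    return -sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))


database = {
    "tennis": [[1.0, 0.1], [0.8, 0.0], [0.9, 0.0]],
    "dumbbell": [[0.0, 1.0], [0.1, 0.9]],
}
result = recognize([[1.0, 0.0], [0.9, 0.1]], database,
                   toy_features, toy_core, toy_similarity)
```

Adding a new action category in this arrangement only means adding an entry to `database`; none of the stages needs to change, which mirrors the no-retraining property described in the text.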

An action recognition method according to another embodiment of the present invention may include: receiving an analysis target image and a first reference image from among a plurality of reference images (S602); extracting the features of the analysis target image and the features of the first reference image frame by frame (S603); extracting, for each image, core features that are a subset of the extracted features (S604); calculating the similarity between the extracted core features (S605); receiving a second reference image, different from the first reference image, from among the plurality of reference images; calculating the similarity between the features of the analysis target image and the features of the second reference image; outputting whichever of the first and second reference images has the greater calculated similarity; and calculating the similarity between the features of the analysis target image and those of one of the remaining reference images, and outputting the reference image having the greatest similarity.
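This alternative embodiment compares reference images one at a time and keeps only the better of the current best and the newcomer. A minimal sketch of that running comparison, assuming core features are already available for each image (`running_best`, `reference_stream`, and the dot-product `toy_similarity` are hypothetical names used only here):

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

Vector = List[float]
Core = List[Vector]


def running_best(
    target_core: Core,
    references: Iterable[Tuple[str, Core]],
    similarity: Callable[[Core, Core], float],
) -> Tuple[Optional[str], float]:
    """Scan references in order, retaining only the best one seen so far."""
    best_name: Optional[str] = None
    best_score = float("-inf")
    for name, reference_core in references:
        score = similarity(target_core, reference_core)
        if score > best_score:  # the newcomer replaces the current best
            best_name, best_score = name, score
    return best_name, best_score


def toy_similarity(a: Core, b: Core) -> float:
    """Sum of pairwise dot products (a stand-in for Equation 3)."""
    return sum(x * y for u in a for v in b for x, y in zip(u, v))


def reference_stream() -> Iterator[Tuple[str, Core]]:
    """Yield reference images one at a time, as if read from a database."""
    yield "dumbbell", [[0.0, 1.0]]
    yield "tennis", [[0.9, 0.1]]
    yield "running", [[0.5, 0.5]]


name, score = running_best([[1.0, 0.0]], reference_stream(), toy_similarity)
```

Compared with storing every similarity and taking the maximum afterwards, this variant needs only constant extra memory, at the cost of not retaining the individual scores.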

The action recognition method according to an embodiment of the present invention may also be implemented in the form of a computer-readable recording medium on which a program for executing the method on a computer is recorded. The computer-readable recording medium may be any available medium that can be accessed by a computer, and may include volatile and nonvolatile media as well as removable and non-removable media. The computer-readable recording medium may also include a computer storage medium. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The exemplary modules and steps of the embodiments described herein, or combinations thereof, may be implemented by electronic hardware (a digital design produced by coding or the like), software (applications of various forms containing program instructions), or a combination of these. Whether a given implementation takes the form of hardware and/or software may depend on the design constraints imposed on the user terminal.

One or more of the components described herein may be stored in a memory as computer program instructions, and these instructions may carry out the methods described herein, executed primarily by a digital signal processor. The connections between components specified with reference to the accompanying drawings are merely examples; at least some of them may be omitted, and additional components may of course be included in addition to those shown.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of the present invention.

10: action recognition device
100: input module
200: database unit
300: feature extraction module
400: verification module
500: training module

Claims

An action recognition method comprising:
(a) receiving an analysis target image and a first reference image from among a plurality of reference images;
(b) extracting features of the analysis target image and features of the first reference image frame by frame using an artificial neural network;
(c) extracting, for each of the analysis target image and the first reference image, core features that are a subset of the extracted features; and
(d) calculating a similarity between the extracted core features.

The method of claim 1, further comprising, after step (d):
(e) receiving a second reference image, different from the first reference image, from among the plurality of reference images;
(f) calculating a similarity between the features of the analysis target image and the features of the second reference image; and
(g) outputting whichever of the first and second reference images has the greater calculated similarity.

The method of claim 2, further comprising, after step (g),
calculating a similarity between the features of the analysis target image and the features of one of the remaining reference images among the plurality of reference images, and outputting the reference image having the greatest similarity.

The method of claim 1, further comprising, after step (d),
repeating steps (b), (c) and (d) for the remaining reference images other than the first reference image, and outputting the reference image having the greatest similarity among the calculated similarities.

The method of claim 1, further comprising a training step before step (a),
wherein the training step comprises:
receiving an anchor image including action information, a positive image including the same action information as the anchor image, and a negative image including action information different from that of the anchor image;
calculating a similarity between the anchor image and the positive image and a similarity between the anchor image and the negative image; and
calculating a loss function using the calculated similarities.

The method of claim 1, wherein, in step (a),
the analysis target image is input from the outside, and
the reference image is input from a database unit in which the reference image is stored in advance.

The method of claim 1, wherein step (b) comprises:
generating a plurality of feature maps for each frame;
grouping feature maps of the same size among the feature maps into a block; and
extracting features such that the feature map of each layer included in one block receives the output values of the feature maps of all layers placed before it.

The method of claim 1, wherein step (c) comprises
extracting the core features using a K-means clustering algorithm.

The method of claim 1, wherein step (d) comprises
calculating the similarity by selecting, from among the extracted core features, only the core features belonging to the sections of the analysis target image and the first reference image in which an action occurs.

A computer-readable recording medium recording a program for performing the method according to any one of claims 1 to 9 on a computer.

An action recognition device comprising:
an input module for receiving an analysis target image and a reference image;
a feature extraction module for extracting features of the received analysis target image and features of the received reference image frame by frame using an artificial neural network, and for extracting, for each image, core features that are a subset of the extracted features; and
a verification module that calculates a similarity between the extracted core features and outputs the reference image having the greatest similarity among the reference images for which the similarity is calculated.

The device of claim 11,
further comprising a database unit for storing the reference image.

The device of claim 11,
further comprising a training module that receives a similarity between an anchor image and a positive image and a similarity between the anchor image and a negative image, the similarities being calculated from the anchor image, which includes action information, the positive image, which includes the same action information as the anchor image, and the negative image, which includes action information different from that of the anchor image,
and that trains the action recognition method of claim 1 by calculating a loss function using the received similarities.

The device of claim 11, wherein the input module
receives the analysis target image from the outside, and
receives the reference image from the database unit.

The device of claim 11, wherein the feature extraction module
generates a plurality of feature maps for each frame, groups feature maps of the same size among the feature maps into a block, and extracts features such that the feature map of each layer included in one block receives the output values of the feature maps of all layers placed before it.

The device of claim 11, wherein the feature extraction module
extracts the core features using a K-means clustering algorithm.

The device of claim 11, wherein the verification module
calculates the similarity by selecting, from among the extracted core features, only the core features belonging to the sections of the analysis target image and the reference image in which an action occurs.